Apache Hudi - 1 - quick-start-guide (Quick Start)

quick-start-guide

    • Preface
    • Spark shell setup
      • Error when launching PySpark with Hudi
      • Successfully launching PySpark with Hudi
    • Running the code from IDEA
      • Insert data (creates the table if absent; upsert)
      • Query data (current snapshot)
      • Time travel query (historical versions)
      • Update(Append)
      • Incremental query
      • Point in time query (data as of a given instant)
      • Delete
      • (Extra) Pure-overwrite insert: data is not updated
    • FAQ
      • 'JavaPackage' object is not callable
        • ERROR INFO
        • Solution
      • java.lang.ClassNotFoundException: hudi.DefaultSource
        • ERROR INFO
        • Solution
      • Data must have been written before performing the update operation
        • ERROR INFO
        • Solution
    • References

Preface

This post follows the quick-start-guide section of the official Hudi documentation: https://hudi.apache.org/docs/quick-start-guide (simply copying the docs did not work for me out of the box; I hit a few small pitfalls along the way).

Please strictly follow the jar version constraints:

Hudi              Supported Spark 3 version
0.10.0            3.1.x (default build), 3.0.x
0.7.0 - 0.9.0     3.0.x
0.6.0 and prior   not supported

Spark shell setup

This step follows the official documentation; the only catch was that I had not downloaded the jars beforehand, which turned out to be a minor pitfall.
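
For reference, this is the launch command used throughout this post (the coordinates match the Spark 3.1.x row of the version table above; it is the same command that reappears in the error log below):

pyspark --packages org.apache.hudi:hudi-spark3.1.2-bundle_2.12:0.10.1,org.apache.spark:spark-avro_2.12:3.1.2 --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'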

Error when launching PySpark with Hudi

ERROR INFO:

(venv) gavin@GavindeMacBook-Pro test % which python3.8
/Users/gavin/PycharmProjects/pythonProject/venv/bin/python3.8
(venv) gavin@GavindeMacBook-Pro test % export PYSPARK_PYTHON=$(which python3.8)
(venv) gavin@GavindeMacBook-Pro test % echo ${PYSPARK_PYTHON}
/Users/gavin/PycharmProjects/pythonProject/venv/bin/python3.8
(venv) gavin@GavindeMacBook-Pro test % pyspark --packages org.apache.hudi:hudi-spark3.1.2-bundle_2.12:0.10.1,org.apache.spark:spark-avro_2.12:3.1.2 --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
Python 3.8.9 (default, Oct 26 2021, 07:25:54) 
[Clang 13.0.0 (clang-1300.0.29.30)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
:: loading settings :: url = jar:file:/Users/gavin/PycharmProjects/pythonProject/venv/lib/python3.8/site-packages/pyspark/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
Ivy Default Cache set to: /Users/gavin/.ivy2/cache
The jars for the packages stored in: /Users/gavin/.ivy2/jars
org.apache.hudi#hudi-spark3.1.2-bundle_2.12 added as a dependency
org.apache.spark#spark-avro_2.12 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-311b5800-4168-498b-987f-f714233cf50c;1.0
        confs: [default]
:: resolution report :: resolve 614199ms :: artifacts dl 0ms
        :: modules in use:
        ---------------------------------------------------------------------
        |                  |            modules            ||   artifacts   |
        |       conf       | number| search|dwnlded|evicted|| number|dwnlded|
        ---------------------------------------------------------------------
        |      default     |   2   |   0   |   0   |   0   ||   0   |   0   |
        ---------------------------------------------------------------------

:: problems summary ::
:::: WARNINGS
                module not found: org.apache.hudi#hudi-spark3.1.2-bundle_2.12;0.10.1

        ==== local-m2-cache: tried

          file:/Users/gavin/.m2/repository/org/apache/hudi/hudi-spark3.1.2-bundle_2.12/0.10.1/hudi-spark3.1.2-bundle_2.12-0.10.1.pom

          -- artifact org.apache.hudi#hudi-spark3.1.2-bundle_2.12;0.10.1!hudi-spark3.1.2-bundle_2.12.jar:

          file:/Users/gavin/.m2/repository/org/apache/hudi/hudi-spark3.1.2-bundle_2.12/0.10.1/hudi-spark3.1.2-bundle_2.12-0.10.1.jar

        ==== local-ivy-cache: tried

          /Users/gavin/.ivy2/local/org.apache.hudi/hudi-spark3.1.2-bundle_2.12/0.10.1/ivys/ivy.xml

          -- artifact org.apache.hudi#hudi-spark3.1.2-bundle_2.12;0.10.1!hudi-spark3.1.2-bundle_2.12.jar:

          /Users/gavin/.ivy2/local/org.apache.hudi/hudi-spark3.1.2-bundle_2.12/0.10.1/jars/hudi-spark3.1.2-bundle_2.12.jar

        ==== central: tried

          https://repo1.maven.org/maven2/org/apache/hudi/hudi-spark3.1.2-bundle_2.12/0.10.1/hudi-spark3.1.2-bundle_2.12-0.10.1.pom

          -- artifact org.apache.hudi#hudi-spark3.1.2-bundle_2.12;0.10.1!hudi-spark3.1.2-bundle_2.12.jar:

          https://repo1.maven.org/maven2/org/apache/hudi/hudi-spark3.1.2-bundle_2.12/0.10.1/hudi-spark3.1.2-bundle_2.12-0.10.1.jar

        ==== spark-packages: tried

          https://repos.spark-packages.org/org/apache/hudi/hudi-spark3.1.2-bundle_2.12/0.10.1/hudi-spark3.1.2-bundle_2.12-0.10.1.pom

          -- artifact org.apache.hudi#hudi-spark3.1.2-bundle_2.12;0.10.1!hudi-spark3.1.2-bundle_2.12.jar:

          https://repos.spark-packages.org/org/apache/hudi/hudi-spark3.1.2-bundle_2.12/0.10.1/hudi-spark3.1.2-bundle_2.12-0.10.1.jar

                module not found: org.apache.spark#spark-avro_2.12;3.1.2

        ==== local-m2-cache: tried

          file:/Users/gavin/.m2/repository/org/apache/spark/spark-avro_2.12/3.1.2/spark-avro_2.12-3.1.2.pom

          -- artifact org.apache.spark#spark-avro_2.12;3.1.2!spark-avro_2.12.jar:

          file:/Users/gavin/.m2/repository/org/apache/spark/spark-avro_2.12/3.1.2/spark-avro_2.12-3.1.2.jar

        ==== local-ivy-cache: tried

          /Users/gavin/.ivy2/local/org.apache.spark/spark-avro_2.12/3.1.2/ivys/ivy.xml

          -- artifact org.apache.spark#spark-avro_2.12;3.1.2!spark-avro_2.12.jar:

          /Users/gavin/.ivy2/local/org.apache.spark/spark-avro_2.12/3.1.2/jars/spark-avro_2.12.jar

        ==== central: tried

          https://repo1.maven.org/maven2/org/apache/spark/spark-avro_2.12/3.1.2/spark-avro_2.12-3.1.2.pom

          -- artifact org.apache.spark#spark-avro_2.12;3.1.2!spark-avro_2.12.jar:

          https://repo1.maven.org/maven2/org/apache/spark/spark-avro_2.12/3.1.2/spark-avro_2.12-3.1.2.jar

        ==== spark-packages: tried

          https://repos.spark-packages.org/org/apache/spark/spark-avro_2.12/3.1.2/spark-avro_2.12-3.1.2.pom

          -- artifact org.apache.spark#spark-avro_2.12;3.1.2!spark-avro_2.12.jar:

          https://repos.spark-packages.org/org/apache/spark/spark-avro_2.12/3.1.2/spark-avro_2.12-3.1.2.jar

                ::::::::::::::::::::::::::::::::::::::::::::::

                ::          UNRESOLVED DEPENDENCIES         ::

                ::::::::::::::::::::::::::::::::::::::::::::::

                :: org.apache.hudi#hudi-spark3.1.2-bundle_2.12;0.10.1: not found

                :: org.apache.spark#spark-avro_2.12;3.1.2: not found

                ::::::::::::::::::::::::::::::::::::::::::::::


:::: ERRORS
        Server access error at url https://repo1.maven.org/maven2/org/apache/hudi/hudi-spark3.1.2-bundle_2.12/0.10.1/hudi-spark3.1.2-bundle_2.12-0.10.1.pom (java.net.ConnectException: Operation timed out (Connection timed out))

        Server access error at url https://repo1.maven.org/maven2/org/apache/hudi/hudi-spark3.1.2-bundle_2.12/0.10.1/hudi-spark3.1.2-bundle_2.12-0.10.1.jar (java.net.ConnectException: Operation timed out (Connection timed out))

        Server access error at url https://repos.spark-packages.org/org/apache/hudi/hudi-spark3.1.2-bundle_2.12/0.10.1/hudi-spark3.1.2-bundle_2.12-0.10.1.pom (java.net.ConnectException: Operation timed out (Connection timed out))

        Server access error at url https://repos.spark-packages.org/org/apache/hudi/hudi-spark3.1.2-bundle_2.12/0.10.1/hudi-spark3.1.2-bundle_2.12-0.10.1.jar (java.net.ConnectException: Operation timed out (Connection timed out))

        Server access error at url https://repo1.maven.org/maven2/org/apache/spark/spark-avro_2.12/3.1.2/spark-avro_2.12-3.1.2.pom (java.net.ConnectException: Operation timed out (Connection timed out))

        Server access error at url https://repo1.maven.org/maven2/org/apache/spark/spark-avro_2.12/3.1.2/spark-avro_2.12-3.1.2.jar (java.net.ConnectException: Operation timed out (Connection timed out))

        Server access error at url https://repos.spark-packages.org/org/apache/spark/spark-avro_2.12/3.1.2/spark-avro_2.12-3.1.2.pom (java.net.ConnectException: Operation timed out (Connection timed out))

        Server access error at url https://repos.spark-packages.org/org/apache/spark/spark-avro_2.12/3.1.2/spark-avro_2.12-3.1.2.jar (java.net.ConnectException: Operation timed out (Connection timed out))


:: USE VERBOSE OR DEBUG MESSAGE LEVEL FOR MORE DETAILS
Exception in thread "main" java.lang.RuntimeException: [unresolved dependency: org.apache.hudi#hudi-spark3.1.2-bundle_2.12;0.10.1: not found, unresolved dependency: org.apache.spark#spark-avro_2.12;3.1.2: not found]
        at org.apache.spark.deploy.SparkSubmitUtils$.resolveMavenCoordinates(SparkSubmit.scala:1429)
        at org.apache.spark.deploy.DependencyUtils$.resolveMavenDependencies(DependencyUtils.scala:54)
        at org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:308)
        at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:894)
        at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
        at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
        at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
        at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1039)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1048)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Traceback (most recent call last):
  File "/Users/gavin/PycharmProjects/pythonProject/venv/lib/python3.8/site-packages/pyspark/python/pyspark/shell.py", line 35, in <module>
    SparkContext._ensure_initialized()  # type: ignore
  File "/Users/gavin/PycharmProjects/pythonProject/venv/lib/python3.8/site-packages/pyspark/context.py", line 331, in _ensure_initialized
    SparkContext._gateway = gateway or launch_gateway(conf)
  File "/Users/gavin/PycharmProjects/pythonProject/venv/lib/python3.8/site-packages/pyspark/java_gateway.py", line 108, in launch_gateway
    raise Exception("Java gateway process exited before sending its port number")
Exception: Java gateway process exited before sending its port number
>>> exit()

From the log above you can see that Ivy also searches the local Maven repository for the jars, so I used Maven to download the two required jars (for Spark 3.1.2):

        <dependency>
            <groupId>org.apache.hudi</groupId>
            <artifactId>hudi-spark3.1.2-bundle_2.12</artifactId>
            <version>0.10.1</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-avro_2.12</artifactId>
            <version>3.1.2</version>
        </dependency>

You can also download them directly from the repository website:

Hudi jar download:

https://repo1.maven.org/maven2/org/apache/hudi/hudi-spark3.1.2-bundle_2.12/0.10.1/

https://repo1.maven.org/maven2/org/apache/hudi/hudi/

After the Maven download finishes, you can find the jars in the local repository: /Users/gavin/.m2/repository/org/apache/hudi/…
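
As an alternative (a sketch, assuming the jar paths under the Ivy cache that are also used later in this post), you can skip --packages entirely and hand the downloaded jars to pyspark with --jars:

pyspark --jars /Users/gavin/.ivy2/cache/org.apache.hudi/hudi-spark3.1.2-bundle_2.12/jars/hudi-spark3.1.2-bundle_2.12-0.10.1.jar,/Users/gavin/.ivy2/cache/org.apache.spark/spark-avro_2.12/jars/spark-avro_2.12-3.1.2.jar --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'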

Successfully launching PySpark with Hudi

(venv) gavin@GavindeMacBook-Pro test % pyspark --packages org.apache.hudi:hudi-spark3.1.2-bundle_2.12:0.10.1,org.apache.spark:spark-avro_2.12:3.1.2 --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
Python 3.8.9 (default, Oct 26 2021, 07:25:54) 
[Clang 13.0.0 (clang-1300.0.29.30)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
22/03/01 10:20:28 WARN Utils: Your hostname, GavindeMacBook-Pro.local resolves to a loopback address: 127.0.0.1; using 192.168.24.227 instead (on interface en0)
22/03/01 10:20:28 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
:: loading settings :: url = jar:file:/Users/gavin/PycharmProjects/pythonProject/venv/lib/python3.8/site-packages/pyspark/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
Ivy Default Cache set to: /Users/gavin/.ivy2/cache
The jars for the packages stored in: /Users/gavin/.ivy2/jars
org.apache.hudi#hudi-spark3.1.2-bundle_2.12 added as a dependency
org.apache.spark#spark-avro_2.12 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-9a87dae7-3c6a-4133-838b-c7050b1d8b89;1.0
        confs: [default]
        found org.apache.hudi#hudi-spark3.1.2-bundle_2.12;0.10.1 in local-m2-cache
        found org.apache.spark#spark-avro_2.12;3.1.2 in local-m2-cache
        found org.spark-project.spark#unused;1.0.0 in local-m2-cache
downloading file:/Users/gavin/.m2/repository/org/apache/hudi/hudi-spark3.1.2-bundle_2.12/0.10.1/hudi-spark3.1.2-bundle_2.12-0.10.1.jar ...
        [SUCCESSFUL ] org.apache.hudi#hudi-spark3.1.2-bundle_2.12;0.10.1!hudi-spark3.1.2-bundle_2.12.jar (54ms)
downloading file:/Users/gavin/.m2/repository/org/apache/spark/spark-avro_2.12/3.1.2/spark-avro_2.12-3.1.2.jar ...
        [SUCCESSFUL ] org.apache.spark#spark-avro_2.12;3.1.2!spark-avro_2.12.jar (2ms)
downloading file:/Users/gavin/.m2/repository/org/spark-project/spark/unused/1.0.0/unused-1.0.0.jar ...
        [SUCCESSFUL ] org.spark-project.spark#unused;1.0.0!unused.jar (2ms)
:: resolution report :: resolve 6622ms :: artifacts dl 62ms
        :: modules in use:
        org.apache.hudi#hudi-spark3.1.2-bundle_2.12;0.10.1 from local-m2-cache in [default]
        org.apache.spark#spark-avro_2.12;3.1.2 from local-m2-cache in [default]
        org.spark-project.spark#unused;1.0.0 from local-m2-cache in [default]
        ---------------------------------------------------------------------
        |                  |            modules            ||   artifacts   |
        |       conf       | number| search|dwnlded|evicted|| number|dwnlded|
        ---------------------------------------------------------------------
        |      default     |   3   |   3   |   3   |   0   ||   3   |   3   |
        ---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent-9a87dae7-3c6a-4133-838b-c7050b1d8b89
        confs: [default]
        3 artifacts copied, 0 already retrieved (38092kB/67ms)
22/03/01 10:20:35 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.1.2
      /_/

Using Python version 3.8.9 (default, Oct 26 2021 07:25:54)
Spark context Web UI available at http://192.168.24.227:4040
Spark context available as 'sc' (master = local[*], app id = local-1646101237379).
SparkSession available as 'spark'.
>>> 

Running the code from IDEA

Insert data (creates the table if absent; upsert)

This example demonstrates an "upsert" style insert, i.e. "update the record if a matching one exists, otherwise insert it". Which write type is used is controlled by the option "hoodie.datasource.write.operation"; see https://hudi.apache.org/docs/configurations for the details of that parameter.

Following the official documentation (https://hudi.apache.org/docs/quick-start-guide), the code looks like this:

import pyspark

if __name__ == '__main__':
    builder = pyspark.sql.SparkSession.builder.appName("MyApp") \
        .config("spark.jars", "/Users/gavin/.ivy2/cache/org.apache.hudi/hudi-spark3.1.2-bundle_2.12/jars/hudi-spark3.1.2-bundle_2.12-0.10.1.jar,"
                              "/Users/gavin/.ivy2/cache/org.apache.spark/spark-avro_2.12/jars/spark-avro_2.12-3.1.2.jar") \
        .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")

    spark = builder.getOrCreate()
    sc = spark.sparkContext

    # pyspark
    tableName = "hudi_trips_cow"
    basePath = "file:///tmp/hudi_trips_cow"
    dataGen = sc._jvm.org.apache.hudi.QuickstartUtils.DataGenerator()

    # pyspark
    inserts = sc._jvm.org.apache.hudi.QuickstartUtils.convertToStringList(dataGen.generateInserts(10))
    df = spark.read.json(spark.sparkContext.parallelize(inserts, 2))

    hudi_options = {
        'hoodie.table.name': tableName,
        'hoodie.datasource.write.recordkey.field': 'uuid',
        'hoodie.datasource.write.partitionpath.field': 'partitionpath',
        'hoodie.datasource.write.table.name': tableName,
        'hoodie.datasource.write.operation': 'upsert',
        'hoodie.datasource.write.precombine.field': 'ts',
        'hoodie.upsert.shuffle.parallelism': 2,
        'hoodie.insert.shuffle.parallelism': 2
    }

    df.write.format("hudi"). \
        options(**hudi_options). \
        mode("overwrite"). \
        save(basePath)

PS: the inserted rows are sample data generated by "org.apache.hudi.QuickstartUtils.DataGenerator()" (the official example does exactly this; see the query section below for what the data looks like).
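
If you want to peek at the generated records before writing them, here is a minimal sketch (it reuses spark, sc and dataGen from the code above):

# Generate another small batch of sample records and inspect them without writing to Hudi
sample = sc._jvm.org.apache.hudi.QuickstartUtils.convertToStringList(dataGen.generateInserts(3))
sample_df = spark.read.json(spark.sparkContext.parallelize(sample, 2))
sample_df.select("uuid", "partitionpath", "rider", "driver", "fare", "ts").show(truncate=False)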

After running the code you get the following directory layout (a classic partitioned-table structure):

gavin@GavindeMacBook-Pro apache % tree /tmp/hudi_trips_cow 
/tmp/hudi_trips_cow
├── americas
│   ├── brazil
│   │   └── sao_paulo
│   │       └── 6f82f351-9994-459d-a20c-77baa91ad323-0_0-27-31_20220301105108074.parquet
│   └── united_states
│       └── san_francisco
│           └── 52a5ee08-9376-4954-bb8f-f7f519b8b40e-0_1-33-32_20220301105108074.parquet
└── asia
    └── india
        └── chennai
            └── 2f5b659d-3738-48ca-b590-bbce52e98642-0_2-33-33_20220301105108074.parquet

8 directories, 3 files
gavin@GavindeMacBook-Pro apache % 

Extension

Besides the data files themselves, Hudi also keeps a hidden metadata directory, ".hoodie"; its contents are discussed later:

gavin@GavindeMacBook-Pro hudi_trips_cow % ll -a
total 0
drwxr-xr-x   5 gavin  wheel  160 Mar  1 10:51 .
drwxrwxrwt  10 root   wheel  320 Mar  1 11:26 ..
drwxr-xr-x  13 gavin  wheel  416 Mar  1 10:51 .hoodie
drwxr-xr-x   4 gavin  wheel  128 Mar  1 10:51 americas
drwxr-xr-x   3 gavin  wheel   96 Mar  1 10:51 asia
gavin@GavindeMacBook-Pro hudi_trips_cow % tree .hoodie 
.hoodie
├── 20220301105108074.commit
├── 20220301105108074.commit.requested
├── 20220301105108074.inflight
├── archived
└── hoodie.properties
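
To list those timeline files from Python instead of the shell, a minimal sketch (assuming the same local table path /tmp/hudi_trips_cow):

import os

# List the Hudi timeline files (commit, commit.requested, inflight, ...) under the metadata dir
hoodie_dir = "/tmp/hudi_trips_cow/.hoodie"
for name in sorted(os.listdir(hoodie_dir)):
    print(name)  # e.g. 20220301105108074.commit, archived, hoodie.properties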

Query data (current snapshot)

Query code

import pyspark

if __name__ == '__main__':
    builder = pyspark.sql.SparkSession.builder.appName("MyApp") \
        .config("spark.jars",
                "/Users/gavin/.ivy2/cache/org.apache.hudi/hudi-spark3.1.2-bundle_2.12/jars/hudi-spark3.1.2-bundle_2.12-0.10.1.jar,"
                "/Users/gavin/.ivy2/cache/org.apache.spark/spark-avro_2.12/jars/spark-avro_2.12-3.1.2.jar") \
        .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")

    spark = builder.getOrCreate()
    sc = spark.sparkContext
    basePath = "file:///tmp/hudi_trips_cow"

    # pyspark
    tripsSnapshotDF = spark. \
        read. \
        format("hudi"). \
        load(basePath)
    # load(basePath) use "/partitionKey=partitionValue" folder structure for Spark auto partition discovery

    count = tripsSnapshotDF.count()
    print(f'========hudi_trips_snapshot contains [{count}] rows')
    print('The schema is:')
    tripsSnapshotDF.printSchema()

    tripsSnapshotDF.show()

    tripsSnapshotDF.createOrReplaceTempView("hudi_trips_snapshot")

    spark.sql("select fare, begin_lon, begin_lat, ts from  hudi_trips_snapshot where fare > 20.0").show()
    spark.sql(
        "select _hoodie_commit_time, _hoodie_record_key, _hoodie_partition_path, rider, driver, fare from  hudi_trips_snapshot").show()

Query result

/Users/gavin/PycharmProjects/pythonProject/venv/bin/python /Users/gavin/PycharmProjects/pythonProject/venv/spark/hudi/basic_query.py
22/03/01 11:18:26 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
22/03/01 11:18:27 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
========hudi_trips_snapshot contains [10] rows
The schema is:
root
 |-- _hoodie_commit_time: string (nullable = true)
 |-- _hoodie_commit_seqno: string (nullable = true)
 |-- _hoodie_record_key: string (nullable = true)
 |-- _hoodie_partition_path: string (nullable = true)
 |-- _hoodie_file_name: string (nullable = true)
 |-- begin_lat: double (nullable = true)
 |-- begin_lon: double (nullable = true)
 |-- driver: string (nullable = true)
 |-- end_lat: double (nullable = true)
 |-- end_lon: double (nullable = true)
 |-- fare: double (nullable = true)
 |-- rider: string (nullable = true)
 |-- ts: long (nullable = true)
 |-- uuid: string (nullable = true)
 |-- partitionpath: string (nullable = true)

+-------------------+--------------------+--------------------+----------------------+--------------------+-------------------+-------------------+----------+-------------------+-------------------+------------------+---------+-------------+--------------------+--------------------+
|_hoodie_commit_time|_hoodie_commit_seqno|  _hoodie_record_key|_hoodie_partition_path|   _hoodie_file_name|          begin_lat|          begin_lon|    driver|            end_lat|            end_lon|              fare|    rider|           ts|                uuid|       partitionpath|
+-------------------+--------------------+--------------------+----------------------+--------------------+-------------------+-------------------+----------+-------------------+-------------------+------------------+---------+-------------+--------------------+--------------------+
|  20220301105108074|20220301105108074...|c4340e2c-efd2-4a9...|  americas/united_s...|52a5ee08-9376-495...|0.21624150367601136|0.14285051259466197|driver-213| 0.5890949624813784| 0.0966823831927115| 93.56018115236618|rider-213|1645736908516|c4340e2c-efd2-4a9...|americas/united_s...|
|  20220301105108074|20220301105108074...|67ee90ec-d7b8-477...|  americas/united_s...|52a5ee08-9376-495...| 0.5731835407930634| 0.4923479652912024|driver-213|0.08988581780930216|0.42520899698713666| 64.27696295884016|rider-213|1645565690012|67ee90ec-d7b8-477...|americas/united_s...|
|  20220301105108074|20220301105108074...|91703076-f580-49f...|  americas/united_s...|52a5ee08-9376-495...|0.11488393157088261| 0.6273212202489661|driver-213| 0.7454678537511295| 0.3954939864908973| 27.79478688582596|rider-213|1646031306513|91703076-f580-49f...|americas/united_s...|
|  20220301105108074|20220301105108074...|96a7571e-1e54-4bc...|  americas/united_s...|52a5ee08-9376-495...| 0.8742041526408587| 0.7528268153249502|driver-213| 0.9197827128888302|  0.362464770874404|19.179139106643607|rider-213|1645796169470|96a7571e-1e54-4bc...|americas/united_s...|
|  20220301105108074|20220301105108074...|3723b4ac-8841-4cd...|  americas/united_s...|52a5ee08-9376-495...| 0.1856488085068272| 0.9694586417848392|driver-213|0.38186367037201974|0.25252652214479043| 33.92216483948643|rider-213|1646085368961|3723b4ac-8841-4cd...|americas/united_s...|
|  20220301105108074|20220301105108074...|b3bf0b93-768d-4be...|  americas/brazil/s...|6f82f351-9994-459...| 0.6100070562136587| 0.8779402295427752|driver-213| 0.3407870505929602| 0.5030798142293655|  43.4923811219014|rider-213|1645868768394|b3bf0b93-768d-4be...|americas/brazil/s...|
|  20220301105108074|20220301105108074...|7e195e8d-c6df-4fd...|  americas/brazil/s...|6f82f351-9994-459...| 0.4726905879569653|0.46157858450465483|driver-213|  0.754803407008858| 0.9671159942018241|34.158284716382845|rider-213|1645602479789|7e195e8d-c6df-4fd...|americas/brazil/s...|
|  20220301105108074|20220301105108074...|3409ecd2-02c2-40c...|  americas/brazil/s...|6f82f351-9994-459...| 0.0750588760043035|0.03844104444445928|driver-213|0.04376353354538354| 0.6346040067610669| 66.62084366450246|rider-213|1645621352954|3409ecd2-02c2-40c...|americas/brazil/s...|
|  20220301105108074|20220301105108074...|60903a00-3fdc-45d...|    asia/india/chennai|2f5b659d-3738-48c...|  0.651058505660742| 0.8192868687714224|driver-213|0.20714896002914462|0.06224031095826987| 41.06290929046368|rider-213|1645503078006|60903a00-3fdc-45d...|  asia/india/chennai|
|  20220301105108074|20220301105108074...|22d1507b-7d02-402...|    asia/india/chennai|2f5b659d-3738-48c...|   0.40613510977307| 0.5644092139040959|driver-213|  0.798706304941517|0.02698359227182834|17.851135255091155|rider-213|1645948641664|22d1507b-7d02-402...|  asia/india/chennai|
+-------------------+--------------------+--------------------+----------------------+--------------------+-------------------+-------------------+----------+-------------------+-------------------+------------------+---------+-------------+--------------------+--------------------+

+------------------+-------------------+-------------------+-------------+
|              fare|          begin_lon|          begin_lat|           ts|
+------------------+-------------------+-------------------+-------------+
| 93.56018115236618|0.14285051259466197|0.21624150367601136|1645736908516|
| 64.27696295884016| 0.4923479652912024| 0.5731835407930634|1645565690012|
| 27.79478688582596| 0.6273212202489661|0.11488393157088261|1646031306513|
| 33.92216483948643| 0.9694586417848392| 0.1856488085068272|1646085368961|
|  43.4923811219014| 0.8779402295427752| 0.6100070562136587|1645868768394|
|34.158284716382845|0.46157858450465483| 0.4726905879569653|1645602479789|
| 66.62084366450246|0.03844104444445928| 0.0750588760043035|1645621352954|
| 41.06290929046368| 0.8192868687714224|  0.651058505660742|1645503078006|
+------------------+-------------------+-------------------+-------------+

+-------------------+--------------------+----------------------+---------+----------+------------------+
|_hoodie_commit_time|  _hoodie_record_key|_hoodie_partition_path|    rider|    driver|              fare|
+-------------------+--------------------+----------------------+---------+----------+------------------+
|  20220301105108074|c4340e2c-efd2-4a9...|  americas/united_s...|rider-213|driver-213| 93.56018115236618|
|  20220301105108074|67ee90ec-d7b8-477...|  americas/united_s...|rider-213|driver-213| 64.27696295884016|
|  20220301105108074|91703076-f580-49f...|  americas/united_s...|rider-213|driver-213| 27.79478688582596|
|  20220301105108074|96a7571e-1e54-4bc...|  americas/united_s...|rider-213|driver-213|19.179139106643607|
|  20220301105108074|3723b4ac-8841-4cd...|  americas/united_s...|rider-213|driver-213| 33.92216483948643|
|  20220301105108074|b3bf0b93-768d-4be...|  americas/brazil/s...|rider-213|driver-213|  43.4923811219014|
|  20220301105108074|7e195e8d-c6df-4fd...|  americas/brazil/s...|rider-213|driver-213|34.158284716382845|
|  20220301105108074|3409ecd2-02c2-40c...|  americas/brazil/s...|rider-213|driver-213| 66.62084366450246|
|  20220301105108074|60903a00-3fdc-45d...|    asia/india/chennai|rider-213|driver-213| 41.06290929046368|
|  20220301105108074|22d1507b-7d02-402...|    asia/india/chennai|rider-213|driver-213|17.851135255091155|
+-------------------+--------------------+----------------------+---------+----------+------------------+


Process finished with exit code 0

Time travel query (historical versions)

Query code

import pyspark

if __name__ == '__main__':
    builder = pyspark.sql.SparkSession.builder.appName("MyApp") \
        .config("spark.jars",
                "/Users/gavin/.ivy2/cache/org.apache.hudi/hudi-spark3.1.2-bundle_2.12/jars/hudi-spark3.1.2-bundle_2.12-0.10.1.jar,"
                "/Users/gavin/.ivy2/cache/org.apache.spark/spark-avro_2.12/jars/spark-avro_2.12-3.1.2.jar") \
        .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")

    spark = builder.getOrCreate()
    sc = spark.sparkContext
    basePath = "file:///tmp/hudi_trips_cow"

    # Three ways of specifying the as-of instant are shown below

    # pyspark
    spark.read. \
        format("hudi"). \
        option("as.of.instant", "20210728141108"). \
        load(basePath).show()

    spark.read. \
        format("hudi"). \
        option("as.of.instant", "2022-02-28 14:11:08.000"). \
        load(basePath).show()

    # Equivalent to "as.of.instant = 2022-07-28 00:00:00"
    spark.read. \
        format("hudi"). \
        option("as.of.instant", "2022-07-28"). \
        load(basePath).show()

PS: the data was written at instant "20220301105108074" (2022-03-01), so the first two queries (as of 2021-07-28 and 2022-02-28) return nothing and only the last one (as of 2022-07-28) returns rows.

Query result

/Users/gavin/PycharmProjects/pythonProject/venv/bin/python /Users/gavin/PycharmProjects/pythonProject/venv/spark/hudi/basic_query_time_travel.py
22/03/01 11:30:09 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
22/03/01 11:30:10 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
+-------------------+--------------------+------------------+----------------------+-----------------+---------+---------+------+-------+-------+----+-----+---+----+-------------+
|_hoodie_commit_time|_hoodie_commit_seqno|_hoodie_record_key|_hoodie_partition_path|_hoodie_file_name|begin_lat|begin_lon|driver|end_lat|end_lon|fare|rider| ts|uuid|partitionpath|
+-------------------+--------------------+------------------+----------------------+-----------------+---------+---------+------+-------+-------+----+-----+---+----+-------------+
+-------------------+--------------------+------------------+----------------------+-----------------+---------+---------+------+-------+-------+----+-----+---+----+-------------+

+-------------------+--------------------+------------------+----------------------+-----------------+---------+---------+------+-------+-------+----+-----+---+----+-------------+
|_hoodie_commit_time|_hoodie_commit_seqno|_hoodie_record_key|_hoodie_partition_path|_hoodie_file_name|begin_lat|begin_lon|driver|end_lat|end_lon|fare|rider| ts|uuid|partitionpath|
+-------------------+--------------------+------------------+----------------------+-----------------+---------+---------+------+-------+-------+----+-----+---+----+-------------+
+-------------------+--------------------+------------------+----------------------+-----------------+---------+---------+------+-------+-------+----+-----+---+----+-------------+

+-------------------+--------------------+--------------------+----------------------+--------------------+-------------------+-------------------+----------+-------------------+-------------------+------------------+---------+-------------+--------------------+--------------------+
|_hoodie_commit_time|_hoodie_commit_seqno|  _hoodie_record_key|_hoodie_partition_path|   _hoodie_file_name|          begin_lat|          begin_lon|    driver|            end_lat|            end_lon|              fare|    rider|           ts|                uuid|       partitionpath|
+-------------------+--------------------+--------------------+----------------------+--------------------+-------------------+-------------------+----------+-------------------+-------------------+------------------+---------+-------------+--------------------+--------------------+
|  20220301105108074|20220301105108074...|c4340e2c-efd2-4a9...|  americas/united_s...|52a5ee08-9376-495...|0.21624150367601136|0.14285051259466197|driver-213| 0.5890949624813784| 0.0966823831927115| 93.56018115236618|rider-213|1645736908516|c4340e2c-efd2-4a9...|americas/united_s...|
|  20220301105108074|20220301105108074...|67ee90ec-d7b8-477...|  americas/united_s...|52a5ee08-9376-495...| 0.5731835407930634| 0.4923479652912024|driver-213|0.08988581780930216|0.42520899698713666| 64.27696295884016|rider-213|1645565690012|67ee90ec-d7b8-477...|americas/united_s...|
|  20220301105108074|20220301105108074...|91703076-f580-49f...|  americas/united_s...|52a5ee08-9376-495...|0.11488393157088261| 0.6273212202489661|driver-213| 0.7454678537511295| 0.3954939864908973| 27.79478688582596|rider-213|1646031306513|91703076-f580-49f...|americas/united_s...|
|  20220301105108074|20220301105108074...|96a7571e-1e54-4bc...|  americas/united_s...|52a5ee08-9376-495...| 0.8742041526408587| 0.7528268153249502|driver-213| 0.9197827128888302|  0.362464770874404|19.179139106643607|rider-213|1645796169470|96a7571e-1e54-4bc...|americas/united_s...|
|  20220301105108074|20220301105108074...|3723b4ac-8841-4cd...|  americas/united_s...|52a5ee08-9376-495...| 0.1856488085068272| 0.9694586417848392|driver-213|0.38186367037201974|0.25252652214479043| 33.92216483948643|rider-213|1646085368961|3723b4ac-8841-4cd...|americas/united_s...|
|  20220301105108074|20220301105108074...|b3bf0b93-768d-4be...|  americas/brazil/s...|6f82f351-9994-459...| 0.6100070562136587| 0.8779402295427752|driver-213| 0.3407870505929602| 0.5030798142293655|  43.4923811219014|rider-213|1645868768394|b3bf0b93-768d-4be...|americas/brazil/s...|
|  20220301105108074|20220301105108074...|7e195e8d-c6df-4fd...|  americas/brazil/s...|6f82f351-9994-459...| 0.4726905879569653|0.46157858450465483|driver-213|  0.754803407008858| 0.9671159942018241|34.158284716382845|rider-213|1645602479789|7e195e8d-c6df-4fd...|americas/brazil/s...|
|  20220301105108074|20220301105108074...|3409ecd2-02c2-40c...|  americas/brazil/s...|6f82f351-9994-459...| 0.0750588760043035|0.03844104444445928|driver-213|0.04376353354538354| 0.6346040067610669| 66.62084366450246|rider-213|1645621352954|3409ecd2-02c2-40c...|americas/brazil/s...|
|  20220301105108074|20220301105108074...|60903a00-3fdc-45d...|    asia/india/chennai|2f5b659d-3738-48c...|  0.651058505660742| 0.8192868687714224|driver-213|0.20714896002914462|0.06224031095826987| 41.06290929046368|rider-213|1645503078006|60903a00-3fdc-45d...|  asia/india/chennai|
|  20220301105108074|20220301105108074...|22d1507b-7d02-402...|    asia/india/chennai|2f5b659d-3738-48c...|   0.40613510977307| 0.5644092139040959|driver-213|  0.798706304941517|0.02698359227182834|17.851135255091155|rider-213|1645948641664|22d1507b-7d02-402...|  asia/india/chennai|
+-------------------+--------------------+--------------------+----------------------+--------------------+-------------------+-------------------+----------+-------------------+-------------------+------------------+---------+-------------+--------------------+--------------------+


Process finished with exit code 0

Update(Append)

Code

import pyspark

if __name__ == '__main__':
    builder = pyspark.sql.SparkSession.builder.appName("MyApp") \
        .config("spark.jars",
                "/Users/gavin/.ivy2/cache/org.apache.hudi/hudi-spark3.1.2-bundle_2.12/jars/hudi-spark3.1.2-bundle_2.12-0.10.1.jar,"
                "/Users/gavin/.ivy2/cache/org.apache.spark/spark-avro_2.12/jars/spark-avro_2.12-3.1.2.jar") \
        .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")

    spark = builder.getOrCreate()
    sc = spark.sparkContext
    basePath = "file:///tmp/hudi_trips_cow"

    dataGen = sc._jvm.org.apache.hudi.QuickstartUtils.DataGenerator()
    tableName = "hudi_trips_cow"
    hudi_options = {
        'hoodie.table.name': tableName,
        'hoodie.datasource.write.recordkey.field': 'uuid',
        'hoodie.datasource.write.partitionpath.field': 'partitionpath',
        'hoodie.datasource.write.table.name': tableName,
        'hoodie.datasource.write.operation': 'upsert',
        'hoodie.datasource.write.precombine.field': 'ts',
        'hoodie.upsert.shuffle.parallelism': 2,
        'hoodie.insert.shuffle.parallelism': 2
    }

    # Since this update script is run separately from the insert script, calling dataGen.generateUpdates(10) directly would fail; a generateInserts call has to happen first
    # Because of that separation, the update records differ from the ones you would get if insert and update lived in the same script; it does not matter much for this demo
    dataGen.generateInserts(10)

    # pyspark
    updates = sc._jvm.org.apache.hudi.QuickstartUtils.convertToStringList(dataGen.generateUpdates(10))
    df = spark.read.json(spark.sparkContext.parallelize(updates, 2))
    df.show()
    df.write.format("hudi"). \
        options(**hudi_options). \
        mode("append"). \
        save(basePath)

Run log

/Users/gavin/PycharmProjects/pythonProject/venv/bin/python /Users/gavin/PycharmProjects/pythonProject/venv/spark/hudi/basic_update.py
22/03/01 13:28:17 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
22/03/01 13:28:18 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
+--------------------+--------------------+----------+-------------------+-------------------+------------------+--------------------+---------+-------------+--------------------+
|           begin_lat|           begin_lon|    driver|            end_lat|            end_lon|              fare|       partitionpath|    rider|           ts|                uuid|
+--------------------+--------------------+----------+-------------------+-------------------+------------------+--------------------+---------+-------------+--------------------+
|  0.7340133901254792|  0.5142184937933181|driver-284| 0.7814655558162802| 0.6592596683641996|49.527694252432056|americas/united_s...|rider-284|1645635237429|0456f152-9a1b-48f...|
|  0.1593867607188556|0.010872312870502165|driver-284| 0.9808530350038475| 0.7963756520507014| 29.47661370147079|americas/brazil/s...|rider-284|1645899391613|1e34b971-3dfc-489...|
|  0.7180196467760873| 0.13755354862499358|driver-284| 0.3037264771699937| 0.2539047155055727| 86.75932789048282|americas/brazil/s...|rider-284|1645890122334|1e34b971-3dfc-489...|
|  0.6570857443423376|   0.888493603696927|driver-284| 0.9036309069576131|0.37603706507284995| 63.72504913279929|americas/brazil/s...|rider-284|1645547517087|8d784cd0-02d9-429...|
| 0.08528650347654165|  0.4006983139989222|driver-284| 0.1975324518739051|  0.908216792146506| 90.25710109008239|  asia/india/chennai|rider-284|1646095456906|bc2e551e-4206-4f0...|
| 0.18294079059016366| 0.19949323322922063|driver-284|0.24749642418050566| 0.1751761658135068|  90.9053809533154|americas/united_s...|rider-284|1645675773158|da50d4f5-94cb-41c...|
|  0.4777395067707303|  0.3349917833248327|driver-284| 0.9735699951963335| 0.8144901865212508|  98.3428192817987|americas/united_s...|rider-284|1646066699577|a24084ea-4473-459...|
|0.014159831486388885| 0.42849372303000655|driver-284| 0.9968531966280192| 0.9451993293955782| 2.375516772415698|americas/united_s...|rider-284|1645728852563|da50d4f5-94cb-41c...|
| 0.16603428449020086|  0.6999655248704163|driver-284| 0.5086437188581894| 0.6242134749327686| 9.384124531808036|  asia/india/chennai|rider-284|1645620049479|9cf010a9-7303-4c5...|
|  0.2110206104048945|  0.2783086084578943|driver-284|0.12154541219767523| 0.8700506703716298| 91.99515909032544|americas/brazil/s...|rider-284|1645773817699|1e34b971-3dfc-489...|
+--------------------+--------------------+----------+-------------------+-------------------+------------------+--------------------+---------+-------------+--------------------+

22/03/01 13:28:27 WARN DFSPropertiesConfiguration: Cannot find HUDI_CONF_DIR, please set it as the dir of hudi-defaults.conf
22/03/01 13:28:27 WARN DFSPropertiesConfiguration: Properties file file:/etc/hudi/conf/hudi-defaults.conf not found. Ignoring to load props file

Process finished with exit code 0

Directory layout after the update

  • Metadata directory:
# Before the update
gavin@GavindeMacBook-Pro .hoodie % ll
total 32
-rw-r--r--  1 gavin  wheel  4374 Mar  1 10:51 20220301105108074.commit
-rw-r--r--  1 gavin  wheel     0 Mar  1 10:51 20220301105108074.commit.requested
-rw-r--r--  1 gavin  wheel  2594 Mar  1 10:51 20220301105108074.inflight
drwxr-xr-x  2 gavin  wheel    64 Mar  1 10:51 archived
-rw-r--r--  1 gavin  wheel   600 Mar  1 10:51 hoodie.properties
# After the update
gavin@GavindeMacBook-Pro .hoodie % ll
total 56
-rw-r--r--  1 gavin  wheel  4374 Mar  1 10:51 20220301105108074.commit
-rw-r--r--  1 gavin  wheel     0 Mar  1 10:51 20220301105108074.commit.requested
-rw-r--r--  1 gavin  wheel  2594 Mar  1 10:51 20220301105108074.inflight
-rw-r--r--  1 gavin  wheel  4413 Mar  1 13:28 20220301132827300.commit
-rw-r--r--  1 gavin  wheel     0 Mar  1 13:28 20220301132827300.commit.requested
-rw-r--r--  1 gavin  wheel  2594 Mar  1 13:28 20220301132827300.inflight
drwxr-xr-x  2 gavin  wheel    64 Mar  1 10:51 archived
-rw-r--r--  1 gavin  wheel   600 Mar  1 10:51 hoodie.properties
gavin@GavindeMacBook-Pro .hoodie % 
  • Data file directories
# Before the update
gavin@GavindeMacBook-Pro apache % tree /tmp/hudi_trips_cow 
/tmp/hudi_trips_cow
├── americas
│   ├── brazil
│   │   └── sao_paulo
│   │       └── 6f82f351-9994-459d-a20c-77baa91ad323-0_0-27-31_20220301105108074.parquet
│   └── united_states
│       └── san_francisco
│           └── 52a5ee08-9376-4954-bb8f-f7f519b8b40e-0_1-33-32_20220301105108074.parquet
└── asia
    └── india
        └── chennai
            └── 2f5b659d-3738-48ca-b590-bbce52e98642-0_2-33-33_20220301105108074.parquet

8 directories, 3 files
gavin@GavindeMacBook-Pro apache % 

# After the update
gavin@GavindeMacBook-Pro hudi_trips_cow % tree ./*
./americas
├── brazil
│   └── sao_paulo
│       ├── 6f82f351-9994-459d-a20c-77baa91ad323-0_0-27-31_20220301105108074.parquet
│       └── 6f82f351-9994-459d-a20c-77baa91ad323-0_0-29-39_20220301132827300.parquet
└── united_states
    └── san_francisco
        ├── 52a5ee08-9376-4954-bb8f-f7f519b8b40e-0_1-33-32_20220301105108074.parquet
        └── 52a5ee08-9376-4954-bb8f-f7f519b8b40e-0_1-35-40_20220301132827300.parquet
./asia
└── india
    └── chennai
        ├── 2f5b659d-3738-48ca-b590-bbce52e98642-0_2-33-33_20220301105108074.parquet
        └── 2f5b659d-3738-48ca-b590-bbce52e98642-0_2-35-41_20220301132827300.parquet

6 directories, 6 files
gavin@GavindeMacBook-Pro hudi_trips_cow % 

Run one more query to check whether the data has grown:

/Users/gavin/PycharmProjects/pythonProject/venv/bin/python /Users/gavin/PycharmProjects/pythonProject/venv/spark/hudi/basic_query.py
22/03/01 13:34:55 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
22/03/01 13:34:56 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
========hudi_trips_snapshot contains [17] rows
+-------------------+--------------------+--------------------+----------------------+--------------------+--------------------+--------------------+----------+-------------------+-------------------+------------------+---------+-------------+--------------------+--------------------+
|_hoodie_commit_time|_hoodie_commit_seqno|  _hoodie_record_key|_hoodie_partition_path|   _hoodie_file_name|           begin_lat|           begin_lon|    driver|            end_lat|            end_lon|              fare|    rider|           ts|                uuid|       partitionpath|
+-------------------+--------------------+--------------------+----------------------+--------------------+--------------------+--------------------+----------+-------------------+-------------------+------------------+---------+-------------+--------------------+--------------------+
|  20220301105108074|20220301105108074...|c4340e2c-efd2-4a9...|  americas/united_s...|52a5ee08-9376-495...| 0.21624150367601136| 0.14285051259466197|driver-213| 0.5890949624813784| 0.0966823831927115| 93.56018115236618|rider-213|1645736908516|c4340e2c-efd2-4a9...|americas/united_s...|
|  20220301105108074|20220301105108074...|67ee90ec-d7b8-477...|  americas/united_s...|52a5ee08-9376-495...|  0.5731835407930634|  0.4923479652912024|driver-213|0.08988581780930216|0.42520899698713666| 64.27696295884016|rider-213|1645565690012|67ee90ec-d7b8-477...|americas/united_s...|
|  20220301105108074|20220301105108074...|91703076-f580-49f...|  americas/united_s...|52a5ee08-9376-495...| 0.11488393157088261|  0.6273212202489661|driver-213| 0.7454678537511295| 0.3954939864908973| 27.79478688582596|rider-213|1646031306513|91703076-f580-49f...|americas/united_s...|
|  20220301105108074|20220301105108074...|96a7571e-1e54-4bc...|  americas/united_s...|52a5ee08-9376-495...|  0.8742041526408587|  0.7528268153249502|driver-213| 0.9197827128888302|  0.362464770874404|19.179139106643607|rider-213|1645796169470|96a7571e-1e54-4bc...|americas/united_s...|
|  20220301105108074|20220301105108074...|3723b4ac-8841-4cd...|  americas/united_s...|52a5ee08-9376-495...|  0.1856488085068272|  0.9694586417848392|driver-213|0.38186367037201974|0.25252652214479043| 33.92216483948643|rider-213|1646085368961|3723b4ac-8841-4cd...|americas/united_s...|
|  20220301132827300|20220301132827300...|a24084ea-4473-459...|  americas/united_s...|52a5ee08-9376-495...|  0.4777395067707303|  0.3349917833248327|driver-284| 0.9735699951963335| 0.8144901865212508|  98.3428192817987|rider-284|1646066699577|a24084ea-4473-459...|americas/united_s...|
|  20220301132827300|20220301132827300...|0456f152-9a1b-48f...|  americas/united_s...|52a5ee08-9376-495...|  0.7340133901254792|  0.5142184937933181|driver-284| 0.7814655558162802| 0.6592596683641996|49.527694252432056|rider-284|1645635237429|0456f152-9a1b-48f...|americas/united_s...|
|  20220301132827300|20220301132827300...|da50d4f5-94cb-41c...|  americas/united_s...|52a5ee08-9376-495...|0.014159831486388885| 0.42849372303000655|driver-284| 0.9968531966280192| 0.9451993293955782| 2.375516772415698|rider-284|1645728852563|da50d4f5-94cb-41c...|americas/united_s...|
|  20220301105108074|20220301105108074...|b3bf0b93-768d-4be...|  americas/brazil/s...|6f82f351-9994-459...|  0.6100070562136587|  0.8779402295427752|driver-213| 0.3407870505929602| 0.5030798142293655|  43.4923811219014|rider-213|1645868768394|b3bf0b93-768d-4be...|americas/brazil/s...|
|  20220301105108074|20220301105108074...|7e195e8d-c6df-4fd...|  americas/brazil/s...|6f82f351-9994-459...|  0.4726905879569653| 0.46157858450465483|driver-213|  0.754803407008858| 0.9671159942018241|34.158284716382845|rider-213|1645602479789|7e195e8d-c6df-4fd...|americas/brazil/s...|
|  20220301105108074|20220301105108074...|3409ecd2-02c2-40c...|  americas/brazil/s...|6f82f351-9994-459...|  0.0750588760043035| 0.03844104444445928|driver-213|0.04376353354538354| 0.6346040067610669| 66.62084366450246|rider-213|1645621352954|3409ecd2-02c2-40c...|americas/brazil/s...|
|  20220301132827300|20220301132827300...|8d784cd0-02d9-429...|  americas/brazil/s...|6f82f351-9994-459...|  0.6570857443423376|   0.888493603696927|driver-284| 0.9036309069576131|0.37603706507284995| 63.72504913279929|rider-284|1645547517087|8d784cd0-02d9-429...|americas/brazil/s...|
|  20220301132827300|20220301132827300...|1e34b971-3dfc-489...|  americas/brazil/s...|6f82f351-9994-459...|  0.1593867607188556|0.010872312870502165|driver-284| 0.9808530350038475| 0.7963756520507014| 29.47661370147079|rider-284|1645899391613|1e34b971-3dfc-489...|americas/brazil/s...|
|  20220301105108074|20220301105108074...|60903a00-3fdc-45d...|    asia/india/chennai|2f5b659d-3738-48c...|   0.651058505660742|  0.8192868687714224|driver-213|0.20714896002914462|0.06224031095826987| 41.06290929046368|rider-213|1645503078006|60903a00-3fdc-45d...|  asia/india/chennai|
|  20220301105108074|20220301105108074...|22d1507b-7d02-402...|    asia/india/chennai|2f5b659d-3738-48c...|    0.40613510977307|  0.5644092139040959|driver-213|  0.798706304941517|0.02698359227182834|17.851135255091155|rider-213|1645948641664|22d1507b-7d02-402...|  asia/india/chennai|
|  20220301132827300|20220301132827300...|bc2e551e-4206-4f0...|    asia/india/chennai|2f5b659d-3738-48c...| 0.08528650347654165|  0.4006983139989222|driver-284| 0.1975324518739051|  0.908216792146506| 90.25710109008239|rider-284|1646095456906|bc2e551e-4206-4f0...|  asia/india/chennai|
|  20220301132827300|20220301132827300...|9cf010a9-7303-4c5...|    asia/india/chennai|2f5b659d-3738-48c...| 0.16603428449020086|  0.6999655248704163|driver-284| 0.5086437188581894| 0.6242134749327686| 9.384124531808036|rider-284|1645620049479|9cf010a9-7303-4c5...|  asia/india/chennai|
+-------------------+--------------------+--------------------+----------------------+--------------------+--------------------+--------------------+----------+-------------------+-------------------+------------------+---------+-------------+--------------------+--------------------+


Process finished with exit code 0

PS: the table gained only 7 rows even though 10 update records were generated and written in append mode; after the write the table holds 17 rows. That is because this append is not a plain append: with the option 'hoodie.datasource.write.operation': 'upsert', the write assumes the table already exists and upserts the new records, deduplicating them by uuid (the 10 generated records contain only 7 distinct uuid values, so only 7 rows survive). The official docs say: "In general, always use append mode unless you are trying to create the table for the first time". So when choosing between append and overwrite, pick append unless you are creating the table for the first time.
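
As a sketch of that knob: switching hoodie.datasource.write.operation changes what an append-mode write does ('insert' is another documented value, see https://hudi.apache.org/docs/configurations). The snippet below is a hypothetical variant of the update script above, reusing its hudi_options, df and basePath:

# Same options as above, but asking Hudi for a plain insert instead of an upsert;
# 'insert' skips the look-up-by-key/merge step, so existing rows with the same uuid are not updated
hudi_insert_options = dict(hudi_options)
hudi_insert_options['hoodie.datasource.write.operation'] = 'insert'

df.write.format("hudi"). \
    options(**hudi_insert_options). \
    mode("append"). \
    save(basePath)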

Extension: querying a historical version

From the second commit record we get an exact instant, "20220301132827300":

-rw-r--r-- 1 gavin wheel 4413 Mar 1 13:28 20220301132827300.commit

With that instant we can time-travel: querying at the instant itself returns the post-update data, while querying at any earlier instant returns the pre-update data:

spark.read.format("hudi").option("as.of.instant", "20220301132827300").load(basePath).show()

To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.1.2
      /_/

Using Python version 3.8.9 (default, Oct 26 2021 07:25:54)
Spark context Web UI available at http://192.168.24.227:4040
Spark context available as 'sc' (master = local[*], app id = local-1646101237379).
SparkSession available as 'spark'.
>>> basePath = "file:///tmp/hudi_trips_cow"
>>> df = spark.read.format("hudi").option("as.of.instant", "20220301132827300").load(basePath)
>>> df.count()
17
>>> df.show()                                                                   
+-------------------+--------------------+--------------------+----------------------+--------------------+--------------------+--------------------+----------+-------------------+-------------------+------------------+---------+-------------+--------------------+--------------------+
|_hoodie_commit_time|_hoodie_commit_seqno|  _hoodie_record_key|_hoodie_partition_path|   _hoodie_file_name|           begin_lat|           begin_lon|    driver|            end_lat|            end_lon|              fare|    rider|           ts|                uuid|       partitionpath|
+-------------------+--------------------+--------------------+----------------------+--------------------+--------------------+--------------------+----------+-------------------+-------------------+------------------+---------+-------------+--------------------+--------------------+
|  20220301105108074|20220301105108074...|c4340e2c-efd2-4a9...|  americas/united_s...|52a5ee08-9376-495...| 0.21624150367601136| 0.14285051259466197|driver-213| 0.5890949624813784| 0.0966823831927115| 93.56018115236618|rider-213|1645736908516|c4340e2c-efd2-4a9...|americas/united_s...|
|  20220301105108074|20220301105108074...|67ee90ec-d7b8-477...|  americas/united_s...|52a5ee08-9376-495...|  0.5731835407930634|  0.4923479652912024|driver-213|0.08988581780930216|0.42520899698713666| 64.27696295884016|rider-213|1645565690012|67ee90ec-d7b8-477...|americas/united_s...|
|  20220301105108074|20220301105108074...|91703076-f580-49f...|  americas/united_s...|52a5ee08-9376-495...| 0.11488393157088261|  0.6273212202489661|driver-213| 0.7454678537511295| 0.3954939864908973| 27.79478688582596|rider-213|1646031306513|91703076-f580-49f...|americas/united_s...|
|  20220301105108074|20220301105108074...|96a7571e-1e54-4bc...|  americas/united_s...|52a5ee08-9376-495...|  0.8742041526408587|  0.7528268153249502|driver-213| 0.9197827128888302|  0.362464770874404|19.179139106643607|rider-213|1645796169470|96a7571e-1e54-4bc...|americas/united_s...|
|  20220301105108074|20220301105108074...|3723b4ac-8841-4cd...|  americas/united_s...|52a5ee08-9376-495...|  0.1856488085068272|  0.9694586417848392|driver-213|0.38186367037201974|0.25252652214479043| 33.92216483948643|rider-213|1646085368961|3723b4ac-8841-4cd...|americas/united_s...|
|  20220301132827300|20220301132827300...|a24084ea-4473-459...|  americas/united_s...|52a5ee08-9376-495...|  0.4777395067707303|  0.3349917833248327|driver-284| 0.9735699951963335| 0.8144901865212508|  98.3428192817987|rider-284|1646066699577|a24084ea-4473-459...|americas/united_s...|
|  20220301132827300|20220301132827300...|0456f152-9a1b-48f...|  americas/united_s...|52a5ee08-9376-495...|  0.7340133901254792|  0.5142184937933181|driver-284| 0.7814655558162802| 0.6592596683641996|49.527694252432056|rider-284|1645635237429|0456f152-9a1b-48f...|americas/united_s...|
|  20220301132827300|20220301132827300...|da50d4f5-94cb-41c...|  americas/united_s...|52a5ee08-9376-495...|0.014159831486388885| 0.42849372303000655|driver-284| 0.9968531966280192| 0.9451993293955782| 2.375516772415698|rider-284|1645728852563|da50d4f5-94cb-41c...|americas/united_s...|
|  20220301105108074|20220301105108074...|b3bf0b93-768d-4be...|  americas/brazil/s...|6f82f351-9994-459...|  0.6100070562136587|  0.8779402295427752|driver-213| 0.3407870505929602| 0.5030798142293655|  43.4923811219014|rider-213|1645868768394|b3bf0b93-768d-4be...|americas/brazil/s...|
|  20220301105108074|20220301105108074...|7e195e8d-c6df-4fd...|  americas/brazil/s...|6f82f351-9994-459...|  0.4726905879569653| 0.46157858450465483|driver-213|  0.754803407008858| 0.9671159942018241|34.158284716382845|rider-213|1645602479789|7e195e8d-c6df-4fd...|americas/brazil/s...|
|  20220301105108074|20220301105108074...|3409ecd2-02c2-40c...|  americas/brazil/s...|6f82f351-9994-459...|  0.0750588760043035| 0.03844104444445928|driver-213|0.04376353354538354| 0.6346040067610669| 66.62084366450246|rider-213|1645621352954|3409ecd2-02c2-40c...|americas/brazil/s...|
|  20220301132827300|20220301132827300...|8d784cd0-02d9-429...|  americas/brazil/s...|6f82f351-9994-459...|  0.6570857443423376|   0.888493603696927|driver-284| 0.9036309069576131|0.37603706507284995| 63.72504913279929|rider-284|1645547517087|8d784cd0-02d9-429...|americas/brazil/s...|
|  20220301132827300|20220301132827300...|1e34b971-3dfc-489...|  americas/brazil/s...|6f82f351-9994-459...|  0.1593867607188556|0.010872312870502165|driver-284| 0.9808530350038475| 0.7963756520507014| 29.47661370147079|rider-284|1645899391613|1e34b971-3dfc-489...|americas/brazil/s...|
|  20220301105108074|20220301105108074...|60903a00-3fdc-45d...|    asia/india/chennai|2f5b659d-3738-48c...|   0.651058505660742|  0.8192868687714224|driver-213|0.20714896002914462|0.06224031095826987| 41.06290929046368|rider-213|1645503078006|60903a00-3fdc-45d...|  asia/india/chennai|
|  20220301105108074|20220301105108074...|22d1507b-7d02-402...|    asia/india/chennai|2f5b659d-3738-48c...|    0.40613510977307|  0.5644092139040959|driver-213|  0.798706304941517|0.02698359227182834|17.851135255091155|rider-213|1645948641664|22d1507b-7d02-402...|  asia/india/chennai|
|  20220301132827300|20220301132827300...|bc2e551e-4206-4f0...|    asia/india/chennai|2f5b659d-3738-48c...| 0.08528650347654165|  0.4006983139989222|driver-284| 0.1975324518739051|  0.908216792146506| 90.25710109008239|rider-284|1646095456906|bc2e551e-4206-4f0...|  asia/india/chennai|
|  20220301132827300|20220301132827300...|9cf010a9-7303-4c5...|    asia/india/chennai|2f5b659d-3738-48c...| 0.16603428449020086|  0.6999655248704163|driver-284| 0.5086437188581894| 0.6242134749327686| 9.384124531808036|rider-284|1645620049479|9cf010a9-7303-4c5...|  asia/india/chennai|
+-------------------+--------------------+--------------------+----------------------+--------------------+--------------------+--------------------+----------+-------------------+-------------------+------------------+---------+-------------+--------------------+--------------------+

## Query the data before the update: 20220301132827300 -> 20220301132527300 (any instant earlier than the second commit works)
>>> df_before_update = spark.read.format("hudi").option("as.of.instant", "20220301132527300").load(basePath)
>>> df_before_update.count()
10
>>> df_before_update.show()
+-------------------+--------------------+--------------------+----------------------+--------------------+-------------------+-------------------+----------+-------------------+-------------------+------------------+---------+-------------+--------------------+--------------------+
|_hoodie_commit_time|_hoodie_commit_seqno|  _hoodie_record_key|_hoodie_partition_path|   _hoodie_file_name|          begin_lat|          begin_lon|    driver|            end_lat|            end_lon|              fare|    rider|           ts|                uuid|       partitionpath|
+-------------------+--------------------+--------------------+----------------------+--------------------+-------------------+-------------------+----------+-------------------+-------------------+------------------+---------+-------------+--------------------+--------------------+
|  20220301105108074|20220301105108074...|c4340e2c-efd2-4a9...|  americas/united_s...|52a5ee08-9376-495...|0.21624150367601136|0.14285051259466197|driver-213| 0.5890949624813784| 0.0966823831927115| 93.56018115236618|rider-213|1645736908516|c4340e2c-efd2-4a9...|americas/united_s...|
|  20220301105108074|20220301105108074...|67ee90ec-d7b8-477...|  americas/united_s...|52a5ee08-9376-495...| 0.5731835407930634| 0.4923479652912024|driver-213|0.08988581780930216|0.42520899698713666| 64.27696295884016|rider-213|1645565690012|67ee90ec-d7b8-477...|americas/united_s...|
|  20220301105108074|20220301105108074...|91703076-f580-49f...|  americas/united_s...|52a5ee08-9376-495...|0.11488393157088261| 0.6273212202489661|driver-213| 0.7454678537511295| 0.3954939864908973| 27.79478688582596|rider-213|1646031306513|91703076-f580-49f...|americas/united_s...|
|  20220301105108074|20220301105108074...|96a7571e-1e54-4bc...|  americas/united_s...|52a5ee08-9376-495...| 0.8742041526408587| 0.7528268153249502|driver-213| 0.9197827128888302|  0.362464770874404|19.179139106643607|rider-213|1645796169470|96a7571e-1e54-4bc...|americas/united_s...|
|  20220301105108074|20220301105108074...|3723b4ac-8841-4cd...|  americas/united_s...|52a5ee08-9376-495...| 0.1856488085068272| 0.9694586417848392|driver-213|0.38186367037201974|0.25252652214479043| 33.92216483948643|rider-213|1646085368961|3723b4ac-8841-4cd...|americas/united_s...|
|  20220301105108074|20220301105108074...|b3bf0b93-768d-4be...|  americas/brazil/s...|6f82f351-9994-459...| 0.6100070562136587| 0.8779402295427752|driver-213| 0.3407870505929602| 0.5030798142293655|  43.4923811219014|rider-213|1645868768394|b3bf0b93-768d-4be...|americas/brazil/s...|
|  20220301105108074|20220301105108074...|7e195e8d-c6df-4fd...|  americas/brazil/s...|6f82f351-9994-459...| 0.4726905879569653|0.46157858450465483|driver-213|  0.754803407008858| 0.9671159942018241|34.158284716382845|rider-213|1645602479789|7e195e8d-c6df-4fd...|americas/brazil/s...|
|  20220301105108074|20220301105108074...|3409ecd2-02c2-40c...|  americas/brazil/s...|6f82f351-9994-459...| 0.0750588760043035|0.03844104444445928|driver-213|0.04376353354538354| 0.6346040067610669| 66.62084366450246|rider-213|1645621352954|3409ecd2-02c2-40c...|americas/brazil/s...|
|  20220301105108074|20220301105108074...|60903a00-3fdc-45d...|    asia/india/chennai|2f5b659d-3738-48c...|  0.651058505660742| 0.8192868687714224|driver-213|0.20714896002914462|0.06224031095826987| 41.06290929046368|rider-213|1645503078006|60903a00-3fdc-45d...|  asia/india/chennai|
|  20220301105108074|20220301105108074...|22d1507b-7d02-402...|    asia/india/chennai|2f5b659d-3738-48c...|   0.40613510977307| 0.5644092139040959|driver-213|  0.798706304941517|0.02698359227182834|17.851135255091155|rider-213|1645948641664|22d1507b-7d02-402...|  asia/india/chennai|
+-------------------+--------------------+--------------------+----------------------+--------------------+-------------------+-------------------+----------+-------------------+-------------------+------------------+---------+-------------+--------------------+--------------------+

Incremental query

Code

import pyspark

if __name__ == '__main__':
    builder = pyspark.sql.SparkSession.builder.appName("MyApp") \
        .config("spark.jars",
                "/Users/gavin/.ivy2/cache/org.apache.hudi/hudi-spark3.1.2-bundle_2.12/jars/hudi-spark3.1.2-bundle_2.12-0.10.1.jar,"
                "/Users/gavin/.ivy2/cache/org.apache.spark/spark-avro_2.12/jars/spark-avro_2.12-3.1.2.jar") \
        .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")

    spark = builder.getOrCreate()
    sc = spark.sparkContext

    # pyspark
    tableName = "hudi_trips_cow"
    basePath = "file:///tmp/hudi_trips_cow"

    # pyspark
    # reload data
    spark. \
        read. \
        format("hudi"). \
        load(basePath). \
        createOrReplaceTempView("hudi_trips_snapshot")

    # First fetch all existing commit times, then use the second-to-last one as the begin time
    # for the incremental query (leaving the end time unset returns everything after the begin time)
    commits = list(map(lambda row: row[0], spark.sql(
        "select distinct(_hoodie_commit_time) as commitTime from  hudi_trips_snapshot order by commitTime").limit(
        50).collect()))
    print(f'get commit info [{commits}]')
    beginTime = commits[len(commits) - 2]  # commit time we are interested in
    print(f'set beginTime as [{beginTime}]')
    # incrementally query data
    incremental_read_options = {
        'hoodie.datasource.query.type': 'incremental',
        'hoodie.datasource.read.begin.instanttime': beginTime,
    }

    tripsIncrementalDF = spark.read.format("hudi"). \
        options(**incremental_read_options). \
        load(basePath)
    tripsIncrementalDF.createOrReplaceTempView("hudi_trips_incremental")

    spark.sql(
        "select `_hoodie_commit_time`, fare, begin_lon, begin_lat, ts from  hudi_trips_incremental where fare > 20.0").show()

Execution result

/Users/gavin/PycharmProjects/pythonProject/venv/bin/python /Users/gavin/PycharmProjects/pythonProject/venv/spark/hudi/basic/basic_incremental_query.py
22/03/01 15:08:26 WARN Utils: Your hostname, GavindeMacBook-Pro.local resolves to a loopback address: 127.0.0.1; using 10.196.59.24 instead (on interface en0)
22/03/01 15:08:26 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
22/03/01 15:08:27 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
22/03/01 15:08:28 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
get commit info [['20220301105108074', '20220301132827300']]
set beginTime as [20220301105108074]
+-------------------+------------------+--------------------+-------------------+-------------+
|_hoodie_commit_time|              fare|           begin_lon|          begin_lat|           ts|
+-------------------+------------------+--------------------+-------------------+-------------+
|  20220301132827300|  98.3428192817987|  0.3349917833248327| 0.4777395067707303|1646066699577|
|  20220301132827300|49.527694252432056|  0.5142184937933181| 0.7340133901254792|1645635237429|
|  20220301132827300| 63.72504913279929|   0.888493603696927| 0.6570857443423376|1645547517087|
|  20220301132827300| 29.47661370147079|0.010872312870502165| 0.1593867607188556|1645899391613|
|  20220301132827300| 90.25710109008239|  0.4006983139989222|0.08528650347654165|1646095456906|
+-------------------+------------------+--------------------+-------------------+-------------+


Process finished with exit code 0

Point in time query (query data as of a specified instant)

Code

import pyspark

if __name__ == '__main__':
    builder = pyspark.sql.SparkSession.builder.appName("MyApp") \
        .config("spark.jars",
                "/Users/gavin/.ivy2/cache/org.apache.hudi/hudi-spark3.1.2-bundle_2.12/jars/hudi-spark3.1.2-bundle_2.12-0.10.1.jar,"
                "/Users/gavin/.ivy2/cache/org.apache.spark/spark-avro_2.12/jars/spark-avro_2.12-3.1.2.jar") \
        .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")

    spark = builder.getOrCreate()
    sc = spark.sparkContext
    basePath = "file:///tmp/hudi_trips_cow"

    # reload data
    spark. \
        read. \
        format("hudi"). \
        load(basePath). \
        createOrReplaceTempView("hudi_trips_snapshot")

    commits = list(map(lambda row: row[0], spark.sql(
        "select distinct(_hoodie_commit_time) as commitTime from  hudi_trips_snapshot order by commitTime").limit(
        50).collect()))
    endTime = commits[len(commits) - 2]

    beginTime = "000"  # Represents all commits > this time.

    # query point in time data
    point_in_time_read_options = {
        'hoodie.datasource.query.type': 'incremental',
        'hoodie.datasource.read.end.instanttime': endTime,
        'hoodie.datasource.read.begin.instanttime': beginTime
    }

    tripsPointInTimeDF = spark.read.format("hudi"). \
        options(**point_in_time_read_options). \
        load(basePath)

    tripsPointInTimeDF.createOrReplaceTempView("hudi_trips_point_in_time")
    spark.sql(
        "select `_hoodie_commit_time`, fare, begin_lon, begin_lat, ts from hudi_trips_point_in_time where fare > 20.0").show()

Execution result

/Users/gavin/PycharmProjects/pythonProject/venv/bin/python /Users/gavin/PycharmProjects/pythonProject/venv/spark/hudi/basic/basic_point_in_time_query.py
22/03/01 15:18:24 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
22/03/01 15:18:24 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
+-------------------+------------------+-------------------+-------------------+-------------+
|_hoodie_commit_time|              fare|          begin_lon|          begin_lat|           ts|
+-------------------+------------------+-------------------+-------------------+-------------+
|  20220301105108074| 93.56018115236618|0.14285051259466197|0.21624150367601136|1645736908516|
|  20220301105108074| 64.27696295884016| 0.4923479652912024| 0.5731835407930634|1645565690012|
|  20220301105108074| 27.79478688582596| 0.6273212202489661|0.11488393157088261|1646031306513|
|  20220301105108074| 33.92216483948643| 0.9694586417848392| 0.1856488085068272|1646085368961|
|  20220301105108074|  43.4923811219014| 0.8779402295427752| 0.6100070562136587|1645868768394|
|  20220301105108074|34.158284716382845|0.46157858450465483| 0.4726905879569653|1645602479789|
|  20220301105108074| 66.62084366450246|0.03844104444445928| 0.0750588760043035|1645621352954|
|  20220301105108074| 41.06290929046368| 0.8192868687714224|  0.651058505660742|1645503078006|
+-------------------+------------------+-------------------+-------------------+-------------+


Process finished with exit code 0

Delete

In this example, two records are picked from the existing table (via limit(2)); a delete is then issued for exactly those two records, removing them from the table.

PS: the delete operation must be written with save mode append.

Code

import pyspark

if __name__ == '__main__':
    builder = pyspark.sql.SparkSession.builder.appName("MyApp") \
        .config("spark.jars",
                "/Users/gavin/.ivy2/cache/org.apache.hudi/hudi-spark3.1.2-bundle_2.12/jars/hudi-spark3.1.2-bundle_2.12-0.10.1.jar,"
                "/Users/gavin/.ivy2/cache/org.apache.spark/spark-avro_2.12/jars/spark-avro_2.12-3.1.2.jar") \
        .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")

    spark = builder.getOrCreate()
    sc = spark.sparkContext

    # pyspark
    tableName = "hudi_trips_cow"
    basePath = "file:///tmp/hudi_trips_cow"

    # pyspark
    # reload data
    spark. \
        read. \
        format("hudi"). \
        load(basePath). \
        createOrReplaceTempView("hudi_trips_snapshot")

    before_count = spark.sql("select uuid, partitionpath from hudi_trips_snapshot").count()
    print(f'before delete , there exists [{before_count}] records')
    # fetch two records to be deleted
    ds = spark.sql("select uuid, partitionpath from hudi_trips_snapshot").limit(2)
    print(f'Records will be deleted :[{ds.collect()}]')
    # issue deletes
    hudi_delete_options = {
        'hoodie.table.name': tableName,
        'hoodie.datasource.write.recordkey.field': 'uuid',
        'hoodie.datasource.write.partitionpath.field': 'partitionpath',
        'hoodie.datasource.write.table.name': tableName,
        'hoodie.datasource.write.operation': 'delete',
        'hoodie.datasource.write.precombine.field': 'ts',
        'hoodie.upsert.shuffle.parallelism': 2,
        'hoodie.insert.shuffle.parallelism': 2
    }

    from pyspark.sql.functions import lit

    deletes = list(map(lambda row: (row[0], row[1]), ds.collect()))
    print(f'deletes data: [{deletes}]')
    # Add an extra 'ts' column to the DataFrame built from the records to delete
    df = spark.sparkContext.parallelize(deletes).toDF(['uuid', 'partitionpath']).withColumn('ts', lit(0.0))
    df.write.format("hudi"). \
        options(**hudi_delete_options). \
        mode("append"). \
        save(basePath)

    # run the same read query as above.
    roAfterDeleteViewDF = spark. \
        read. \
        format("hudi"). \
        load(basePath)
    roAfterDeleteViewDF.createOrReplaceTempView("hudi_trips_snapshot")
    # fetch should return (total - 2) records
    after_count = spark.sql("select uuid, partitionpath from hudi_trips_snapshot").count()
    print(f'after delete , there exists [{after_count}] records')

Execution result

/Users/gavin/PycharmProjects/pythonProject/venv/bin/python /Users/gavin/PycharmProjects/pythonProject/venv/spark/hudi/basic/basic_delete.py
22/03/01 15:59:23 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
22/03/01 15:59:24 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
before delete , there exists [17] records
Records will be deleted :[[Row(uuid='c4340e2c-efd2-4a92-9615-32822599d397', partitionpath='americas/united_states/san_francisco'), Row(uuid='67ee90ec-d7b8-4772-b02e-6a41a7556fa0', partitionpath='americas/united_states/san_francisco')]]
deletes data: [[('c4340e2c-efd2-4a92-9615-32822599d397', 'americas/united_states/san_francisco'), ('67ee90ec-d7b8-4772-b02e-6a41a7556fa0', 'americas/united_states/san_francisco')]]
22/03/01 15:59:34 WARN DFSPropertiesConfiguration: Cannot find HUDI_CONF_DIR, please set it as the dir of hudi-defaults.conf
22/03/01 15:59:34 WARN DFSPropertiesConfiguration: Properties file file:/etc/hudi/conf/hudi-defaults.conf not found. Ignoring to load props file
after delete , there exists [15] records

Process finished with exit code 0

(Supplement) Pure overwrite insert (no update of existing data)

The earlier examples used the upsert operation. As a supplement, this one shows a pure insert without updates: writing in overwrite mode simply replaces all existing data in the table.

Code

import pyspark

if __name__ == '__main__':
    builder = pyspark.sql.SparkSession.builder.appName("MyApp") \
        .config("spark.jars",
                "/Users/gavin/.ivy2/cache/org.apache.hudi/hudi-spark3.1.2-bundle_2.12/jars/hudi-spark3.1.2-bundle_2.12-0.10.1.jar,"
                "/Users/gavin/.ivy2/cache/org.apache.spark/spark-avro_2.12/jars/spark-avro_2.12-3.1.2.jar") \
        .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")

    spark = builder.getOrCreate()
    sc = spark.sparkContext

    # pyspark
    tableName = "hudi_trips_cow_insert_overwirte"
    basePath = "file:///tmp/hudi_trips_cow_insert_overwirte"
    dataGen = sc._jvm.org.apache.hudi.QuickstartUtils.DataGenerator()

    hudi_options = {
        'hoodie.table.name': tableName,
        'hoodie.datasource.write.recordkey.field': 'uuid',
        'hoodie.datasource.write.partitionpath.field': 'partitionpath',
        'hoodie.datasource.write.table.name': tableName,
        'hoodie.datasource.write.operation': 'insert',
        'hoodie.datasource.write.precombine.field': 'ts',
        'hoodie.upsert.shuffle.parallelism': 2,
        'hoodie.insert.shuffle.parallelism': 2
    }
    # pyspark
    # Generate 3 records for the demo
    inserts = sc._jvm.org.apache.hudi.QuickstartUtils.convertToStringList(dataGen.generateInserts(3))
    df = spark.read.json(spark.sparkContext.parallelize(inserts, 2))
    print(f'start to insert data into a new table:[{df.collect()}]')
    df.write.format("hudi").options(**hudi_options).mode("overwrite").save(basePath)
    print(spark.read.format("hudi").load(basePath).collect())
    spark.read.format("hudi").load(basePath).show()

    inserts = sc._jvm.org.apache.hudi.QuickstartUtils.convertToStringList(dataGen.generateInserts(1))
    df_new = spark.read.json(spark.sparkContext.parallelize(inserts, 2))
    print(f'start to insert data into a new table:[{df_new.collect()}]')
    df_new.write.format("hudi").options(**hudi_options).mode("overwrite").save(basePath)
    print(spark.read.format("hudi").load(basePath).collect())
    spark.read.format("hudi").load(basePath).show()

Execution result: the single record inserted afterwards completely overwrote the 3 records inserted earlier, so the table ends up with 1 record instead of 4.

/Users/gavin/PycharmProjects/pythonProject/venv/bin/python /Users/gavin/PycharmProjects/pythonProject/venv/spark/hudi/extend/insert_overwrite.py
22/03/01 16:21:03 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
22/03/01 16:21:04 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
start to insert data into a new table:[[Row(begin_lat=0.4726905879569653, begin_lon=0.46157858450465483, driver='driver-213', end_lat=0.754803407008858, end_lon=0.9671159942018241, fare=34.158284716382845, partitionpath='americas/brazil/sao_paulo', rider='rider-213', ts=1645537223581, uuid='a63b8b04-6dcc-4edc-af9f-2b9f6dcfe145'), Row(begin_lat=0.6100070562136587, begin_lon=0.8779402295427752, driver='driver-213', end_lat=0.3407870505929602, end_lon=0.5030798142293655, fare=43.4923811219014, partitionpath='americas/brazil/sao_paulo', rider='rider-213', ts=1645608818472, uuid='172f2894-285a-4b48-97c2-d92bf992697c'), Row(begin_lat=0.5731835407930634, begin_lon=0.4923479652912024, driver='driver-213', end_lat=0.08988581780930216, end_lon=0.42520899698713666, fare=64.27696295884016, partitionpath='americas/united_states/san_francisco', rider='rider-213', ts=1645587087764, uuid='aeea15f6-e5b7-438a-b1c6-c00c19347ca1')]]
22/03/01 16:21:09 WARN DFSPropertiesConfiguration: Cannot find HUDI_CONF_DIR, please set it as the dir of hudi-defaults.conf
22/03/01 16:21:09 WARN DFSPropertiesConfiguration: Properties file file:/etc/hudi/conf/hudi-defaults.conf not found. Ignoring to load props file
[Row(_hoodie_commit_time='20220301162109926', _hoodie_commit_seqno='20220301162109926_1_3', _hoodie_record_key='aeea15f6-e5b7-438a-b1c6-c00c19347ca1', _hoodie_partition_path='americas/united_states/san_francisco', _hoodie_file_name='2cde0990-e337-4e8e-a6df-802f1b45fd53-0_1-10-17_20220301162109926.parquet', begin_lat=0.5731835407930634, begin_lon=0.4923479652912024, driver='driver-213', end_lat=0.08988581780930216, end_lon=0.42520899698713666, fare=64.27696295884016, rider='rider-213', ts=1645587087764, uuid='aeea15f6-e5b7-438a-b1c6-c00c19347ca1', partitionpath='americas/united_states/san_francisco'), Row(_hoodie_commit_time='20220301162109926', _hoodie_commit_seqno='20220301162109926_0_1', _hoodie_record_key='a63b8b04-6dcc-4edc-af9f-2b9f6dcfe145', _hoodie_partition_path='americas/brazil/sao_paulo', _hoodie_file_name='45b0d7ee-88ec-4a35-ad65-be649efe88be-0_0-8-16_20220301162109926.parquet', begin_lat=0.4726905879569653, begin_lon=0.46157858450465483, driver='driver-213', end_lat=0.754803407008858, end_lon=0.9671159942018241, fare=34.158284716382845, rider='rider-213', ts=1645537223581, uuid='a63b8b04-6dcc-4edc-af9f-2b9f6dcfe145', partitionpath='americas/brazil/sao_paulo'), Row(_hoodie_commit_time='20220301162109926', _hoodie_commit_seqno='20220301162109926_0_2', _hoodie_record_key='172f2894-285a-4b48-97c2-d92bf992697c', _hoodie_partition_path='americas/brazil/sao_paulo', _hoodie_file_name='45b0d7ee-88ec-4a35-ad65-be649efe88be-0_0-8-16_20220301162109926.parquet', begin_lat=0.6100070562136587, begin_lon=0.8779402295427752, driver='driver-213', end_lat=0.3407870505929602, end_lon=0.5030798142293655, fare=43.4923811219014, rider='rider-213', ts=1645608818472, uuid='172f2894-285a-4b48-97c2-d92bf992697c', partitionpath='americas/brazil/sao_paulo')]
+-------------------+--------------------+--------------------+----------------------+--------------------+------------------+-------------------+----------+-------------------+-------------------+------------------+---------+-------------+--------------------+--------------------+
|_hoodie_commit_time|_hoodie_commit_seqno|  _hoodie_record_key|_hoodie_partition_path|   _hoodie_file_name|         begin_lat|          begin_lon|    driver|            end_lat|            end_lon|              fare|    rider|           ts|                uuid|       partitionpath|
+-------------------+--------------------+--------------------+----------------------+--------------------+------------------+-------------------+----------+-------------------+-------------------+------------------+---------+-------------+--------------------+--------------------+
|  20220301162109926|20220301162109926...|aeea15f6-e5b7-438...|  americas/united_s...|2cde0990-e337-4e8...|0.5731835407930634| 0.4923479652912024|driver-213|0.08988581780930216|0.42520899698713666| 64.27696295884016|rider-213|1645587087764|aeea15f6-e5b7-438...|americas/united_s...|
|  20220301162109926|20220301162109926...|a63b8b04-6dcc-4ed...|  americas/brazil/s...|45b0d7ee-88ec-4a3...|0.4726905879569653|0.46157858450465483|driver-213|  0.754803407008858| 0.9671159942018241|34.158284716382845|rider-213|1645537223581|a63b8b04-6dcc-4ed...|americas/brazil/s...|
|  20220301162109926|20220301162109926...|172f2894-285a-4b4...|  americas/brazil/s...|45b0d7ee-88ec-4a3...|0.6100070562136587| 0.8779402295427752|driver-213| 0.3407870505929602| 0.5030798142293655|  43.4923811219014|rider-213|1645608818472|172f2894-285a-4b4...|americas/brazil/s...|
+-------------------+--------------------+--------------------+----------------------+--------------------+------------------+-------------------+----------+-------------------+-------------------+------------------+---------+-------------+--------------------+--------------------+

start to insert data into a new table:[[Row(begin_lat=0.6220454661413275, begin_lon=0.72024792576853, driver='driver-226', end_lat=0.9048755755365163, end_lon=0.727695054518325, fare=40.613510977307, partitionpath='americas/united_states/san_francisco', rider='rider-226', ts=1645914401180, uuid='db59af46-0fcf-4bb7-ab4a-b9387fb710d3')]]
22/03/01 16:21:14 WARN HoodieSparkSqlWriter$: hoodie table at file:/tmp/hudi_trips_cow_insert_overwirte already exists. Deleting existing data & overwriting with new data.
[Row(_hoodie_commit_time='20220301162114472', _hoodie_commit_seqno='20220301162114472_0_4', _hoodie_record_key='db59af46-0fcf-4bb7-ab4a-b9387fb710d3', _hoodie_partition_path='americas/united_states/san_francisco', _hoodie_file_name='2307cfa5-13d3-481d-8365-d8f8f4e1027a-0_0-35-58_20220301162114472.parquet', begin_lat=0.6220454661413275, begin_lon=0.72024792576853, driver='driver-226', end_lat=0.9048755755365163, end_lon=0.727695054518325, fare=40.613510977307, rider='rider-226', ts=1645914401180, uuid='db59af46-0fcf-4bb7-ab4a-b9387fb710d3', partitionpath='americas/united_states/san_francisco')]
+-------------------+--------------------+--------------------+----------------------+--------------------+------------------+----------------+----------+------------------+-----------------+---------------+---------+-------------+--------------------+--------------------+
|_hoodie_commit_time|_hoodie_commit_seqno|  _hoodie_record_key|_hoodie_partition_path|   _hoodie_file_name|         begin_lat|       begin_lon|    driver|           end_lat|          end_lon|           fare|    rider|           ts|                uuid|       partitionpath|
+-------------------+--------------------+--------------------+----------------------+--------------------+------------------+----------------+----------+------------------+-----------------+---------------+---------+-------------+--------------------+--------------------+
|  20220301162114472|20220301162114472...|db59af46-0fcf-4bb...|  americas/united_s...|2307cfa5-13d3-481...|0.6220454661413275|0.72024792576853|driver-226|0.9048755755365163|0.727695054518325|40.613510977307|rider-226|1645914401180|db59af46-0fcf-4bb...|americas/united_s...|
+-------------------+--------------------+--------------------+----------------------+--------------------+------------------+----------------+----------+------------------+-----------------+---------------+---------+-------------+--------------------+--------------------+


Process finished with exit code 0

FAQ

'JavaPackage' object is not callable

ERROR INFO

/Users/gavin/PycharmProjects/pythonProject/venv/bin/python /Users/gavin/PycharmProjects/pythonProject/venv/spark/hudi/basic.py
22/03/01 10:53:03 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
22/03/01 10:53:04 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
Traceback (most recent call last):
  File "/Users/gavin/PycharmProjects/pythonProject/venv/spark/hudi/basic.py", line 13, in 
    dataGen = sc._jvm.org.apache.hudi.QuickstartUtils.DataGenerator()
TypeError: 'JavaPackage' object is not callable

Process finished with exit code 1

Solution

The Spark runtime is missing the required jars. Point Spark at them explicitly with .config("spark.jars", "${YOUR_JAR_PATH}").
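
A minimal sketch of that configuration (the jar paths below are placeholders, not actual locations; substitute wherever the hudi-spark bundle and spark-avro jars live on your machine):

import pyspark

# Put the Hudi bundle and spark-avro jars on the classpath so that JVM classes
# such as org.apache.hudi.QuickstartUtils become callable from py4j.
builder = pyspark.sql.SparkSession.builder.appName("MyApp") \
    .config("spark.jars",
            "/path/to/hudi-spark3.1.2-bundle_2.12-0.10.1.jar,"  # placeholder path
            "/path/to/spark-avro_2.12-3.1.2.jar") \
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")

spark = builder.getOrCreate()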

java.lang.ClassNotFoundException: hudi.DefaultSource

ERROR INFO

/Users/gavin/PycharmProjects/pythonProject/venv/bin/python /Users/gavin/PycharmProjects/pythonProject/venv/spark/hudi/basic_query.py
22/03/01 11:11:52 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
22/03/01 11:11:53 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
Traceback (most recent call last):
  File "/Users/gavin/PycharmProjects/pythonProject/venv/spark/hudi/basic_query.py", line 12, in <module>
    tripsSnapshotDF = spark. \
  File "/Users/gavin/PycharmProjects/pythonProject/venv/lib/python3.8/site-packages/pyspark/sql/readwriter.py", line 204, in load
    return self._df(self._jreader.load(path))
  File "/Users/gavin/PycharmProjects/pythonProject/venv/lib/python3.8/site-packages/py4j/java_gateway.py", line 1304, in __call__
    return_value = get_return_value(
  File "/Users/gavin/PycharmProjects/pythonProject/venv/lib/python3.8/site-packages/pyspark/sql/utils.py", line 111, in deco
    return f(*a, **kw)
  File "/Users/gavin/PycharmProjects/pythonProject/venv/lib/python3.8/site-packages/py4j/protocol.py", line 326, in get_return_value
    raise Py4JJavaError(
py4j.protocol.Py4JJavaError: An error occurred while calling o31.load.
: java.lang.ClassNotFoundException: Failed to find data source: hudi. Please find packages at http://spark.apache.org/third-party-projects.html
	at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:692)
	at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSourceV2(DataSource.scala:746)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:265)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:239)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ClassNotFoundException: hudi.DefaultSource
	at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
	at org.apache.spark.sql.execution.datasources.DataSource$.$anonfun$lookupDataSource$5(DataSource.scala:666)
	at scala.util.Try$.apply(Try.scala:213)
	at org.apache.spark.sql.execution.datasources.DataSource$.$anonfun$lookupDataSource$4(DataSource.scala:666)
	at scala.util.Failure.orElse(Try.scala:224)
	at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:666)
	... 14 more


Process finished with exit code 1

Solution

As above, the required jars are missing from the Spark runtime. Specify them with .config("spark.jars", "${YOUR_JAR_PATH}").
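
If the machine can reach Maven Central, an alternative sketch is to let Spark resolve the dependencies by coordinates instead of local jar paths, using the standard spark.jars.packages setting (the coordinates below match the versions used throughout this article):

import pyspark

# Resolve the Hudi bundle and spark-avro from Maven coordinates instead of local files.
builder = pyspark.sql.SparkSession.builder.appName("MyApp") \
    .config("spark.jars.packages",
            "org.apache.hudi:hudi-spark3.1.2-bundle_2.12:0.10.1,"
            "org.apache.spark:spark-avro_2.12:3.1.2") \
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")

spark = builder.getOrCreate()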

Data must have been written before performing the update operation

This error is thrown when calling dataGen.generateUpdates(10); the update data cannot be generated.

ERROR INFO

/Users/gavin/PycharmProjects/pythonProject/venv/bin/python /Users/gavin/PycharmProjects/pythonProject/venv/spark/hudi/basic_update.py
22/03/01 13:15:58 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
22/03/01 13:15:59 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
Traceback (most recent call last):
  File "/Users/gavin/PycharmProjects/pythonProject/venv/spark/hudi/basic_update.py", line 29, in <module>
    updates = sc._jvm.org.apache.hudi.QuickstartUtils.convertToStringList(dataGen.generateUpdates(10))
  File "/Users/gavin/PycharmProjects/pythonProject/venv/lib/python3.8/site-packages/py4j/java_gateway.py", line 1304, in __call__
    return_value = get_return_value(
  File "/Users/gavin/PycharmProjects/pythonProject/venv/lib/python3.8/site-packages/pyspark/sql/utils.py", line 111, in deco
    return f(*a, **kw)
  File "/Users/gavin/PycharmProjects/pythonProject/venv/lib/python3.8/site-packages/py4j/protocol.py", line 326, in get_return_value
    raise Py4JJavaError(
py4j.protocol.Py4JJavaError: An error occurred while calling o35.generateUpdates.
: org.apache.hudi.exception.HoodieException: Data must have been written before performing the update operation
	at org.apache.hudi.QuickstartUtils$DataGenerator.generateUpdates(QuickstartUtils.java:180)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)


Process finished with exit code 1

Solution

The error message indicates that an insert must happen before the update. For the DataGenerator, "insert" means generating insert records, and the update records are derived from those previously generated inserts. The fix is therefore to call dataGen.generateInserts(10) once before calling dataGen.generateUpdates(10); a sketch of the correct call order follows.
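
A minimal sketch of the correct call order, assuming the same SparkSession, dataGen and hudi_options setup as in the insert/update examples above:

# generateUpdates derives its records from keys remembered by a previous
# generateInserts call on the same DataGenerator instance, so insert first.
inserts = sc._jvm.org.apache.hudi.QuickstartUtils.convertToStringList(dataGen.generateInserts(10))
df = spark.read.json(spark.sparkContext.parallelize(inserts, 2))
df.write.format("hudi").options(**hudi_options).mode("overwrite").save(basePath)

# Only now can update records be generated and upserted against the same table.
updates = sc._jvm.org.apache.hudi.QuickstartUtils.convertToStringList(dataGen.generateUpdates(10))
df_updates = spark.read.json(spark.sparkContext.parallelize(updates, 2))
df_updates.write.format("hudi").options(**hudi_options).mode("append").save(basePath)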

References

[1] Apache Hudi official documentation: https://hudi.apache.org/docs/quick-start-guide
