This article follows the quick-start-guide section of the official Hudi documentation: https://hudi.apache.org/docs/quick-start-guide
(Simply following the docs did not work for me out of the box; I hit a few small pitfalls along the way.)
Please strictly follow the jar version constraints:
| Hudi | Supported Spark 3 version |
|---|---|
| 0.10.0 | 3.1.x (default build), 3.0.x |
| 0.7.0 - 0.9.0 | 3.0.x |
| 0.6.0 and prior | not supported |
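Before picking a bundle, it helps to confirm which Spark version your local pyspark actually is; a minimal sketch (assuming pyspark is installed in the active environment):

```python
import pyspark

# Print the installed PySpark version so the matching Hudi bundle can be chosen,
# e.g. hudi-spark3.1.2-bundle_2.12 for Spark 3.1.x.
print(pyspark.__version__)  # e.g. '3.1.2'
```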
This step follows the description on the official site; the only problem was that I had not downloaded the jars first, which led to a tiny pitfall.
ERROR INFO:
(venv) gavin@GavindeMacBook-Pro test % which python3.8
/Users/gavin/PycharmProjects/pythonProject/venv/bin/python3.8
(venv) gavin@GavindeMacBook-Pro test % export PYSPARK_PYTHON=$(which python3.8)
(venv) gavin@GavindeMacBook-Pro test % echo ${PYSPARK_PYTHON}
/Users/gavin/PycharmProjects/pythonProject/venv/bin/python3.8
(venv) gavin@GavindeMacBook-Pro test % pyspark --packages org.apache.hudi:hudi-spark3.1.2-bundle_2.12:0.10.1,org.apache.spark:spark-avro_2.12:3.1.2 --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
Python 3.8.9 (default, Oct 26 2021, 07:25:54)
[Clang 13.0.0 (clang-1300.0.29.30)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
:: loading settings :: url = jar:file:/Users/gavin/PycharmProjects/pythonProject/venv/lib/python3.8/site-packages/pyspark/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
Ivy Default Cache set to: /Users/gavin/.ivy2/cache
The jars for the packages stored in: /Users/gavin/.ivy2/jars
org.apache.hudi#hudi-spark3.1.2-bundle_2.12 added as a dependency
org.apache.spark#spark-avro_2.12 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-311b5800-4168-498b-987f-f714233cf50c;1.0
confs: [default]
:: resolution report :: resolve 614199ms :: artifacts dl 0ms
:: modules in use:
---------------------------------------------------------------------
| | modules || artifacts |
| conf | number| search|dwnlded|evicted|| number|dwnlded|
---------------------------------------------------------------------
| default | 2 | 0 | 0 | 0 || 0 | 0 |
---------------------------------------------------------------------
:: problems summary ::
:::: WARNINGS
module not found: org.apache.hudi#hudi-spark3.1.2-bundle_2.12;0.10.1
==== local-m2-cache: tried
file:/Users/gavin/.m2/repository/org/apache/hudi/hudi-spark3.1.2-bundle_2.12/0.10.1/hudi-spark3.1.2-bundle_2.12-0.10.1.pom
-- artifact org.apache.hudi#hudi-spark3.1.2-bundle_2.12;0.10.1!hudi-spark3.1.2-bundle_2.12.jar:
file:/Users/gavin/.m2/repository/org/apache/hudi/hudi-spark3.1.2-bundle_2.12/0.10.1/hudi-spark3.1.2-bundle_2.12-0.10.1.jar
==== local-ivy-cache: tried
/Users/gavin/.ivy2/local/org.apache.hudi/hudi-spark3.1.2-bundle_2.12/0.10.1/ivys/ivy.xml
-- artifact org.apache.hudi#hudi-spark3.1.2-bundle_2.12;0.10.1!hudi-spark3.1.2-bundle_2.12.jar:
/Users/gavin/.ivy2/local/org.apache.hudi/hudi-spark3.1.2-bundle_2.12/0.10.1/jars/hudi-spark3.1.2-bundle_2.12.jar
==== central: tried
https://repo1.maven.org/maven2/org/apache/hudi/hudi-spark3.1.2-bundle_2.12/0.10.1/hudi-spark3.1.2-bundle_2.12-0.10.1.pom
-- artifact org.apache.hudi#hudi-spark3.1.2-bundle_2.12;0.10.1!hudi-spark3.1.2-bundle_2.12.jar:
https://repo1.maven.org/maven2/org/apache/hudi/hudi-spark3.1.2-bundle_2.12/0.10.1/hudi-spark3.1.2-bundle_2.12-0.10.1.jar
==== spark-packages: tried
https://repos.spark-packages.org/org/apache/hudi/hudi-spark3.1.2-bundle_2.12/0.10.1/hudi-spark3.1.2-bundle_2.12-0.10.1.pom
-- artifact org.apache.hudi#hudi-spark3.1.2-bundle_2.12;0.10.1!hudi-spark3.1.2-bundle_2.12.jar:
https://repos.spark-packages.org/org/apache/hudi/hudi-spark3.1.2-bundle_2.12/0.10.1/hudi-spark3.1.2-bundle_2.12-0.10.1.jar
module not found: org.apache.spark#spark-avro_2.12;3.1.2
==== local-m2-cache: tried
file:/Users/gavin/.m2/repository/org/apache/spark/spark-avro_2.12/3.1.2/spark-avro_2.12-3.1.2.pom
-- artifact org.apache.spark#spark-avro_2.12;3.1.2!spark-avro_2.12.jar:
file:/Users/gavin/.m2/repository/org/apache/spark/spark-avro_2.12/3.1.2/spark-avro_2.12-3.1.2.jar
==== local-ivy-cache: tried
/Users/gavin/.ivy2/local/org.apache.spark/spark-avro_2.12/3.1.2/ivys/ivy.xml
-- artifact org.apache.spark#spark-avro_2.12;3.1.2!spark-avro_2.12.jar:
/Users/gavin/.ivy2/local/org.apache.spark/spark-avro_2.12/3.1.2/jars/spark-avro_2.12.jar
==== central: tried
https://repo1.maven.org/maven2/org/apache/spark/spark-avro_2.12/3.1.2/spark-avro_2.12-3.1.2.pom
-- artifact org.apache.spark#spark-avro_2.12;3.1.2!spark-avro_2.12.jar:
https://repo1.maven.org/maven2/org/apache/spark/spark-avro_2.12/3.1.2/spark-avro_2.12-3.1.2.jar
==== spark-packages: tried
https://repos.spark-packages.org/org/apache/spark/spark-avro_2.12/3.1.2/spark-avro_2.12-3.1.2.pom
-- artifact org.apache.spark#spark-avro_2.12;3.1.2!spark-avro_2.12.jar:
https://repos.spark-packages.org/org/apache/spark/spark-avro_2.12/3.1.2/spark-avro_2.12-3.1.2.jar
::::::::::::::::::::::::::::::::::::::::::::::
:: UNRESOLVED DEPENDENCIES ::
::::::::::::::::::::::::::::::::::::::::::::::
:: org.apache.hudi#hudi-spark3.1.2-bundle_2.12;0.10.1: not found
:: org.apache.spark#spark-avro_2.12;3.1.2: not found
::::::::::::::::::::::::::::::::::::::::::::::
:::: ERRORS
Server access error at url https://repo1.maven.org/maven2/org/apache/hudi/hudi-spark3.1.2-bundle_2.12/0.10.1/hudi-spark3.1.2-bundle_2.12-0.10.1.pom (java.net.ConnectException: Operation timed out (Connection timed out))
Server access error at url https://repo1.maven.org/maven2/org/apache/hudi/hudi-spark3.1.2-bundle_2.12/0.10.1/hudi-spark3.1.2-bundle_2.12-0.10.1.jar (java.net.ConnectException: Operation timed out (Connection timed out))
Server access error at url https://repos.spark-packages.org/org/apache/hudi/hudi-spark3.1.2-bundle_2.12/0.10.1/hudi-spark3.1.2-bundle_2.12-0.10.1.pom (java.net.ConnectException: Operation timed out (Connection timed out))
Server access error at url https://repos.spark-packages.org/org/apache/hudi/hudi-spark3.1.2-bundle_2.12/0.10.1/hudi-spark3.1.2-bundle_2.12-0.10.1.jar (java.net.ConnectException: Operation timed out (Connection timed out))
Server access error at url https://repo1.maven.org/maven2/org/apache/spark/spark-avro_2.12/3.1.2/spark-avro_2.12-3.1.2.pom (java.net.ConnectException: Operation timed out (Connection timed out))
Server access error at url https://repo1.maven.org/maven2/org/apache/spark/spark-avro_2.12/3.1.2/spark-avro_2.12-3.1.2.jar (java.net.ConnectException: Operation timed out (Connection timed out))
Server access error at url https://repos.spark-packages.org/org/apache/spark/spark-avro_2.12/3.1.2/spark-avro_2.12-3.1.2.pom (java.net.ConnectException: Operation timed out (Connection timed out))
Server access error at url https://repos.spark-packages.org/org/apache/spark/spark-avro_2.12/3.1.2/spark-avro_2.12-3.1.2.jar (java.net.ConnectException: Operation timed out (Connection timed out))
:: USE VERBOSE OR DEBUG MESSAGE LEVEL FOR MORE DETAILS
Exception in thread "main" java.lang.RuntimeException: [unresolved dependency: org.apache.hudi#hudi-spark3.1.2-bundle_2.12;0.10.1: not found, unresolved dependency: org.apache.spark#spark-avro_2.12;3.1.2: not found]
at org.apache.spark.deploy.SparkSubmitUtils$.resolveMavenCoordinates(SparkSubmit.scala:1429)
at org.apache.spark.deploy.DependencyUtils$.resolveMavenDependencies(DependencyUtils.scala:54)
at org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:308)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:894)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1039)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1048)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Traceback (most recent call last):
File "/Users/gavin/PycharmProjects/pythonProject/venv/lib/python3.8/site-packages/pyspark/python/pyspark/shell.py", line 35, in <module>
SparkContext._ensure_initialized() # type: ignore
File "/Users/gavin/PycharmProjects/pythonProject/venv/lib/python3.8/site-packages/pyspark/context.py", line 331, in _ensure_initialized
SparkContext._gateway = gateway or launch_gateway(conf)
File "/Users/gavin/PycharmProjects/pythonProject/venv/lib/python3.8/site-packages/pyspark/java_gateway.py", line 108, in launch_gateway
raise Exception("Java gateway process exited before sending its port number")
Exception: Java gateway process exited before sending its port number
>>> exit()
From the log we can see that resolution also looks in the local Maven repository, so I used Maven to download the two required jars (for Spark 3.1.2):
<dependency>
    <groupId>org.apache.hudi</groupId>
    <artifactId>hudi-spark3.1.2-bundle_2.12</artifactId>
    <version>0.10.1</version>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-avro_2.12</artifactId>
    <version>3.1.2</version>
</dependency>
You can also download them directly from the repository yourself:
Hudi jar download:
https://repo1.maven.org/maven2/org/apache/hudi/hudi-spark3.1.2-bundle_2.12/0.10.1/
https://repo1.maven.org/maven2/org/apache/hudi/hudi/
After the Maven download completes, the jars can be seen in the local repository: /Users/gavin/.m2/repository/org/apache/hudi/…
(venv) gavin@GavindeMacBook-Pro test % pyspark --packages org.apache.hudi:hudi-spark3.1.2-bundle_2.12:0.10.1,org.apache.spark:spark-avro_2.12:3.1.2 --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
Python 3.8.9 (default, Oct 26 2021, 07:25:54)
[Clang 13.0.0 (clang-1300.0.29.30)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
22/03/01 10:20:28 WARN Utils: Your hostname, GavindeMacBook-Pro.local resolves to a loopback address: 127.0.0.1; using 192.168.24.227 instead (on interface en0)
22/03/01 10:20:28 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
:: loading settings :: url = jar:file:/Users/gavin/PycharmProjects/pythonProject/venv/lib/python3.8/site-packages/pyspark/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
Ivy Default Cache set to: /Users/gavin/.ivy2/cache
The jars for the packages stored in: /Users/gavin/.ivy2/jars
org.apache.hudi#hudi-spark3.1.2-bundle_2.12 added as a dependency
org.apache.spark#spark-avro_2.12 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-9a87dae7-3c6a-4133-838b-c7050b1d8b89;1.0
confs: [default]
found org.apache.hudi#hudi-spark3.1.2-bundle_2.12;0.10.1 in local-m2-cache
found org.apache.spark#spark-avro_2.12;3.1.2 in local-m2-cache
found org.spark-project.spark#unused;1.0.0 in local-m2-cache
downloading file:/Users/gavin/.m2/repository/org/apache/hudi/hudi-spark3.1.2-bundle_2.12/0.10.1/hudi-spark3.1.2-bundle_2.12-0.10.1.jar ...
[SUCCESSFUL ] org.apache.hudi#hudi-spark3.1.2-bundle_2.12;0.10.1!hudi-spark3.1.2-bundle_2.12.jar (54ms)
downloading file:/Users/gavin/.m2/repository/org/apache/spark/spark-avro_2.12/3.1.2/spark-avro_2.12-3.1.2.jar ...
[SUCCESSFUL ] org.apache.spark#spark-avro_2.12;3.1.2!spark-avro_2.12.jar (2ms)
downloading file:/Users/gavin/.m2/repository/org/spark-project/spark/unused/1.0.0/unused-1.0.0.jar ...
[SUCCESSFUL ] org.spark-project.spark#unused;1.0.0!unused.jar (2ms)
:: resolution report :: resolve 6622ms :: artifacts dl 62ms
:: modules in use:
org.apache.hudi#hudi-spark3.1.2-bundle_2.12;0.10.1 from local-m2-cache in [default]
org.apache.spark#spark-avro_2.12;3.1.2 from local-m2-cache in [default]
org.spark-project.spark#unused;1.0.0 from local-m2-cache in [default]
---------------------------------------------------------------------
| | modules || artifacts |
| conf | number| search|dwnlded|evicted|| number|dwnlded|
---------------------------------------------------------------------
| default | 3 | 3 | 3 | 0 || 3 | 3 |
---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent-9a87dae7-3c6a-4133-838b-c7050b1d8b89
confs: [default]
3 artifacts copied, 0 already retrieved (38092kB/67ms)
22/03/01 10:20:35 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/__ / .__/\_,_/_/ /_/\_\ version 3.1.2
/_/
Using Python version 3.8.9 (default, Oct 26 2021 07:25:54)
Spark context Web UI available at http://192.168.24.227:4040
Spark context available as 'sc' (master = local[*], app id = local-1646101237379).
SparkSession available as 'spark'.
>>>
This example demonstrates the "upsert" write type, i.e. "update the record if a matching one exists, otherwise insert it". Which write type is used is controlled by the option "hoodie.datasource.write.operation"; see https://hudi.apache.org/docs/configurations for the details of this parameter.
Based on the official documentation (https://hudi.apache.org/docs/quick-start-guide), the insert code looks like this:
import pyspark

if __name__ == '__main__':
    builder = pyspark.sql.SparkSession.builder.appName("MyApp") \
        .config("spark.jars", "/Users/gavin/.ivy2/cache/org.apache.hudi/hudi-spark3.1.2-bundle_2.12/jars/hudi-spark3.1.2-bundle_2.12-0.10.1.jar,"
                              "/Users/gavin/.ivy2/cache/org.apache.spark/spark-avro_2.12/jars/spark-avro_2.12-3.1.2.jar") \
        .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    spark = builder.getOrCreate()
    sc = spark.sparkContext

    # pyspark
    tableName = "hudi_trips_cow"
    basePath = "file:///tmp/hudi_trips_cow"
    dataGen = sc._jvm.org.apache.hudi.QuickstartUtils.DataGenerator()

    # pyspark
    inserts = sc._jvm.org.apache.hudi.QuickstartUtils.convertToStringList(dataGen.generateInserts(10))
    df = spark.read.json(spark.sparkContext.parallelize(inserts, 2))

    hudi_options = {
        'hoodie.table.name': tableName,
        'hoodie.datasource.write.recordkey.field': 'uuid',
        'hoodie.datasource.write.partitionpath.field': 'partitionpath',
        'hoodie.datasource.write.table.name': tableName,
        'hoodie.datasource.write.operation': 'upsert',
        'hoodie.datasource.write.precombine.field': 'ts',
        'hoodie.upsert.shuffle.parallelism': 2,
        'hoodie.insert.shuffle.parallelism': 2
    }

    df.write.format("hudi"). \
        options(**hudi_options). \
        mode("overwrite"). \
        save(basePath)
PS: the inserted data is sample data produced by "org.apache.hudi.QuickstartUtils.DataGenerator()" (this is exactly what the official quick-start code does; see the Query section for the actual data contents).
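If you want to peek at the generated records before they are written, the DataFrame can be inspected directly; a small sketch reusing the spark, sc and dataGen objects from the code above:

```python
# Reuses spark, sc and dataGen from the insert example above.
sample = sc._jvm.org.apache.hudi.QuickstartUtils.convertToStringList(dataGen.generateInserts(2))
sample_df = spark.read.json(spark.sparkContext.parallelize(sample, 1))
sample_df.printSchema()         # uuid, ts, rider, driver, fare, partitionpath, ...
sample_df.show(truncate=False)  # the raw records, before Hudi adds the _hoodie_* columns
```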
After running the code, the following directory structure is produced (a typical partitioned-table layout):
gavin@GavindeMacBook-Pro apache % tree /tmp/hudi_trips_cow
/tmp/hudi_trips_cow
├── americas
│ ├── brazil
│ │ └── sao_paulo
│ │ └── 6f82f351-9994-459d-a20c-77baa91ad323-0_0-27-31_20220301105108074.parquet
│ └── united_states
│ └── san_francisco
│ └── 52a5ee08-9376-4954-bb8f-f7f519b8b40e-0_1-33-32_20220301105108074.parquet
└── asia
└── india
└── chennai
└── 2f5b659d-3738-48ca-b590-bbce52e98642-0_2-33-33_20220301105108074.parquet
8 directories, 3 files
gavin@GavindeMacBook-Pro apache %
Extension
Besides the data files themselves, Hudi also keeps a hidden metadata directory, ".hoodie"; its contents will be described later:
gavin@GavindeMacBook-Pro hudi_trips_cow % ll -a
total 0
drwxr-xr-x 5 gavin wheel 160 Mar 1 10:51 .
drwxrwxrwt 10 root wheel 320 Mar 1 11:26 ..
drwxr-xr-x 13 gavin wheel 416 Mar 1 10:51 .hoodie
drwxr-xr-x 4 gavin wheel 128 Mar 1 10:51 americas
drwxr-xr-x 3 gavin wheel 96 Mar 1 10:51 asia
gavin@GavindeMacBook-Pro hudi_trips_cow % tree .hoodie
.hoodie
├── 20220301105108074.commit
├── 20220301105108074.commit.requested
├── 20220301105108074.inflight
├── archived
└── hoodie.properties
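The instant timestamps that show up later (for time travel and incremental queries) are simply the names of these commit files, so they can be listed with a few lines of Python; a minimal sketch, assuming the table path used in this article:

```python
import os

# List completed commit instants by scanning the .hoodie timeline folder.
# The path is the table location used in this article; adjust as needed.
hoodie_dir = "/tmp/hudi_trips_cow/.hoodie"
commits = sorted(f.split(".")[0] for f in os.listdir(hoodie_dir) if f.endswith(".commit"))
print(commits)  # e.g. ['20220301105108074']
```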
Query code
import pyspark

if __name__ == '__main__':
    builder = pyspark.sql.SparkSession.builder.appName("MyApp") \
        .config("spark.jars",
                "/Users/gavin/.ivy2/cache/org.apache.hudi/hudi-spark3.1.2-bundle_2.12/jars/hudi-spark3.1.2-bundle_2.12-0.10.1.jar,"
                "/Users/gavin/.ivy2/cache/org.apache.spark/spark-avro_2.12/jars/spark-avro_2.12-3.1.2.jar") \
        .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    spark = builder.getOrCreate()
    sc = spark.sparkContext

    basePath = "file:///tmp/hudi_trips_cow"
    # pyspark
    tripsSnapshotDF = spark. \
        read. \
        format("hudi"). \
        load(basePath)
    # load(basePath) use "/partitionKey=partitionValue" folder structure for Spark auto partition discovery
    count = tripsSnapshotDF.count()
    print(f'======== hudi_trips_snapshot contains [{count}] rows in total')
    print('Schema:')
    tripsSnapshotDF.printSchema()
    tripsSnapshotDF.show()

    tripsSnapshotDF.createOrReplaceTempView("hudi_trips_snapshot")
    spark.sql("select fare, begin_lon, begin_lat, ts from hudi_trips_snapshot where fare > 20.0").show()
    spark.sql(
        "select _hoodie_commit_time, _hoodie_record_key, _hoodie_partition_path, rider, driver, fare from hudi_trips_snapshot").show()
Query results
/Users/gavin/PycharmProjects/pythonProject/venv/bin/python /Users/gavin/PycharmProjects/pythonProject/venv/spark/hudi/basic_query.py
22/03/01 11:18:26 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
22/03/01 11:18:27 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
======== hudi_trips_snapshot contains [10] rows in total
Schema:
root
|-- _hoodie_commit_time: string (nullable = true)
|-- _hoodie_commit_seqno: string (nullable = true)
|-- _hoodie_record_key: string (nullable = true)
|-- _hoodie_partition_path: string (nullable = true)
|-- _hoodie_file_name: string (nullable = true)
|-- begin_lat: double (nullable = true)
|-- begin_lon: double (nullable = true)
|-- driver: string (nullable = true)
|-- end_lat: double (nullable = true)
|-- end_lon: double (nullable = true)
|-- fare: double (nullable = true)
|-- rider: string (nullable = true)
|-- ts: long (nullable = true)
|-- uuid: string (nullable = true)
|-- partitionpath: string (nullable = true)
+-------------------+--------------------+--------------------+----------------------+--------------------+-------------------+-------------------+----------+-------------------+-------------------+------------------+---------+-------------+--------------------+--------------------+
|_hoodie_commit_time|_hoodie_commit_seqno| _hoodie_record_key|_hoodie_partition_path| _hoodie_file_name| begin_lat| begin_lon| driver| end_lat| end_lon| fare| rider| ts| uuid| partitionpath|
+-------------------+--------------------+--------------------+----------------------+--------------------+-------------------+-------------------+----------+-------------------+-------------------+------------------+---------+-------------+--------------------+--------------------+
| 20220301105108074|20220301105108074...|c4340e2c-efd2-4a9...| americas/united_s...|52a5ee08-9376-495...|0.21624150367601136|0.14285051259466197|driver-213| 0.5890949624813784| 0.0966823831927115| 93.56018115236618|rider-213|1645736908516|c4340e2c-efd2-4a9...|americas/united_s...|
| 20220301105108074|20220301105108074...|67ee90ec-d7b8-477...| americas/united_s...|52a5ee08-9376-495...| 0.5731835407930634| 0.4923479652912024|driver-213|0.08988581780930216|0.42520899698713666| 64.27696295884016|rider-213|1645565690012|67ee90ec-d7b8-477...|americas/united_s...|
| 20220301105108074|20220301105108074...|91703076-f580-49f...| americas/united_s...|52a5ee08-9376-495...|0.11488393157088261| 0.6273212202489661|driver-213| 0.7454678537511295| 0.3954939864908973| 27.79478688582596|rider-213|1646031306513|91703076-f580-49f...|americas/united_s...|
| 20220301105108074|20220301105108074...|96a7571e-1e54-4bc...| americas/united_s...|52a5ee08-9376-495...| 0.8742041526408587| 0.7528268153249502|driver-213| 0.9197827128888302| 0.362464770874404|19.179139106643607|rider-213|1645796169470|96a7571e-1e54-4bc...|americas/united_s...|
| 20220301105108074|20220301105108074...|3723b4ac-8841-4cd...| americas/united_s...|52a5ee08-9376-495...| 0.1856488085068272| 0.9694586417848392|driver-213|0.38186367037201974|0.25252652214479043| 33.92216483948643|rider-213|1646085368961|3723b4ac-8841-4cd...|americas/united_s...|
| 20220301105108074|20220301105108074...|b3bf0b93-768d-4be...| americas/brazil/s...|6f82f351-9994-459...| 0.6100070562136587| 0.8779402295427752|driver-213| 0.3407870505929602| 0.5030798142293655| 43.4923811219014|rider-213|1645868768394|b3bf0b93-768d-4be...|americas/brazil/s...|
| 20220301105108074|20220301105108074...|7e195e8d-c6df-4fd...| americas/brazil/s...|6f82f351-9994-459...| 0.4726905879569653|0.46157858450465483|driver-213| 0.754803407008858| 0.9671159942018241|34.158284716382845|rider-213|1645602479789|7e195e8d-c6df-4fd...|americas/brazil/s...|
| 20220301105108074|20220301105108074...|3409ecd2-02c2-40c...| americas/brazil/s...|6f82f351-9994-459...| 0.0750588760043035|0.03844104444445928|driver-213|0.04376353354538354| 0.6346040067610669| 66.62084366450246|rider-213|1645621352954|3409ecd2-02c2-40c...|americas/brazil/s...|
| 20220301105108074|20220301105108074...|60903a00-3fdc-45d...| asia/india/chennai|2f5b659d-3738-48c...| 0.651058505660742| 0.8192868687714224|driver-213|0.20714896002914462|0.06224031095826987| 41.06290929046368|rider-213|1645503078006|60903a00-3fdc-45d...| asia/india/chennai|
| 20220301105108074|20220301105108074...|22d1507b-7d02-402...| asia/india/chennai|2f5b659d-3738-48c...| 0.40613510977307| 0.5644092139040959|driver-213| 0.798706304941517|0.02698359227182834|17.851135255091155|rider-213|1645948641664|22d1507b-7d02-402...| asia/india/chennai|
+-------------------+--------------------+--------------------+----------------------+--------------------+-------------------+-------------------+----------+-------------------+-------------------+------------------+---------+-------------+--------------------+--------------------+
+------------------+-------------------+-------------------+-------------+
| fare| begin_lon| begin_lat| ts|
+------------------+-------------------+-------------------+-------------+
| 93.56018115236618|0.14285051259466197|0.21624150367601136|1645736908516|
| 64.27696295884016| 0.4923479652912024| 0.5731835407930634|1645565690012|
| 27.79478688582596| 0.6273212202489661|0.11488393157088261|1646031306513|
| 33.92216483948643| 0.9694586417848392| 0.1856488085068272|1646085368961|
| 43.4923811219014| 0.8779402295427752| 0.6100070562136587|1645868768394|
|34.158284716382845|0.46157858450465483| 0.4726905879569653|1645602479789|
| 66.62084366450246|0.03844104444445928| 0.0750588760043035|1645621352954|
| 41.06290929046368| 0.8192868687714224| 0.651058505660742|1645503078006|
+------------------+-------------------+-------------------+-------------+
+-------------------+--------------------+----------------------+---------+----------+------------------+
|_hoodie_commit_time| _hoodie_record_key|_hoodie_partition_path| rider| driver| fare|
+-------------------+--------------------+----------------------+---------+----------+------------------+
| 20220301105108074|c4340e2c-efd2-4a9...| americas/united_s...|rider-213|driver-213| 93.56018115236618|
| 20220301105108074|67ee90ec-d7b8-477...| americas/united_s...|rider-213|driver-213| 64.27696295884016|
| 20220301105108074|91703076-f580-49f...| americas/united_s...|rider-213|driver-213| 27.79478688582596|
| 20220301105108074|96a7571e-1e54-4bc...| americas/united_s...|rider-213|driver-213|19.179139106643607|
| 20220301105108074|3723b4ac-8841-4cd...| americas/united_s...|rider-213|driver-213| 33.92216483948643|
| 20220301105108074|b3bf0b93-768d-4be...| americas/brazil/s...|rider-213|driver-213| 43.4923811219014|
| 20220301105108074|7e195e8d-c6df-4fd...| americas/brazil/s...|rider-213|driver-213|34.158284716382845|
| 20220301105108074|3409ecd2-02c2-40c...| americas/brazil/s...|rider-213|driver-213| 66.62084366450246|
| 20220301105108074|60903a00-3fdc-45d...| asia/india/chennai|rider-213|driver-213| 41.06290929046368|
| 20220301105108074|22d1507b-7d02-402...| asia/india/chennai|rider-213|driver-213|17.851135255091155|
+-------------------+--------------------+----------------------+---------+----------+------------------+
Process finished with exit code 0
Query code (time travel)
import pyspark

if __name__ == '__main__':
    builder = pyspark.sql.SparkSession.builder.appName("MyApp") \
        .config("spark.jars",
                "/Users/gavin/.ivy2/cache/org.apache.hudi/hudi-spark3.1.2-bundle_2.12/jars/hudi-spark3.1.2-bundle_2.12-0.10.1.jar,"
                "/Users/gavin/.ivy2/cache/org.apache.spark/spark-avro_2.12/jars/spark-avro_2.12-3.1.2.jar") \
        .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    spark = builder.getOrCreate()
    sc = spark.sparkContext

    basePath = "file:///tmp/hudi_trips_cow"

    # Three ways to specify the time-travel instant
    # pyspark
    spark.read. \
        format("hudi"). \
        option("as.of.instant", "20210728141108"). \
        load(basePath).show()

    spark.read. \
        format("hudi"). \
        option("as.of.instant", "2022-02-28 14:11:08.000"). \
        load(basePath).show()

    # It is equal to "as.of.instant = 2022-07-28 00:00:00"
    spark.read. \
        format("hudi"). \
        option("as.of.instant", "2022-07-28"). \
        load(basePath).show()
PS: the data was written on 20220301 (instant 20220301105108074), so the first two queries use instants earlier than the commit and return empty results, while the third ("2022-07-28") is later than the commit and returns all of the data.
Query results
/Users/gavin/PycharmProjects/pythonProject/venv/bin/python /Users/gavin/PycharmProjects/pythonProject/venv/spark/hudi/basic_query_time_travel.py
22/03/01 11:30:09 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
22/03/01 11:30:10 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
+-------------------+--------------------+------------------+----------------------+-----------------+---------+---------+------+-------+-------+----+-----+---+----+-------------+
|_hoodie_commit_time|_hoodie_commit_seqno|_hoodie_record_key|_hoodie_partition_path|_hoodie_file_name|begin_lat|begin_lon|driver|end_lat|end_lon|fare|rider| ts|uuid|partitionpath|
+-------------------+--------------------+------------------+----------------------+-----------------+---------+---------+------+-------+-------+----+-----+---+----+-------------+
+-------------------+--------------------+------------------+----------------------+-----------------+---------+---------+------+-------+-------+----+-----+---+----+-------------+
+-------------------+--------------------+------------------+----------------------+-----------------+---------+---------+------+-------+-------+----+-----+---+----+-------------+
|_hoodie_commit_time|_hoodie_commit_seqno|_hoodie_record_key|_hoodie_partition_path|_hoodie_file_name|begin_lat|begin_lon|driver|end_lat|end_lon|fare|rider| ts|uuid|partitionpath|
+-------------------+--------------------+------------------+----------------------+-----------------+---------+---------+------+-------+-------+----+-----+---+----+-------------+
+-------------------+--------------------+------------------+----------------------+-----------------+---------+---------+------+-------+-------+----+-----+---+----+-------------+
+-------------------+--------------------+--------------------+----------------------+--------------------+-------------------+-------------------+----------+-------------------+-------------------+------------------+---------+-------------+--------------------+--------------------+
|_hoodie_commit_time|_hoodie_commit_seqno| _hoodie_record_key|_hoodie_partition_path| _hoodie_file_name| begin_lat| begin_lon| driver| end_lat| end_lon| fare| rider| ts| uuid| partitionpath|
+-------------------+--------------------+--------------------+----------------------+--------------------+-------------------+-------------------+----------+-------------------+-------------------+------------------+---------+-------------+--------------------+--------------------+
| 20220301105108074|20220301105108074...|c4340e2c-efd2-4a9...| americas/united_s...|52a5ee08-9376-495...|0.21624150367601136|0.14285051259466197|driver-213| 0.5890949624813784| 0.0966823831927115| 93.56018115236618|rider-213|1645736908516|c4340e2c-efd2-4a9...|americas/united_s...|
| 20220301105108074|20220301105108074...|67ee90ec-d7b8-477...| americas/united_s...|52a5ee08-9376-495...| 0.5731835407930634| 0.4923479652912024|driver-213|0.08988581780930216|0.42520899698713666| 64.27696295884016|rider-213|1645565690012|67ee90ec-d7b8-477...|americas/united_s...|
| 20220301105108074|20220301105108074...|91703076-f580-49f...| americas/united_s...|52a5ee08-9376-495...|0.11488393157088261| 0.6273212202489661|driver-213| 0.7454678537511295| 0.3954939864908973| 27.79478688582596|rider-213|1646031306513|91703076-f580-49f...|americas/united_s...|
| 20220301105108074|20220301105108074...|96a7571e-1e54-4bc...| americas/united_s...|52a5ee08-9376-495...| 0.8742041526408587| 0.7528268153249502|driver-213| 0.9197827128888302| 0.362464770874404|19.179139106643607|rider-213|1645796169470|96a7571e-1e54-4bc...|americas/united_s...|
| 20220301105108074|20220301105108074...|3723b4ac-8841-4cd...| americas/united_s...|52a5ee08-9376-495...| 0.1856488085068272| 0.9694586417848392|driver-213|0.38186367037201974|0.25252652214479043| 33.92216483948643|rider-213|1646085368961|3723b4ac-8841-4cd...|americas/united_s...|
| 20220301105108074|20220301105108074...|b3bf0b93-768d-4be...| americas/brazil/s...|6f82f351-9994-459...| 0.6100070562136587| 0.8779402295427752|driver-213| 0.3407870505929602| 0.5030798142293655| 43.4923811219014|rider-213|1645868768394|b3bf0b93-768d-4be...|americas/brazil/s...|
| 20220301105108074|20220301105108074...|7e195e8d-c6df-4fd...| americas/brazil/s...|6f82f351-9994-459...| 0.4726905879569653|0.46157858450465483|driver-213| 0.754803407008858| 0.9671159942018241|34.158284716382845|rider-213|1645602479789|7e195e8d-c6df-4fd...|americas/brazil/s...|
| 20220301105108074|20220301105108074...|3409ecd2-02c2-40c...| americas/brazil/s...|6f82f351-9994-459...| 0.0750588760043035|0.03844104444445928|driver-213|0.04376353354538354| 0.6346040067610669| 66.62084366450246|rider-213|1645621352954|3409ecd2-02c2-40c...|americas/brazil/s...|
| 20220301105108074|20220301105108074...|60903a00-3fdc-45d...| asia/india/chennai|2f5b659d-3738-48c...| 0.651058505660742| 0.8192868687714224|driver-213|0.20714896002914462|0.06224031095826987| 41.06290929046368|rider-213|1645503078006|60903a00-3fdc-45d...| asia/india/chennai|
| 20220301105108074|20220301105108074...|22d1507b-7d02-402...| asia/india/chennai|2f5b659d-3738-48c...| 0.40613510977307| 0.5644092139040959|driver-213| 0.798706304941517|0.02698359227182834|17.851135255091155|rider-213|1645948641664|22d1507b-7d02-402...| asia/india/chennai|
+-------------------+--------------------+--------------------+----------------------+--------------------+-------------------+-------------------+----------+-------------------+-------------------+------------------+---------+-------------+--------------------+--------------------+
Process finished with exit code 0
Code (update data)
import pyspark

if __name__ == '__main__':
    builder = pyspark.sql.SparkSession.builder.appName("MyApp") \
        .config("spark.jars",
                "/Users/gavin/.ivy2/cache/org.apache.hudi/hudi-spark3.1.2-bundle_2.12/jars/hudi-spark3.1.2-bundle_2.12-0.10.1.jar,"
                "/Users/gavin/.ivy2/cache/org.apache.spark/spark-avro_2.12/jars/spark-avro_2.12-3.1.2.jar") \
        .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    spark = builder.getOrCreate()
    sc = spark.sparkContext

    basePath = "file:///tmp/hudi_trips_cow"
    dataGen = sc._jvm.org.apache.hudi.QuickstartUtils.DataGenerator()
    tableName = "hudi_trips_cow"

    hudi_options = {
        'hoodie.table.name': tableName,
        'hoodie.datasource.write.recordkey.field': 'uuid',
        'hoodie.datasource.write.partitionpath.field': 'partitionpath',
        'hoodie.datasource.write.table.name': tableName,
        'hoodie.datasource.write.operation': 'upsert',
        'hoodie.datasource.write.precombine.field': 'ts',
        'hoodie.upsert.shuffle.parallelism': 2,
        'hoodie.insert.shuffle.parallelism': 2
    }

    # Because the update step is split out from the insert step, calling dataGen.generateUpdates(10)
    # directly would fail; a generateInserts call has to be executed first.
    # Because of that split, the generated update records also differ from what you would get if
    # insert and update ran in the same script, but that does not matter much here.
    dataGen.generateInserts(10)

    # pyspark
    updates = sc._jvm.org.apache.hudi.QuickstartUtils.convertToStringList(dataGen.generateUpdates(10))
    df = spark.read.json(spark.sparkContext.parallelize(updates, 2))
    df.show()
    df.write.format("hudi"). \
        options(**hudi_options). \
        mode("append"). \
        save(basePath)
Run log
/Users/gavin/PycharmProjects/pythonProject/venv/bin/python /Users/gavin/PycharmProjects/pythonProject/venv/spark/hudi/basic_update.py
22/03/01 13:28:17 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
22/03/01 13:28:18 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
+--------------------+--------------------+----------+-------------------+-------------------+------------------+--------------------+---------+-------------+--------------------+
| begin_lat| begin_lon| driver| end_lat| end_lon| fare| partitionpath| rider| ts| uuid|
+--------------------+--------------------+----------+-------------------+-------------------+------------------+--------------------+---------+-------------+--------------------+
| 0.7340133901254792| 0.5142184937933181|driver-284| 0.7814655558162802| 0.6592596683641996|49.527694252432056|americas/united_s...|rider-284|1645635237429|0456f152-9a1b-48f...|
| 0.1593867607188556|0.010872312870502165|driver-284| 0.9808530350038475| 0.7963756520507014| 29.47661370147079|americas/brazil/s...|rider-284|1645899391613|1e34b971-3dfc-489...|
| 0.7180196467760873| 0.13755354862499358|driver-284| 0.3037264771699937| 0.2539047155055727| 86.75932789048282|americas/brazil/s...|rider-284|1645890122334|1e34b971-3dfc-489...|
| 0.6570857443423376| 0.888493603696927|driver-284| 0.9036309069576131|0.37603706507284995| 63.72504913279929|americas/brazil/s...|rider-284|1645547517087|8d784cd0-02d9-429...|
| 0.08528650347654165| 0.4006983139989222|driver-284| 0.1975324518739051| 0.908216792146506| 90.25710109008239| asia/india/chennai|rider-284|1646095456906|bc2e551e-4206-4f0...|
| 0.18294079059016366| 0.19949323322922063|driver-284|0.24749642418050566| 0.1751761658135068| 90.9053809533154|americas/united_s...|rider-284|1645675773158|da50d4f5-94cb-41c...|
| 0.4777395067707303| 0.3349917833248327|driver-284| 0.9735699951963335| 0.8144901865212508| 98.3428192817987|americas/united_s...|rider-284|1646066699577|a24084ea-4473-459...|
|0.014159831486388885| 0.42849372303000655|driver-284| 0.9968531966280192| 0.9451993293955782| 2.375516772415698|americas/united_s...|rider-284|1645728852563|da50d4f5-94cb-41c...|
| 0.16603428449020086| 0.6999655248704163|driver-284| 0.5086437188581894| 0.6242134749327686| 9.384124531808036| asia/india/chennai|rider-284|1645620049479|9cf010a9-7303-4c5...|
| 0.2110206104048945| 0.2783086084578943|driver-284|0.12154541219767523| 0.8700506703716298| 91.99515909032544|americas/brazil/s...|rider-284|1645773817699|1e34b971-3dfc-489...|
+--------------------+--------------------+----------+-------------------+-------------------+------------------+--------------------+---------+-------------+--------------------+
22/03/01 13:28:27 WARN DFSPropertiesConfiguration: Cannot find HUDI_CONF_DIR, please set it as the dir of hudi-defaults.conf
22/03/01 13:28:27 WARN DFSPropertiesConfiguration: Properties file file:/etc/hudi/conf/hudi-defaults.conf not found. Ignoring to load props file
Process finished with exit code 0
Directory structure after the update
# Before the update
gavin@GavindeMacBook-Pro .hoodie % ll
total 32
-rw-r--r-- 1 gavin wheel 4374 Mar 1 10:51 20220301105108074.commit
-rw-r--r-- 1 gavin wheel 0 Mar 1 10:51 20220301105108074.commit.requested
-rw-r--r-- 1 gavin wheel 2594 Mar 1 10:51 20220301105108074.inflight
drwxr-xr-x 2 gavin wheel 64 Mar 1 10:51 archived
-rw-r--r-- 1 gavin wheel 600 Mar 1 10:51 hoodie.properties
# After the update
gavin@GavindeMacBook-Pro .hoodie % ll
total 56
-rw-r--r-- 1 gavin wheel 4374 Mar 1 10:51 20220301105108074.commit
-rw-r--r-- 1 gavin wheel 0 Mar 1 10:51 20220301105108074.commit.requested
-rw-r--r-- 1 gavin wheel 2594 Mar 1 10:51 20220301105108074.inflight
-rw-r--r-- 1 gavin wheel 4413 Mar 1 13:28 20220301132827300.commit
-rw-r--r-- 1 gavin wheel 0 Mar 1 13:28 20220301132827300.commit.requested
-rw-r--r-- 1 gavin wheel 2594 Mar 1 13:28 20220301132827300.inflight
drwxr-xr-x 2 gavin wheel 64 Mar 1 10:51 archived
-rw-r--r-- 1 gavin wheel 600 Mar 1 10:51 hoodie.properties
gavin@GavindeMacBook-Pro .hoodie %
# Before the update
gavin@GavindeMacBook-Pro apache % tree /tmp/hudi_trips_cow
/tmp/hudi_trips_cow
├── americas
│ ├── brazil
│ │ └── sao_paulo
│ │ └── 6f82f351-9994-459d-a20c-77baa91ad323-0_0-27-31_20220301105108074.parquet
│ └── united_states
│ └── san_francisco
│ └── 52a5ee08-9376-4954-bb8f-f7f519b8b40e-0_1-33-32_20220301105108074.parquet
└── asia
└── india
└── chennai
└── 2f5b659d-3738-48ca-b590-bbce52e98642-0_2-33-33_20220301105108074.parquet
8 directories, 3 files
gavin@GavindeMacBook-Pro apache %
# After the update
gavin@GavindeMacBook-Pro hudi_trips_cow % tree ./*
./americas
├── brazil
│ └── sao_paulo
│ ├── 6f82f351-9994-459d-a20c-77baa91ad323-0_0-27-31_20220301105108074.parquet
│ └── 6f82f351-9994-459d-a20c-77baa91ad323-0_0-29-39_20220301132827300.parquet
└── united_states
└── san_francisco
├── 52a5ee08-9376-4954-bb8f-f7f519b8b40e-0_1-33-32_20220301105108074.parquet
└── 52a5ee08-9376-4954-bb8f-f7f519b8b40e-0_1-35-40_20220301132827300.parquet
./asia
└── india
└── chennai
├── 2f5b659d-3738-48ca-b590-bbce52e98642-0_2-33-33_20220301105108074.parquet
└── 2f5b659d-3738-48ca-b590-bbce52e98642-0_2-35-41_20220301132827300.parquet
6 directories, 6 files
gavin@GavindeMacBook-Pro hudi_trips_cow %
As an extra check, run another query to see whether the row count has changed:
/Users/gavin/PycharmProjects/pythonProject/venv/bin/python /Users/gavin/PycharmProjects/pythonProject/venv/spark/hudi/basic_query.py
22/03/01 13:34:55 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
22/03/01 13:34:56 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
======== hudi_trips_snapshot contains [17] rows in total
+-------------------+--------------------+--------------------+----------------------+--------------------+--------------------+--------------------+----------+-------------------+-------------------+------------------+---------+-------------+--------------------+--------------------+
|_hoodie_commit_time|_hoodie_commit_seqno| _hoodie_record_key|_hoodie_partition_path| _hoodie_file_name| begin_lat| begin_lon| driver| end_lat| end_lon| fare| rider| ts| uuid| partitionpath|
+-------------------+--------------------+--------------------+----------------------+--------------------+--------------------+--------------------+----------+-------------------+-------------------+------------------+---------+-------------+--------------------+--------------------+
| 20220301105108074|20220301105108074...|c4340e2c-efd2-4a9...| americas/united_s...|52a5ee08-9376-495...| 0.21624150367601136| 0.14285051259466197|driver-213| 0.5890949624813784| 0.0966823831927115| 93.56018115236618|rider-213|1645736908516|c4340e2c-efd2-4a9...|americas/united_s...|
| 20220301105108074|20220301105108074...|67ee90ec-d7b8-477...| americas/united_s...|52a5ee08-9376-495...| 0.5731835407930634| 0.4923479652912024|driver-213|0.08988581780930216|0.42520899698713666| 64.27696295884016|rider-213|1645565690012|67ee90ec-d7b8-477...|americas/united_s...|
| 20220301105108074|20220301105108074...|91703076-f580-49f...| americas/united_s...|52a5ee08-9376-495...| 0.11488393157088261| 0.6273212202489661|driver-213| 0.7454678537511295| 0.3954939864908973| 27.79478688582596|rider-213|1646031306513|91703076-f580-49f...|americas/united_s...|
| 20220301105108074|20220301105108074...|96a7571e-1e54-4bc...| americas/united_s...|52a5ee08-9376-495...| 0.8742041526408587| 0.7528268153249502|driver-213| 0.9197827128888302| 0.362464770874404|19.179139106643607|rider-213|1645796169470|96a7571e-1e54-4bc...|americas/united_s...|
| 20220301105108074|20220301105108074...|3723b4ac-8841-4cd...| americas/united_s...|52a5ee08-9376-495...| 0.1856488085068272| 0.9694586417848392|driver-213|0.38186367037201974|0.25252652214479043| 33.92216483948643|rider-213|1646085368961|3723b4ac-8841-4cd...|americas/united_s...|
| 20220301132827300|20220301132827300...|a24084ea-4473-459...| americas/united_s...|52a5ee08-9376-495...| 0.4777395067707303| 0.3349917833248327|driver-284| 0.9735699951963335| 0.8144901865212508| 98.3428192817987|rider-284|1646066699577|a24084ea-4473-459...|americas/united_s...|
| 20220301132827300|20220301132827300...|0456f152-9a1b-48f...| americas/united_s...|52a5ee08-9376-495...| 0.7340133901254792| 0.5142184937933181|driver-284| 0.7814655558162802| 0.6592596683641996|49.527694252432056|rider-284|1645635237429|0456f152-9a1b-48f...|americas/united_s...|
| 20220301132827300|20220301132827300...|da50d4f5-94cb-41c...| americas/united_s...|52a5ee08-9376-495...|0.014159831486388885| 0.42849372303000655|driver-284| 0.9968531966280192| 0.9451993293955782| 2.375516772415698|rider-284|1645728852563|da50d4f5-94cb-41c...|americas/united_s...|
| 20220301105108074|20220301105108074...|b3bf0b93-768d-4be...| americas/brazil/s...|6f82f351-9994-459...| 0.6100070562136587| 0.8779402295427752|driver-213| 0.3407870505929602| 0.5030798142293655| 43.4923811219014|rider-213|1645868768394|b3bf0b93-768d-4be...|americas/brazil/s...|
| 20220301105108074|20220301105108074...|7e195e8d-c6df-4fd...| americas/brazil/s...|6f82f351-9994-459...| 0.4726905879569653| 0.46157858450465483|driver-213| 0.754803407008858| 0.9671159942018241|34.158284716382845|rider-213|1645602479789|7e195e8d-c6df-4fd...|americas/brazil/s...|
| 20220301105108074|20220301105108074...|3409ecd2-02c2-40c...| americas/brazil/s...|6f82f351-9994-459...| 0.0750588760043035| 0.03844104444445928|driver-213|0.04376353354538354| 0.6346040067610669| 66.62084366450246|rider-213|1645621352954|3409ecd2-02c2-40c...|americas/brazil/s...|
| 20220301132827300|20220301132827300...|8d784cd0-02d9-429...| americas/brazil/s...|6f82f351-9994-459...| 0.6570857443423376| 0.888493603696927|driver-284| 0.9036309069576131|0.37603706507284995| 63.72504913279929|rider-284|1645547517087|8d784cd0-02d9-429...|americas/brazil/s...|
| 20220301132827300|20220301132827300...|1e34b971-3dfc-489...| americas/brazil/s...|6f82f351-9994-459...| 0.1593867607188556|0.010872312870502165|driver-284| 0.9808530350038475| 0.7963756520507014| 29.47661370147079|rider-284|1645899391613|1e34b971-3dfc-489...|americas/brazil/s...|
| 20220301105108074|20220301105108074...|60903a00-3fdc-45d...| asia/india/chennai|2f5b659d-3738-48c...| 0.651058505660742| 0.8192868687714224|driver-213|0.20714896002914462|0.06224031095826987| 41.06290929046368|rider-213|1645503078006|60903a00-3fdc-45d...| asia/india/chennai|
| 20220301105108074|20220301105108074...|22d1507b-7d02-402...| asia/india/chennai|2f5b659d-3738-48c...| 0.40613510977307| 0.5644092139040959|driver-213| 0.798706304941517|0.02698359227182834|17.851135255091155|rider-213|1645948641664|22d1507b-7d02-402...| asia/india/chennai|
| 20220301132827300|20220301132827300...|bc2e551e-4206-4f0...| asia/india/chennai|2f5b659d-3738-48c...| 0.08528650347654165| 0.4006983139989222|driver-284| 0.1975324518739051| 0.908216792146506| 90.25710109008239|rider-284|1646095456906|bc2e551e-4206-4f0...| asia/india/chennai|
| 20220301132827300|20220301132827300...|9cf010a9-7303-4c5...| asia/india/chennai|2f5b659d-3738-48c...| 0.16603428449020086| 0.6999655248704163|driver-284| 0.5086437188581894| 0.6242134749327686| 9.384124531808036|rider-284|1645620049479|9cf010a9-7303-4c5...| asia/india/chennai|
+-------------------+--------------------+--------------------+----------------------+--------------------+--------------------+--------------------+----------+-------------------+-------------------+------------------+---------+-------------+--------------------+--------------------+
Process finished with exit code 0
PS: 10 update records were generated, but the row count only grew by 7 (from 10 to 17 rows after the append-mode write). This is because Hudi's append is not a plain append: with the option 'hoodie.datasource.write.operation': 'upsert', the table is assumed to already exist and the new data is upserted into it, so records whose keys already exist are updated in place rather than added. The official docs put it as "In general, always use append mode unless you are trying to create the table for the first time". So when choosing between append and overwrite, use append unless you are creating the table for the first time.
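If a plain insert (no key-based merging) were wanted instead, the write operation could be switched; a sketch that was not run for this article, reusing hudi_options, df and basePath from the update example ('insert' and 'bulk_insert' are the other operation values listed at https://hudi.apache.org/docs/configurations):

```python
# Sketch only: switch the write operation from 'upsert' to 'insert'.
# With 'insert', incoming records are not merged against existing keys the way 'upsert' merges them.
insert_options = dict(hudi_options)
insert_options['hoodie.datasource.write.operation'] = 'insert'

df.write.format("hudi"). \
    options(**insert_options). \
    mode("append"). \
    save(basePath)
```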
Extension: querying a historical version
From the second commit record we can get an exact instant, 20220301132827300:
-rw-r--r-- 1 gavin wheel 4413 Mar 1 13:28 20220301132827300.commit
Then we can query the table as of that instant, and, by choosing a slightly earlier instant, the version before the update:
spark.read.format("hudi").option("as.of.instant", "20220301132827300").load(basePath).show()
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/__ / .__/\_,_/_/ /_/\_\ version 3.1.2
/_/
Using Python version 3.8.9 (default, Oct 26 2021 07:25:54)
Spark context Web UI available at http://192.168.24.227:4040
Spark context available as 'sc' (master = local[*], app id = local-1646101237379).
SparkSession available as 'spark'.
>>> basePath = "file:///tmp/hudi_trips_cow"
>>> df = spark.read.format("hudi").option("as.of.instant", "20220301132827300").load(basePath)
>>> df.count()
17
>>> df.show()
+-------------------+--------------------+--------------------+----------------------+--------------------+--------------------+--------------------+----------+-------------------+-------------------+------------------+---------+-------------+--------------------+--------------------+
|_hoodie_commit_time|_hoodie_commit_seqno| _hoodie_record_key|_hoodie_partition_path| _hoodie_file_name| begin_lat| begin_lon| driver| end_lat| end_lon| fare| rider| ts| uuid| partitionpath|
+-------------------+--------------------+--------------------+----------------------+--------------------+--------------------+--------------------+----------+-------------------+-------------------+------------------+---------+-------------+--------------------+--------------------+
| 20220301105108074|20220301105108074...|c4340e2c-efd2-4a9...| americas/united_s...|52a5ee08-9376-495...| 0.21624150367601136| 0.14285051259466197|driver-213| 0.5890949624813784| 0.0966823831927115| 93.56018115236618|rider-213|1645736908516|c4340e2c-efd2-4a9...|americas/united_s...|
| 20220301105108074|20220301105108074...|67ee90ec-d7b8-477...| americas/united_s...|52a5ee08-9376-495...| 0.5731835407930634| 0.4923479652912024|driver-213|0.08988581780930216|0.42520899698713666| 64.27696295884016|rider-213|1645565690012|67ee90ec-d7b8-477...|americas/united_s...|
| 20220301105108074|20220301105108074...|91703076-f580-49f...| americas/united_s...|52a5ee08-9376-495...| 0.11488393157088261| 0.6273212202489661|driver-213| 0.7454678537511295| 0.3954939864908973| 27.79478688582596|rider-213|1646031306513|91703076-f580-49f...|americas/united_s...|
| 20220301105108074|20220301105108074...|96a7571e-1e54-4bc...| americas/united_s...|52a5ee08-9376-495...| 0.8742041526408587| 0.7528268153249502|driver-213| 0.9197827128888302| 0.362464770874404|19.179139106643607|rider-213|1645796169470|96a7571e-1e54-4bc...|americas/united_s...|
| 20220301105108074|20220301105108074...|3723b4ac-8841-4cd...| americas/united_s...|52a5ee08-9376-495...| 0.1856488085068272| 0.9694586417848392|driver-213|0.38186367037201974|0.25252652214479043| 33.92216483948643|rider-213|1646085368961|3723b4ac-8841-4cd...|americas/united_s...|
| 20220301132827300|20220301132827300...|a24084ea-4473-459...| americas/united_s...|52a5ee08-9376-495...| 0.4777395067707303| 0.3349917833248327|driver-284| 0.9735699951963335| 0.8144901865212508| 98.3428192817987|rider-284|1646066699577|a24084ea-4473-459...|americas/united_s...|
| 20220301132827300|20220301132827300...|0456f152-9a1b-48f...| americas/united_s...|52a5ee08-9376-495...| 0.7340133901254792| 0.5142184937933181|driver-284| 0.7814655558162802| 0.6592596683641996|49.527694252432056|rider-284|1645635237429|0456f152-9a1b-48f...|americas/united_s...|
| 20220301132827300|20220301132827300...|da50d4f5-94cb-41c...| americas/united_s...|52a5ee08-9376-495...|0.014159831486388885| 0.42849372303000655|driver-284| 0.9968531966280192| 0.9451993293955782| 2.375516772415698|rider-284|1645728852563|da50d4f5-94cb-41c...|americas/united_s...|
| 20220301105108074|20220301105108074...|b3bf0b93-768d-4be...| americas/brazil/s...|6f82f351-9994-459...| 0.6100070562136587| 0.8779402295427752|driver-213| 0.3407870505929602| 0.5030798142293655| 43.4923811219014|rider-213|1645868768394|b3bf0b93-768d-4be...|americas/brazil/s...|
| 20220301105108074|20220301105108074...|7e195e8d-c6df-4fd...| americas/brazil/s...|6f82f351-9994-459...| 0.4726905879569653| 0.46157858450465483|driver-213| 0.754803407008858| 0.9671159942018241|34.158284716382845|rider-213|1645602479789|7e195e8d-c6df-4fd...|americas/brazil/s...|
| 20220301105108074|20220301105108074...|3409ecd2-02c2-40c...| americas/brazil/s...|6f82f351-9994-459...| 0.0750588760043035| 0.03844104444445928|driver-213|0.04376353354538354| 0.6346040067610669| 66.62084366450246|rider-213|1645621352954|3409ecd2-02c2-40c...|americas/brazil/s...|
| 20220301132827300|20220301132827300...|8d784cd0-02d9-429...| americas/brazil/s...|6f82f351-9994-459...| 0.6570857443423376| 0.888493603696927|driver-284| 0.9036309069576131|0.37603706507284995| 63.72504913279929|rider-284|1645547517087|8d784cd0-02d9-429...|americas/brazil/s...|
| 20220301132827300|20220301132827300...|1e34b971-3dfc-489...| americas/brazil/s...|6f82f351-9994-459...| 0.1593867607188556|0.010872312870502165|driver-284| 0.9808530350038475| 0.7963756520507014| 29.47661370147079|rider-284|1645899391613|1e34b971-3dfc-489...|americas/brazil/s...|
| 20220301105108074|20220301105108074...|60903a00-3fdc-45d...| asia/india/chennai|2f5b659d-3738-48c...| 0.651058505660742| 0.8192868687714224|driver-213|0.20714896002914462|0.06224031095826987| 41.06290929046368|rider-213|1645503078006|60903a00-3fdc-45d...| asia/india/chennai|
| 20220301105108074|20220301105108074...|22d1507b-7d02-402...| asia/india/chennai|2f5b659d-3738-48c...| 0.40613510977307| 0.5644092139040959|driver-213| 0.798706304941517|0.02698359227182834|17.851135255091155|rider-213|1645948641664|22d1507b-7d02-402...| asia/india/chennai|
| 20220301132827300|20220301132827300...|bc2e551e-4206-4f0...| asia/india/chennai|2f5b659d-3738-48c...| 0.08528650347654165| 0.4006983139989222|driver-284| 0.1975324518739051| 0.908216792146506| 90.25710109008239|rider-284|1646095456906|bc2e551e-4206-4f0...| asia/india/chennai|
| 20220301132827300|20220301132827300...|9cf010a9-7303-4c5...| asia/india/chennai|2f5b659d-3738-48c...| 0.16603428449020086| 0.6999655248704163|driver-284| 0.5086437188581894| 0.6242134749327686| 9.384124531808036|rider-284|1645620049479|9cf010a9-7303-4c5...| asia/india/chennai|
+-------------------+--------------------+--------------------+----------------------+--------------------+--------------------+--------------------+----------+-------------------+-------------------+------------------+---------+-------------+--------------------+--------------------+
## Query the data before the update: change the instant 20220301132827300 -> 20220301132527300
>>> df_before_update = spark.read.format("hudi").option("as.of.instant", "20220301132527300").load(basePath)
>>> df_before_update.count()
10
>>> df_before_update.show()
+-------------------+--------------------+--------------------+----------------------+--------------------+-------------------+-------------------+----------+-------------------+-------------------+------------------+---------+-------------+--------------------+--------------------+
|_hoodie_commit_time|_hoodie_commit_seqno| _hoodie_record_key|_hoodie_partition_path| _hoodie_file_name| begin_lat| begin_lon| driver| end_lat| end_lon| fare| rider| ts| uuid| partitionpath|
+-------------------+--------------------+--------------------+----------------------+--------------------+-------------------+-------------------+----------+-------------------+-------------------+------------------+---------+-------------+--------------------+--------------------+
| 20220301105108074|20220301105108074...|c4340e2c-efd2-4a9...| americas/united_s...|52a5ee08-9376-495...|0.21624150367601136|0.14285051259466197|driver-213| 0.5890949624813784| 0.0966823831927115| 93.56018115236618|rider-213|1645736908516|c4340e2c-efd2-4a9...|americas/united_s...|
| 20220301105108074|20220301105108074...|67ee90ec-d7b8-477...| americas/united_s...|52a5ee08-9376-495...| 0.5731835407930634| 0.4923479652912024|driver-213|0.08988581780930216|0.42520899698713666| 64.27696295884016|rider-213|1645565690012|67ee90ec-d7b8-477...|americas/united_s...|
| 20220301105108074|20220301105108074...|91703076-f580-49f...| americas/united_s...|52a5ee08-9376-495...|0.11488393157088261| 0.6273212202489661|driver-213| 0.7454678537511295| 0.3954939864908973| 27.79478688582596|rider-213|1646031306513|91703076-f580-49f...|americas/united_s...|
| 20220301105108074|20220301105108074...|96a7571e-1e54-4bc...| americas/united_s...|52a5ee08-9376-495...| 0.8742041526408587| 0.7528268153249502|driver-213| 0.9197827128888302| 0.362464770874404|19.179139106643607|rider-213|1645796169470|96a7571e-1e54-4bc...|americas/united_s...|
| 20220301105108074|20220301105108074...|3723b4ac-8841-4cd...| americas/united_s...|52a5ee08-9376-495...| 0.1856488085068272| 0.9694586417848392|driver-213|0.38186367037201974|0.25252652214479043| 33.92216483948643|rider-213|1646085368961|3723b4ac-8841-4cd...|americas/united_s...|
| 20220301105108074|20220301105108074...|b3bf0b93-768d-4be...| americas/brazil/s...|6f82f351-9994-459...| 0.6100070562136587| 0.8779402295427752|driver-213| 0.3407870505929602| 0.5030798142293655| 43.4923811219014|rider-213|1645868768394|b3bf0b93-768d-4be...|americas/brazil/s...|
| 20220301105108074|20220301105108074...|7e195e8d-c6df-4fd...| americas/brazil/s...|6f82f351-9994-459...| 0.4726905879569653|0.46157858450465483|driver-213| 0.754803407008858| 0.9671159942018241|34.158284716382845|rider-213|1645602479789|7e195e8d-c6df-4fd...|americas/brazil/s...|
| 20220301105108074|20220301105108074...|3409ecd2-02c2-40c...| americas/brazil/s...|6f82f351-9994-459...| 0.0750588760043035|0.03844104444445928|driver-213|0.04376353354538354| 0.6346040067610669| 66.62084366450246|rider-213|1645621352954|3409ecd2-02c2-40c...|americas/brazil/s...|
| 20220301105108074|20220301105108074...|60903a00-3fdc-45d...| asia/india/chennai|2f5b659d-3738-48c...| 0.651058505660742| 0.8192868687714224|driver-213|0.20714896002914462|0.06224031095826987| 41.06290929046368|rider-213|1645503078006|60903a00-3fdc-45d...| asia/india/chennai|
| 20220301105108074|20220301105108074...|22d1507b-7d02-402...| asia/india/chennai|2f5b659d-3738-48c...| 0.40613510977307| 0.5644092139040959|driver-213| 0.798706304941517|0.02698359227182834|17.851135255091155|rider-213|1645948641664|22d1507b-7d02-402...| asia/india/chennai|
+-------------------+--------------------+--------------------+----------------------+--------------------+-------------------+-------------------+----------+-------------------+-------------------+------------------+---------+-------------+--------------------+--------------------+
Code (incremental query)
import pyspark

if __name__ == '__main__':
    builder = pyspark.sql.SparkSession.builder.appName("MyApp") \
        .config("spark.jars",
                "/Users/gavin/.ivy2/cache/org.apache.hudi/hudi-spark3.1.2-bundle_2.12/jars/hudi-spark3.1.2-bundle_2.12-0.10.1.jar,"
                "/Users/gavin/.ivy2/cache/org.apache.spark/spark-avro_2.12/jars/spark-avro_2.12-3.1.2.jar") \
        .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    spark = builder.getOrCreate()
    sc = spark.sparkContext

    # pyspark
    tableName = "hudi_trips_cow"
    basePath = "file:///tmp/hudi_trips_cow"

    # pyspark
    # reload data
    spark. \
        read. \
        format("hudi"). \
        load(basePath). \
        createOrReplaceTempView("hudi_trips_snapshot")

    # First fetch all existing commit times, then take the second-to-last one as the begin time of the
    # incremental query (leaving the end time unset means "everything after the begin time").
    commits = list(map(lambda row: row[0], spark.sql(
        "select distinct(_hoodie_commit_time) as commitTime from hudi_trips_snapshot order by commitTime").limit(
        50).collect()))
    print(f'get commit info [{commits}]')
    beginTime = commits[len(commits) - 2]  # commit time we are interested in
    print(f'set beginTime as [{beginTime}]')

    # incrementally query data
    incremental_read_options = {
        'hoodie.datasource.query.type': 'incremental',
        'hoodie.datasource.read.begin.instanttime': beginTime,
    }
    tripsIncrementalDF = spark.read.format("hudi"). \
        options(**incremental_read_options). \
        load(basePath)
    tripsIncrementalDF.createOrReplaceTempView("hudi_trips_incremental")

    spark.sql(
        "select `_hoodie_commit_time`, fare, begin_lon, begin_lat, ts from hudi_trips_incremental where fare > 20.0").show()
Execution results
/Users/gavin/PycharmProjects/pythonProject/venv/bin/python /Users/gavin/PycharmProjects/pythonProject/venv/spark/hudi/basic/basic_incremental_query.py
22/03/01 15:08:26 WARN Utils: Your hostname, GavindeMacBook-Pro.local resolves to a loopback address: 127.0.0.1; using 10.196.59.24 instead (on interface en0)
22/03/01 15:08:26 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
22/03/01 15:08:27 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
22/03/01 15:08:28 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
get commit info [['20220301105108074', '20220301132827300']]
set beginTime as [20220301105108074]
+-------------------+------------------+--------------------+-------------------+-------------+
|_hoodie_commit_time| fare| begin_lon| begin_lat| ts|
+-------------------+------------------+--------------------+-------------------+-------------+
| 20220301132827300| 98.3428192817987| 0.3349917833248327| 0.4777395067707303|1646066699577|
| 20220301132827300|49.527694252432056| 0.5142184937933181| 0.7340133901254792|1645635237429|
| 20220301132827300| 63.72504913279929| 0.888493603696927| 0.6570857443423376|1645547517087|
| 20220301132827300| 29.47661370147079|0.010872312870502165| 0.1593867607188556|1645899391613|
| 20220301132827300| 90.25710109008239| 0.4006983139989222|0.08528650347654165|1646095456906|
+-------------------+------------------+--------------------+-------------------+-------------+
Process finished with exit code 0
Code
import pyspark

if __name__ == '__main__':
    builder = pyspark.sql.SparkSession.builder.appName("MyApp") \
        .config("spark.jars",
                "/Users/gavin/.ivy2/cache/org.apache.hudi/hudi-spark3.1.2-bundle_2.12/jars/hudi-spark3.1.2-bundle_2.12-0.10.1.jar,"
                "/Users/gavin/.ivy2/cache/org.apache.spark/spark-avro_2.12/jars/spark-avro_2.12-3.1.2.jar") \
        .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    spark = builder.getOrCreate()
    sc = spark.sparkContext

    basePath = "file:///tmp/hudi_trips_cow"

    # reload data
    spark. \
        read. \
        format("hudi"). \
        load(basePath). \
        createOrReplaceTempView("hudi_trips_snapshot")

    commits = list(map(lambda row: row[0], spark.sql(
        "select distinct(_hoodie_commit_time) as commitTime from hudi_trips_snapshot order by commitTime").limit(
        50).collect()))
    endTime = commits[len(commits) - 2]
    beginTime = "000"  # Represents all commits > this time.

    # query point in time data
    point_in_time_read_options = {
        'hoodie.datasource.query.type': 'incremental',
        'hoodie.datasource.read.end.instanttime': endTime,
        'hoodie.datasource.read.begin.instanttime': beginTime
    }
    tripsPointInTimeDF = spark.read.format("hudi"). \
        options(**point_in_time_read_options). \
        load(basePath)
    tripsPointInTimeDF.createOrReplaceTempView("hudi_trips_point_in_time")

    spark.sql(
        "select `_hoodie_commit_time`, fare, begin_lon, begin_lat, ts from hudi_trips_point_in_time where fare > 20.0").show()
Run result
/Users/gavin/PycharmProjects/pythonProject/venv/bin/python /Users/gavin/PycharmProjects/pythonProject/venv/spark/hudi/basic/basic_point_in_time_query.py
22/03/01 15:18:24 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
22/03/01 15:18:24 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
+-------------------+------------------+-------------------+-------------------+-------------+
|_hoodie_commit_time| fare| begin_lon| begin_lat| ts|
+-------------------+------------------+-------------------+-------------------+-------------+
| 20220301105108074| 93.56018115236618|0.14285051259466197|0.21624150367601136|1645736908516|
| 20220301105108074| 64.27696295884016| 0.4923479652912024| 0.5731835407930634|1645565690012|
| 20220301105108074| 27.79478688582596| 0.6273212202489661|0.11488393157088261|1646031306513|
| 20220301105108074| 33.92216483948643| 0.9694586417848392| 0.1856488085068272|1646085368961|
| 20220301105108074| 43.4923811219014| 0.8779402295427752| 0.6100070562136587|1645868768394|
| 20220301105108074|34.158284716382845|0.46157858450465483| 0.4726905879569653|1645602479789|
| 20220301105108074| 66.62084366450246|0.03844104444445928| 0.0750588760043035|1645621352954|
| 20220301105108074| 41.06290929046368| 0.8192868687714224| 0.651058505660742|1645503078006|
+-------------------+------------------+-------------------+-------------------+-------------+
Process finished with exit code 0
In this example, two records are picked from the existing table, and a delete write keyed on those two records is then issued against it, removing them from the table.
Note: the delete operation can only be performed in append mode.
Code
import pyspark
from pyspark.sql.functions import lit

if __name__ == '__main__':
    builder = pyspark.sql.SparkSession.builder.appName("MyApp") \
        .config("spark.jars",
                "/Users/gavin/.ivy2/cache/org.apache.hudi/hudi-spark3.1.2-bundle_2.12/jars/hudi-spark3.1.2-bundle_2.12-0.10.1.jar,"
                "/Users/gavin/.ivy2/cache/org.apache.spark/spark-avro_2.12/jars/spark-avro_2.12-3.1.2.jar") \
        .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    spark = builder.getOrCreate()
    sc = spark.sparkContext

    tableName = "hudi_trips_cow"
    basePath = "file:///tmp/hudi_trips_cow"

    # reload data
    spark. \
        read. \
        format("hudi"). \
        load(basePath). \
        createOrReplaceTempView("hudi_trips_snapshot")

    before_count = spark.sql("select uuid, partitionpath from hudi_trips_snapshot").count()
    print(f'before delete , there exists [{before_count}] records')

    # fetch two records to be deleted
    ds = spark.sql("select uuid, partitionpath from hudi_trips_snapshot").limit(2)
    print(f'Records will be deleted :[{ds.collect()}]')

    # issue deletes
    hudi_delete_options = {
        'hoodie.table.name': tableName,
        'hoodie.datasource.write.recordkey.field': 'uuid',
        'hoodie.datasource.write.partitionpath.field': 'partitionpath',
        'hoodie.datasource.write.table.name': tableName,
        'hoodie.datasource.write.operation': 'delete',
        'hoodie.datasource.write.precombine.field': 'ts',
        'hoodie.upsert.shuffle.parallelism': 2,
        'hoodie.insert.shuffle.parallelism': 2
    }
    deletes = list(map(lambda row: (row[0], row[1]), ds.collect()))
    print(f'deletes data: [{deletes}]')

    # Append an extra 'ts' column to the DataFrame built from the delete keys
    df = spark.sparkContext.parallelize(deletes).toDF(['uuid', 'partitionpath']).withColumn('ts', lit(0.0))
    df.write.format("hudi"). \
        options(**hudi_delete_options). \
        mode("append"). \
        save(basePath)

    # run the same read query as above.
    roAfterDeleteViewDF = spark. \
        read. \
        format("hudi"). \
        load(basePath)
    roAfterDeleteViewDF.createOrReplaceTempView("hudi_trips_snapshot")

    # fetch should return (total - 2) records
    after_count = spark.sql("select uuid, partitionpath from hudi_trips_snapshot").count()
    print(f'after delete , there exists [{after_count}] records')
Run result
/Users/gavin/PycharmProjects/pythonProject/venv/bin/python /Users/gavin/PycharmProjects/pythonProject/venv/spark/hudi/basic/basic_delete.py
22/03/01 15:59:23 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
22/03/01 15:59:24 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
before delete , there exists [17] records
Records will be deleted :[[Row(uuid='c4340e2c-efd2-4a92-9615-32822599d397', partitionpath='americas/united_states/san_francisco'), Row(uuid='67ee90ec-d7b8-4772-b02e-6a41a7556fa0', partitionpath='americas/united_states/san_francisco')]]
deletes data: [[('c4340e2c-efd2-4a92-9615-32822599d397', 'americas/united_states/san_francisco'), ('67ee90ec-d7b8-4772-b02e-6a41a7556fa0', 'americas/united_states/san_francisco')]]
22/03/01 15:59:34 WARN DFSPropertiesConfiguration: Cannot find HUDI_CONF_DIR, please set it as the dir of hudi-defaults.conf
22/03/01 15:59:34 WARN DFSPropertiesConfiguration: Properties file file:/etc/hudi/conf/hudi-defaults.conf not found. Ignoring to load props file
after delete , there exists [15] records
Process finished with exit code 0
The previous sections used upsert-style writes; this one adds a pure insert example (operation 'insert', no updates) written with mode("overwrite"), which simply replaces the whole table.
Code
import pyspark

if __name__ == '__main__':
    builder = pyspark.sql.SparkSession.builder.appName("MyApp") \
        .config("spark.jars",
                "/Users/gavin/.ivy2/cache/org.apache.hudi/hudi-spark3.1.2-bundle_2.12/jars/hudi-spark3.1.2-bundle_2.12-0.10.1.jar,"
                "/Users/gavin/.ivy2/cache/org.apache.spark/spark-avro_2.12/jars/spark-avro_2.12-3.1.2.jar") \
        .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    spark = builder.getOrCreate()
    sc = spark.sparkContext

    tableName = "hudi_trips_cow_insert_overwirte"
    basePath = "file:///tmp/hudi_trips_cow_insert_overwirte"
    dataGen = sc._jvm.org.apache.hudi.QuickstartUtils.DataGenerator()

    hudi_options = {
        'hoodie.table.name': tableName,
        'hoodie.datasource.write.recordkey.field': 'uuid',
        'hoodie.datasource.write.partitionpath.field': 'partitionpath',
        'hoodie.datasource.write.table.name': tableName,
        'hoodie.datasource.write.operation': 'insert',
        'hoodie.datasource.write.precombine.field': 'ts',
        'hoodie.upsert.shuffle.parallelism': 2,
        'hoodie.insert.shuffle.parallelism': 2
    }

    # Generate 3 records for the demo
    inserts = sc._jvm.org.apache.hudi.QuickstartUtils.convertToStringList(dataGen.generateInserts(3))
    df = spark.read.json(spark.sparkContext.parallelize(inserts, 2))
    print(f'start to insert data into a new table:[{df.collect()}]')
    df.write.format("hudi").options(**hudi_options).mode("overwrite").save(basePath)
    print(spark.read.format("hudi").load(basePath).collect())
    spark.read.format("hudi").load(basePath).show()

    # Insert 1 more record with mode("overwrite"); it replaces the whole table
    inserts = sc._jvm.org.apache.hudi.QuickstartUtils.convertToStringList(dataGen.generateInserts(1))
    df_new = spark.read.json(spark.sparkContext.parallelize(inserts, 2))
    print(f'start to insert data into a new table:[{df_new.collect()}]')
    df_new.write.format("hudi").options(**hudi_options).mode("overwrite").save(basePath)
    print(spark.read.format("hudi").load(basePath).collect())
    spark.read.format("hudi").load(basePath).show()
Run result: the single record inserted afterwards completely overwrote the 3 records inserted earlier, so the table ends up with 1 record instead of 4.
/Users/gavin/PycharmProjects/pythonProject/venv/bin/python /Users/gavin/PycharmProjects/pythonProject/venv/spark/hudi/extend/insert_overwrite.py
22/03/01 16:21:03 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
22/03/01 16:21:04 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
start to insert data into a new table:[[Row(begin_lat=0.4726905879569653, begin_lon=0.46157858450465483, driver='driver-213', end_lat=0.754803407008858, end_lon=0.9671159942018241, fare=34.158284716382845, partitionpath='americas/brazil/sao_paulo', rider='rider-213', ts=1645537223581, uuid='a63b8b04-6dcc-4edc-af9f-2b9f6dcfe145'), Row(begin_lat=0.6100070562136587, begin_lon=0.8779402295427752, driver='driver-213', end_lat=0.3407870505929602, end_lon=0.5030798142293655, fare=43.4923811219014, partitionpath='americas/brazil/sao_paulo', rider='rider-213', ts=1645608818472, uuid='172f2894-285a-4b48-97c2-d92bf992697c'), Row(begin_lat=0.5731835407930634, begin_lon=0.4923479652912024, driver='driver-213', end_lat=0.08988581780930216, end_lon=0.42520899698713666, fare=64.27696295884016, partitionpath='americas/united_states/san_francisco', rider='rider-213', ts=1645587087764, uuid='aeea15f6-e5b7-438a-b1c6-c00c19347ca1')]]
22/03/01 16:21:09 WARN DFSPropertiesConfiguration: Cannot find HUDI_CONF_DIR, please set it as the dir of hudi-defaults.conf
22/03/01 16:21:09 WARN DFSPropertiesConfiguration: Properties file file:/etc/hudi/conf/hudi-defaults.conf not found. Ignoring to load props file
[Row(_hoodie_commit_time='20220301162109926', _hoodie_commit_seqno='20220301162109926_1_3', _hoodie_record_key='aeea15f6-e5b7-438a-b1c6-c00c19347ca1', _hoodie_partition_path='americas/united_states/san_francisco', _hoodie_file_name='2cde0990-e337-4e8e-a6df-802f1b45fd53-0_1-10-17_20220301162109926.parquet', begin_lat=0.5731835407930634, begin_lon=0.4923479652912024, driver='driver-213', end_lat=0.08988581780930216, end_lon=0.42520899698713666, fare=64.27696295884016, rider='rider-213', ts=1645587087764, uuid='aeea15f6-e5b7-438a-b1c6-c00c19347ca1', partitionpath='americas/united_states/san_francisco'), Row(_hoodie_commit_time='20220301162109926', _hoodie_commit_seqno='20220301162109926_0_1', _hoodie_record_key='a63b8b04-6dcc-4edc-af9f-2b9f6dcfe145', _hoodie_partition_path='americas/brazil/sao_paulo', _hoodie_file_name='45b0d7ee-88ec-4a35-ad65-be649efe88be-0_0-8-16_20220301162109926.parquet', begin_lat=0.4726905879569653, begin_lon=0.46157858450465483, driver='driver-213', end_lat=0.754803407008858, end_lon=0.9671159942018241, fare=34.158284716382845, rider='rider-213', ts=1645537223581, uuid='a63b8b04-6dcc-4edc-af9f-2b9f6dcfe145', partitionpath='americas/brazil/sao_paulo'), Row(_hoodie_commit_time='20220301162109926', _hoodie_commit_seqno='20220301162109926_0_2', _hoodie_record_key='172f2894-285a-4b48-97c2-d92bf992697c', _hoodie_partition_path='americas/brazil/sao_paulo', _hoodie_file_name='45b0d7ee-88ec-4a35-ad65-be649efe88be-0_0-8-16_20220301162109926.parquet', begin_lat=0.6100070562136587, begin_lon=0.8779402295427752, driver='driver-213', end_lat=0.3407870505929602, end_lon=0.5030798142293655, fare=43.4923811219014, rider='rider-213', ts=1645608818472, uuid='172f2894-285a-4b48-97c2-d92bf992697c', partitionpath='americas/brazil/sao_paulo')]
+-------------------+--------------------+--------------------+----------------------+--------------------+------------------+-------------------+----------+-------------------+-------------------+------------------+---------+-------------+--------------------+--------------------+
|_hoodie_commit_time|_hoodie_commit_seqno| _hoodie_record_key|_hoodie_partition_path| _hoodie_file_name| begin_lat| begin_lon| driver| end_lat| end_lon| fare| rider| ts| uuid| partitionpath|
+-------------------+--------------------+--------------------+----------------------+--------------------+------------------+-------------------+----------+-------------------+-------------------+------------------+---------+-------------+--------------------+--------------------+
| 20220301162109926|20220301162109926...|aeea15f6-e5b7-438...| americas/united_s...|2cde0990-e337-4e8...|0.5731835407930634| 0.4923479652912024|driver-213|0.08988581780930216|0.42520899698713666| 64.27696295884016|rider-213|1645587087764|aeea15f6-e5b7-438...|americas/united_s...|
| 20220301162109926|20220301162109926...|a63b8b04-6dcc-4ed...| americas/brazil/s...|45b0d7ee-88ec-4a3...|0.4726905879569653|0.46157858450465483|driver-213| 0.754803407008858| 0.9671159942018241|34.158284716382845|rider-213|1645537223581|a63b8b04-6dcc-4ed...|americas/brazil/s...|
| 20220301162109926|20220301162109926...|172f2894-285a-4b4...| americas/brazil/s...|45b0d7ee-88ec-4a3...|0.6100070562136587| 0.8779402295427752|driver-213| 0.3407870505929602| 0.5030798142293655| 43.4923811219014|rider-213|1645608818472|172f2894-285a-4b4...|americas/brazil/s...|
+-------------------+--------------------+--------------------+----------------------+--------------------+------------------+-------------------+----------+-------------------+-------------------+------------------+---------+-------------+--------------------+--------------------+
start to insert data into a new table:[[Row(begin_lat=0.6220454661413275, begin_lon=0.72024792576853, driver='driver-226', end_lat=0.9048755755365163, end_lon=0.727695054518325, fare=40.613510977307, partitionpath='americas/united_states/san_francisco', rider='rider-226', ts=1645914401180, uuid='db59af46-0fcf-4bb7-ab4a-b9387fb710d3')]]
22/03/01 16:21:14 WARN HoodieSparkSqlWriter$: hoodie table at file:/tmp/hudi_trips_cow_insert_overwirte already exists. Deleting existing data & overwriting with new data.
[Row(_hoodie_commit_time='20220301162114472', _hoodie_commit_seqno='20220301162114472_0_4', _hoodie_record_key='db59af46-0fcf-4bb7-ab4a-b9387fb710d3', _hoodie_partition_path='americas/united_states/san_francisco', _hoodie_file_name='2307cfa5-13d3-481d-8365-d8f8f4e1027a-0_0-35-58_20220301162114472.parquet', begin_lat=0.6220454661413275, begin_lon=0.72024792576853, driver='driver-226', end_lat=0.9048755755365163, end_lon=0.727695054518325, fare=40.613510977307, rider='rider-226', ts=1645914401180, uuid='db59af46-0fcf-4bb7-ab4a-b9387fb710d3', partitionpath='americas/united_states/san_francisco')]
+-------------------+--------------------+--------------------+----------------------+--------------------+------------------+----------------+----------+------------------+-----------------+---------------+---------+-------------+--------------------+--------------------+
|_hoodie_commit_time|_hoodie_commit_seqno| _hoodie_record_key|_hoodie_partition_path| _hoodie_file_name| begin_lat| begin_lon| driver| end_lat| end_lon| fare| rider| ts| uuid| partitionpath|
+-------------------+--------------------+--------------------+----------------------+--------------------+------------------+----------------+----------+------------------+-----------------+---------------+---------+-------------+--------------------+--------------------+
| 20220301162114472|20220301162114472...|db59af46-0fcf-4bb...| americas/united_s...|2307cfa5-13d3-481...|0.6220454661413275|0.72024792576853|driver-226|0.9048755755365163|0.727695054518325|40.613510977307|rider-226|1645914401180|db59af46-0fcf-4bb...|americas/united_s...|
+-------------------+--------------------+--------------------+----------------------+--------------------+------------------+----------------+----------+------------------+-----------------+---------------+---------+-------------+--------------------+--------------------+
Process finished with exit code 0
/Users/gavin/PycharmProjects/pythonProject/venv/bin/python /Users/gavin/PycharmProjects/pythonProject/venv/spark/hudi/basic.py
22/03/01 10:53:03 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
22/03/01 10:53:04 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
Traceback (most recent call last):
File "/Users/gavin/PycharmProjects/pythonProject/venv/spark/hudi/basic.py", line 13, in
dataGen = sc._jvm.org.apache.hudi.QuickstartUtils.DataGenerator()
TypeError: 'JavaPackage' object is not callable
Process finished with exit code 1
Spark is missing the required jars at runtime; point to them explicitly with .config("spark.jars", "${YOUR_JAR_PATH}").
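As a concrete illustration, below is a minimal sketch of a SparkSession that registers both bundle jars explicitly. The jar paths are the local Ivy cache locations used in the examples above and are environment-specific assumptions; adjust them to wherever your jars actually live.

import pyspark

# Minimal sketch: register the Hudi bundle and spark-avro jars explicitly so that
# sc._jvm.org.apache.hudi.* classes and the "hudi" data source can be resolved.
# The paths below are local Ivy cache locations and are environment-specific.
spark = pyspark.sql.SparkSession.builder.appName("MyApp") \
    .config("spark.jars",
            "/Users/gavin/.ivy2/cache/org.apache.hudi/hudi-spark3.1.2-bundle_2.12/jars/hudi-spark3.1.2-bundle_2.12-0.10.1.jar,"
            "/Users/gavin/.ivy2/cache/org.apache.spark/spark-avro_2.12/jars/spark-avro_2.12-3.1.2.jar") \
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
    .getOrCreate()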
/Users/gavin/PycharmProjects/pythonProject/venv/bin/python /Users/gavin/PycharmProjects/pythonProject/venv/spark/hudi/basic_query.py
22/03/01 11:11:52 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
22/03/01 11:11:53 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
Traceback (most recent call last):
File "/Users/gavin/PycharmProjects/pythonProject/venv/spark/hudi/basic_query.py", line 12, in <module>
tripsSnapshotDF = spark. \
File "/Users/gavin/PycharmProjects/pythonProject/venv/lib/python3.8/site-packages/pyspark/sql/readwriter.py", line 204, in load
return self._df(self._jreader.load(path))
File "/Users/gavin/PycharmProjects/pythonProject/venv/lib/python3.8/site-packages/py4j/java_gateway.py", line 1304, in __call__
return_value = get_return_value(
File "/Users/gavin/PycharmProjects/pythonProject/venv/lib/python3.8/site-packages/pyspark/sql/utils.py", line 111, in deco
return f(*a, **kw)
File "/Users/gavin/PycharmProjects/pythonProject/venv/lib/python3.8/site-packages/py4j/protocol.py", line 326, in get_return_value
raise Py4JJavaError(
py4j.protocol.Py4JJavaError: An error occurred while calling o31.load.
: java.lang.ClassNotFoundException: Failed to find data source: hudi. Please find packages at http://spark.apache.org/third-party-projects.html
at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:692)
at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSourceV2(DataSource.scala:746)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:265)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:239)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ClassNotFoundException: hudi.DefaultSource
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
at org.apache.spark.sql.execution.datasources.DataSource$.$anonfun$lookupDataSource$5(DataSource.scala:666)
at scala.util.Try$.apply(Try.scala:213)
at org.apache.spark.sql.execution.datasources.DataSource$.$anonfun$lookupDataSource$4(DataSource.scala:666)
at scala.util.Failure.orElse(Try.scala:224)
at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:666)
... 14 more
Process finished with exit code 1
Again, Spark is missing the required jars at runtime; point to them with .config("spark.jars", "${YOUR_JAR_PATH}"), as in the sketch above.
This error was raised when calling "dataGen.generateUpdates(10)"; the update data could not be generated.
/Users/gavin/PycharmProjects/pythonProject/venv/bin/python /Users/gavin/PycharmProjects/pythonProject/venv/spark/hudi/basic_update.py
22/03/01 13:15:58 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
22/03/01 13:15:59 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
Traceback (most recent call last):
File "/Users/gavin/PycharmProjects/pythonProject/venv/spark/hudi/basic_update.py", line 29, in <module>
updates = sc._jvm.org.apache.hudi.QuickstartUtils.convertToStringList(dataGen.generateUpdates(10))
File "/Users/gavin/PycharmProjects/pythonProject/venv/lib/python3.8/site-packages/py4j/java_gateway.py", line 1304, in __call__
return_value = get_return_value(
File "/Users/gavin/PycharmProjects/pythonProject/venv/lib/python3.8/site-packages/pyspark/sql/utils.py", line 111, in deco
return f(*a, **kw)
File "/Users/gavin/PycharmProjects/pythonProject/venv/lib/python3.8/site-packages/py4j/protocol.py", line 326, in get_return_value
raise Py4JJavaError(
py4j.protocol.Py4JJavaError: An error occurred while calling o35.generateUpdates.
: org.apache.hudi.exception.HoodieException: Data must have been written before performing the update operation
at org.apache.hudi.QuickstartUtils$DataGenerator.generateUpdates(QuickstartUtils.java:180)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
Process finished with exit code 1
From the error message, an insert must happen before the update. For dataGen, the insert step is what generates the insert records, and the update records are then generated based on those previously generated inserts. So the fix is to call "dataGen.generateInserts(10)" once before calling "dataGen.generateUpdates(10)".
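A minimal sketch of that fix, assuming the same SparkSession and jar setup as in the earlier examples (the key point is only the ordering of the two calls):

# `sc` is assumed to come from a SparkSession created with the Hudi bundle jar,
# as in the code blocks above.
dataGen = sc._jvm.org.apache.hudi.QuickstartUtils.DataGenerator()

# The DataGenerator keeps track of the keys it has handed out as inserts and
# builds updates against them, so inserts must be generated (and written) first.
inserts = sc._jvm.org.apache.hudi.QuickstartUtils.convertToStringList(dataGen.generateInserts(10))
# ... write the insert batch to the Hudi table here ...

# Now this call succeeds instead of raising
# "Data must have been written before performing the update operation".
updates = sc._jvm.org.apache.hudi.QuickstartUtils.convertToStringList(dataGen.generateUpdates(10))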
[1] Apache Hudi official documentation: https://hudi.apache.org/docs/quick-start-guide