Because a detailed evaluation of Hudi is needed, this article builds a Spark + Hudi environment from scratch and walks through basic usage.
1) The environment is installed on Linux; the OS is Ubuntu 22.04 LTS.
2) Ubuntu's package sources are set to the Tsinghua mirror.
3) The JDK is already installed (currently 1.8, specifically 1.8.0_333).
Hadoop's start scripts need passwordless SSH to localhost, so install and configure it first:
sudo apt-get install openssh-server
sudo service ssh restart
ssh-keygen -t rsa
ssh-copy-id -i ~/.ssh/id_rsa.pub <your user name>@localhost
ssh localhost
Hadoop is installed as a single-node pseudo-distributed environment; the version choice is tied to the Spark version used later.
For example: Hadoop 3.2.3
Hudi currently supports Spark 3.2, so Spark 3.2 is used here.
1) Download the Hadoop binary package from https://hadoop.apache.org/releases.html
2) Extract it to a convenient installation directory.
1) Add environment variables to ~/.profile:
vi ~/.profile
export HADOOP_HOME=<YOUR_HADOOP_DECOMPRESSED_PATH>
export PATH=$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH
export HADOOP_OPTS="-Djava.library.path=${HADOOP_HOME}/lib/native"
source ~/.profile
Add two environment variable settings to $HADOOP_HOME/etc/hadoop/hadoop-env.sh:
export JAVA_HOME=<YOUR_JAVA_HOME_PATH>
export HADOOP_OPTS="-Djava.library.path=${HADOOP_HOME}/lib/native"
Oddly, the system-wide JAVA_HOME does not take effect here; it has to be set again in hadoop-env.sh, otherwise starting DFS later fails with an error.
Run hadoop version to check that Hadoop can execute:
$ hadoop version
Hadoop 3.2.3
Source code repository https://github.com/apache/hadoop -r abe5358143720085498613d399be3bbf01e0f131
Compiled by ubuntu on 2022-03-20T01:18Z
Compiled with protoc 2.5.0
From source with checksum 39bb14faec14b3aa25388a6d7c345fe8
This command was run using /<your path>/hadoop-3.2.3/share/hadoop/common/hadoop-common-3.2.3.jar
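Optionally, since HADOOP_OPTS points java.library.path at Hadoop's bundled native libraries, you can also check whether they are actually loaded:
hadoop checknative -a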
2) Create a few local directories for Hadoop storage, laid out as below (see the mkdir command after the listing):
$ tree dfs
dfs
├── data
├── name
└── temp
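They can be created in one step; /opt/dfs is simply the prefix used in the configuration below, so adjust it to your own path:
mkdir -p /opt/dfs/name /opt/dfs/data /opt/dfs/temp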
3) Modify the configuration files
$HADOOP_HOME/etc/hadoop/core-site.xml
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>file:/opt/dfs/temp</value> <---- your own temporary-file directory
    <description>A base for other temporary directories.</description>
  </property>
</configuration>
$HADOOP_HOME/etc/hadoop/hdfs-site.xml
<configuration>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/opt/dfs/name</value> <---- your own local NameNode storage directory
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/opt/dfs/data</value> <---- your own local DataNode data directory
  </property>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.permissions</name>
    <value>false</value>
  </property>
</configuration>
4) Format the NameNode:
hdfs namenode -format
$ tree
.
├── data
├── name
│ └── current
│ ├── fsimage_0000000000000000000
│ ├── fsimage_0000000000000000000.md5
│ ├── seen_txid
│ └── VERSION
└── temp
4 directories, 4 files
5) Start Hadoop:
start-all.sh
$ jps
13392 Jps
12363 NameNode
12729 SecondaryNameNode
12526 DataNode
12931 ResourceManager
13077 NodeManager
Open http://localhost:9870 to see the HDFS web admin page.
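As a quick smoke test of HDFS itself, create a directory and list the root (the /test_data directory is reused later for the Spark test):
hdfs dfs -mkdir -p /test_data
hdfs dfs -ls /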
Because Spark SQL needs the Hive Metastore (HMS), Hive has to be installed.
1) Download and extract
Download the latest version, 3.1.3, from http://hive.apache.org/downloads.html.
After downloading, extract it and configure the HIVE_HOME environment variable.
vi ~/.profile
export HIVE_HOME=/<YOUR PATH>/apache-hive-3.1.3-bin
export PATH=$HIVE_HOME/bin:$PATH
2) Configure Hive. By default it connects to Hadoop and uses Derby for metadata; change it to use MySQL as the metadata store.
The default Hive configuration files live in $HIVE_HOME/conf and end in .template; copy them to the names Hive actually reads:
cd $HIVE_HOME/conf
cp hive-default.xml.template hive-site.xml
cp hive-env.sh.template hive-env.sh
cp hive-log4j2.properties.template hive-log4j2.properties
Then edit them.
Add the following to hive-site.xml:
<property>
  <name>system:java.io.tmpdir</name>
  <value>/<YOUR_PATH>/setup/hive</value> <--- create a temporary-file directory
</property>
<property>
  <name>system:user.name</name>
  <value><YOUR_NAME></value> <--- the user name used to access HDFS
</property>
If running Hive fails, copy guava-27.0-jre.jar from the Hadoop installation into $HIVE_HOME/lib and delete guava-19.0.jar there.
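For example (in Hadoop 3.2.x the guava jar normally sits under share/hadoop/common/lib; verify the exact file name in your installation):
rm $HIVE_HOME/lib/guava-19.0.jar
cp $HADOOP_HOME/share/hadoop/common/lib/guava-27.0-jre.jar $HIVE_HOME/lib/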
In addition, hive-site.xml contains an invalid character at line 3227 that must be deleted.
Changes to hive-env.sh:
# Set HADOOP_HOME to point to a specific hadoop install directory
HADOOP_HOME=/<YOUR_PATH>/hadoop-3.2.3
# Hive Configuration Directory can be controlled by:
export HIVE_CONF_DIR=/<YOUR_PATH>/apache-hive-3.1.3-bin/conf
Also modify Hadoop's core-site.xml, replacing <your name> with the user that runs Hadoop (this lets HiveServer2 impersonate users):
<property>
  <name>hadoop.proxyuser.<your name>.hosts</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.<your name>.groups</name>
  <value>*</value>
</property>
Install MySQL and create the database for the Hive Metastore.
Install MySQL (MariaDB-Server also works).
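On Ubuntu this is a single package (substitute mariadb-server if preferred); the root account can then log in over the local socket to run the statements below:
sudo apt-get install mysql-server
sudo mysql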
Create the user and database and grant privileges:
create user 'hive'@'localhost' identified by '<#YOUR_PASSWORD>';
create database hive;
grant all privileges on hive.* to 'hive'@'localhost';
flush privileges;
Change Hive's metastore storage to MySQL (still in hive-site.xml):
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://127.0.0.1/hive?createDatabaseIfNotExist=true</value>
<description>
JDBC connect string for a JDBC metastore.
To use SSL to encrypt/authenticate the connection, provide database-specific SSL flag in the connection URL.
For example, jdbc:postgresql://myhost/db?ssl=true for postgres database.
</description>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
<description>Driver class name for a JDBC metastore</description>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>hive</value>
<description>Username to use against metastore database</description>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value><#YOUR_PASSWORD></value>
<description>password to use against metastore database</description>
</property>
<property>
<name>hive.cli.print.header</name>
<value>true</value>
<description>Whether to print the names of the columns in query output.</description>
</property>
<property>
<name>hive.cli.print.current.db</name>
<value>true</value>
<description>Whether to include the current database in the Hive prompt.</description>
</property>
Copy the JDBC driver mysql-connector-java-8.0.27.jar into the $HIVE_HOME/lib directory.
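For example, fetching the driver from Maven Central (URL assumed from the standard repository layout):
wget https://repo1.maven.org/maven2/mysql/mysql-connector-java/8.0.27/mysql-connector-java-8.0.27.jar
cp mysql-connector-java-8.0.27.jar $HIVE_HOME/lib/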
Initialize the Hive metastore, using MySQL as the storage:
schematool -initSchema -dbType mysql
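If initialization succeeds, schematool can also report the schema it wrote to MySQL:
schematool -dbType mysql -info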
3) Verify that Hive runs.
Start Hadoop:
cd $HADOOP_HOME/sbin
$ ./hadoop-daemon.sh start datanode
$ ./hadoop-daemon.sh start namenode
$ yarn-daemon.sh start nodemanager
$ yarn-daemon.sh start resourcemanager
Start Hive:
cd $HIVE_HOME/bin
./hiveserver2
Check the logs to make sure it started successfully.
Start the Hive client, beeline:
./beeline
!connect jdbc:hive2://127.0.0.1:10000
If the connection succeeds, you can create tables and run basic queries through Hive.
If there are errors, check the permissions on Hadoop's local storage files and directories.
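A minimal non-interactive smoke test with the same JDBC URL, run straight from the shell:
beeline -u jdbc:hive2://127.0.0.1:10000 -e "show databases;"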
Download Scala, extract it, and configure the environment variables.
When downloading, make sure the Scala version matches Spark: Spark 3.2+ is prebuilt with Scala 2.12.
https://www.scala-lang.org/download/scala2.html
vi ~/.profile
export SCALA_HOME=/home/redstar/setup/scala-2.12.16
export PATH=$SCALA_HOME/bin:$PATH
source ~/.profile
Run scala to check that the installation works:
$ scala
Welcome to Scala 2.12.16 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_333).
Type in expressions for evaluation. Or try :help.
scala>
1) Download from http://spark.apache.org/downloads.html
Pay attention to the supported Hadoop version when downloading: for example, spark-3.2.1-bin-hadoop3.2.tgz is built against Hadoop 3.2 and later.
2) Add environment variables to ~/.profile:
vi ~/.profile
export SPARK_HOME=/<YOUR_SPARK_PATH>/spark-3.2.1-bin-hadoop3.2
export PATH=$SPARK_HOME/bin:$PATH
source ~/.profile
3) Modify the Spark configuration
The distribution ships $SPARK_HOME/conf/spark-env.sh.template; rename it to spark-env.sh and add:
JAVA_HOME=<YOUR_JAVA_HOME>
SCALA_HOME=<YOUR_SCALA_HOME>
HADOOP_CONF_DIR=/<YOUR_HADOOP_HOME>/etc/hadoop
SPARK_MASTER_HOST=localhost
SPARK_WORKER_MEMORY=4g
4) Start Spark (use Spark's own start-all.sh under $SPARK_HOME/sbin, not Hadoop's):
$ cd $SPARK_HOME/sbin
$ ./start-all.sh
starting org.apache.spark.deploy.master.Master, logging to /<your path>/spark-3.2.1-bin-hadoop3.2/logs/spark-xxxxxxx-org.apache.spark.deploy.master.Master-1-xxxxxxx-Precision-5520.out
localhost: starting org.apache.spark.deploy.worker.Worker, logging to /<your path>/spark-3.2.1-bin-hadoop3.2/logs/spark-xxxxxxx-org.apache.spark.deploy.worker.Worker-1-xxxxxxx-Precision-5520.out
$ jps
12931 ResourceManager
13077 NodeManager
12729 SecondaryNameNode
12363 NameNode
14173 Jps
12526 DataNode
13974 Master
14101 Worker
Open http://localhost:8080 to see the Spark master web UI.
5) Run the bundled Pi-calculation example to check that the installation works:
$ spark-submit --class org.apache.spark.examples.SparkPi --master spark://localhost:7077 --num-executors 3 --driver-memory 1g --executor-memory 1g --executor-cores 1 ../examples/jars/spark-examples_2.12-3.2.1.jar 10
...
...
22/07/10 00:05:02 INFO DAGScheduler: Job 0 finished: reduce at SparkPi.scala:38, took 4.756610 s
Pi is roughly 3.143159143159143
22/07/10 00:05:02 INFO SparkUI: Stopped Spark web UI at http://10.0.0.13:4040
22/07/10 00:05:02 INFO StandaloneSchedulerBackend: Shutting down all executors
22/07/10 00:05:02 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asking each executor to shut down
22/07/10 00:05:02 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
22/07/10 00:05:02 INFO MemoryStore: MemoryStore cleared
22/07/10 00:05:02 INFO BlockManager: BlockManager stopped
22/07/10 00:05:02 INFO BlockManagerMaster: BlockManagerMaster stopped
22/07/10 00:05:02 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
22/07/10 00:05:02 INFO SparkContext: Successfully stopped SparkContext
22/07/10 00:05:02 INFO ShutdownHookManager: Shutdown hook called
22/07/10 00:05:02 INFO ShutdownHookManager: Deleting directory /tmp/spark-4a66d2a4-b0c3-4b0c-b9de-ca1ec61f745b
22/07/10 00:05:02 INFO ShutdownHookManager: Deleting directory /tmp/spark-328e3ea2-5843-454d-9749-3af87223ef6a
6) Run a simple count in spark-shell.
hdfs dfs -put /<YOUR_PATH>/hadoop-3.2.3/README.txt /test_data
hdfs dfs -ls /test_data
spark-shell --master local[4]
scala> val datasRDD = sc.textFile("/test_data/README.txt")
scala> datasRDD.count
scala> datasRDD.first
scala> :quit
Copy mysql-connector-java-8.0.27.jar into the $SPARK_HOME/jars directory.
Start Spark SQL:
# start the Hive Metastore (HMS) first
hive --service metastore
cd $SPARK_HOME/bin
spark-sql
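If spark-sql is talking to the same metastore (Spark reads hive-site.xml from $SPARK_HOME/conf, so copying the Hive configuration there may be necessary), the databases created through Hive earlier should be visible, for example:
spark-sql -e "show databases;"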
1) Download, extract, and build
https://hudi.apache.org/releases/download
Download the latest source release, currently 0.11.1.
Extract it, enter the directory, and build:
$ mvn clean package -DskipTests -Dspark3.2 -Dscala-2.12
...
...
[INFO] hudi-examples-common ............................... SUCCESS [ 2.279 s]
[INFO] hudi-examples-spark ................................ SUCCESS [ 6.174 s]
[INFO] hudi-flink-datasource .............................. SUCCESS [ 0.037 s]
[INFO] hudi-flink1.14.x ................................... SUCCESS [ 0.143 s]
[INFO] hudi-flink ......................................... SUCCESS [ 3.311 s]
[INFO] hudi-examples-flink ................................ SUCCESS [ 1.859 s]
[INFO] hudi-examples-java ................................. SUCCESS [ 2.403 s]
[INFO] hudi-flink1.13.x ................................... SUCCESS [ 0.338 s]
[INFO] hudi-kafka-connect ................................. SUCCESS [ 2.154 s]
[INFO] hudi-flink1.14-bundle_2.12 ......................... SUCCESS [ 23.147 s]
[INFO] hudi-kafka-connect-bundle .......................... SUCCESS [ 27.037 s]
[INFO] hudi-spark2_2.12 ................................... SUCCESS [ 12.065 s]
[INFO] hudi-spark2-common ................................. SUCCESS [ 0.061 s]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 07:08 min
[INFO] Finished at: 2022-07-10T13:22:24+08:00
[INFO] ------------------------------------------------------------------------
Copy the compiled Hudi Spark bundle jar into the spark-3.2.1-bin-hadoop3.2/jars directory.
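For example (the bundle jar name and location below are what the 0.11.1 build with -Dspark3.2 -Dscala-2.12 should produce under packaging/; verify against your own build output):
cp packaging/hudi-spark-bundle/target/hudi-spark3.2-bundle_2.12-0.11.1.jar $SPARK_HOME/jars/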
Start the Spark SQL client:
spark-sql \
--conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
--conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension' \
--conf 'spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog'
A few SQL tests:
create table test_t1_hudi_cow (
id bigint,
name string,
ts bigint,
dt string,
hh string
) using hudi
tblproperties (
type = 'cow',
primaryKey = 'id',
preCombineField = 'ts'
)
partitioned by (dt, hh);
insert into test_t1_hudi_cow select 1, 'a0', 1000, '2021-12-09', '10';
select * from test_t1_hudi_cow;
-- record id=1 changes `name`
insert into test_t1_hudi_cow select 1, 'a1', 1001, '2021-12-09', '10';
select * from test_t1_hudi_cow;
-- time travel based on first commit time, assume `20220307091628793`
select * from test_t1_hudi_cow timestamp as of '20220307091628793' where id = 1;
-- time travel based on different timestamp formats
select * from test_t1_hudi_cow timestamp as of '2022-03-07 09:16:28.100' where id = 1;
select * from test_t1_hudi_cow timestamp as of '2022-03-08' where id = 1;
Take a look at what data Hudi generates:
#------------------------------------------------------
create table test_t2_hudi_cow (
id bigint,
name string,
ts bigint,
dt string,
hh string
) using hudi
tblproperties (
type = 'cow',
primaryKey = 'id',
preCombineField = 'ts'
)
partitioned by (dt, hh);
#------------------------------------------------------
test_t2_hudi_cow$ tree -a
.
└── .hoodie
├── archived
├── .aux
│ └── .bootstrap
│ ├── .fileids
│ └── .partitions
├── hoodie.properties
├── .hoodie.properties.crc
├── .schema
└── .temp
#------------------------------------------------------
insert into test_t2_hudi_cow select 1, 'a0', 1000, '2021-12-09', '10';
test_t2_hudi_cow$ tree -a
.
├── dt=2021-12-09
│ └── hh=10
│ ├── 6f3c5a3b-e562-4398-b01d-223d85165193-0_0-140-2521_20220710225229810.parquet
│ ├── .6f3c5a3b-e562-4398-b01d-223d85165193-0_0-140-2521_20220710225229810.parquet.crc
│ ├── .hoodie_partition_metadata
│ └── ..hoodie_partition_metadata.crc
└── .hoodie
├── 20220710225229810.commit
├── .20220710225229810.commit.crc
├── 20220710225229810.commit.requested
├── .20220710225229810.commit.requested.crc
├── 20220710225229810.inflight
├── .20220710225229810.inflight.crc
├── archived
├── .aux
│ └── .bootstrap
│ ├── .fileids
│ └── .partitions
├── hoodie.properties
├── .hoodie.properties.crc
├── metadata
│ ├── files
│ │ ├── .files-0000_00000000000000.log.1_0-0-0
│ │ ├── ..files-0000_00000000000000.log.1_0-0-0.crc
│ │ ├── .files-0000_00000000000000.log.1_0-119-1312
│ │ ├── ..files-0000_00000000000000.log.1_0-119-1312.crc
│ │ ├── .files-0000_00000000000000.log.2_0-150-2527
│ │ ├── ..files-0000_00000000000000.log.2_0-150-2527.crc
│ │ ├── .hoodie_partition_metadata
│ │ └── ..hoodie_partition_metadata.crc
│ └── .hoodie
│ ├── 00000000000000.deltacommit
│ ├── .00000000000000.deltacommit.crc
│ ├── 00000000000000.deltacommit.inflight
│ ├── .00000000000000.deltacommit.inflight.crc
│ ├── 00000000000000.deltacommit.requested
│ ├── .00000000000000.deltacommit.requested.crc
│ ├── 20220710225229810.deltacommit
│ ├── .20220710225229810.deltacommit.crc
│ ├── 20220710225229810.deltacommit.inflight
│ ├── .20220710225229810.deltacommit.inflight.crc
│ ├── 20220710225229810.deltacommit.requested
│ ├── .20220710225229810.deltacommit.requested.crc
│ ├── archived
│ ├── .aux
│ │ └── .bootstrap
│ │ ├── .fileids
│ │ └── .partitions
│ ├── .heartbeat
│ ├── hoodie.properties
│ ├── .hoodie.properties.crc
│ ├── .schema
│ └── .temp
├── .schema
└── .temp
#------------------------------------------------------
insert into test_t2_hudi_cow select 1, 'a1', 1001, '2021-12-09', '10';
select * from test_t2_hudi_cow;
test_t2_hudi_cow$ tree -a
.
├── dt=2021-12-09
│ └── hh=10
│ ├── 6f3c5a3b-e562-4398-b01d-223d85165193-0_0-140-2521_20220710225229810.parquet
│ ├── .6f3c5a3b-e562-4398-b01d-223d85165193-0_0-140-2521_20220710225229810.parquet.crc
│ ├── 6f3c5a3b-e562-4398-b01d-223d85165193-0_0-178-3798_20220710230154422.parquet
│ ├── .6f3c5a3b-e562-4398-b01d-223d85165193-0_0-178-3798_20220710230154422.parquet.crc
│ ├── .hoodie_partition_metadata
│ └── ..hoodie_partition_metadata.crc
├── .hoodie
│ ├── 20220710225229810.commit
│ ├── .20220710225229810.commit.crc
│ ├── 20220710225229810.commit.requested
│ ├── .20220710225229810.commit.requested.crc
│ ├── 20220710225229810.inflight
│ ├── .20220710225229810.inflight.crc
│ ├── 20220710230154422.commit
│ ├── .20220710230154422.commit.crc
│ ├── 20220710230154422.commit.requested
│ ├── .20220710230154422.commit.requested.crc
│ ├── 20220710230154422.inflight
│ ├── .20220710230154422.inflight.crc
│ ├── archived
│ ├── .aux
│ │ └── .bootstrap
│ │ ├── .fileids
│ │ └── .partitions
│ ├── hoodie.properties
│ ├── .hoodie.properties.crc
│ ├── metadata
│ │ ├── files
│ │ │ ├── .files-0000_00000000000000.log.1_0-0-0
│ │ │ ├── ..files-0000_00000000000000.log.1_0-0-0.crc
│ │ │ ├── .files-0000_00000000000000.log.1_0-119-1312
│ │ │ ├── ..files-0000_00000000000000.log.1_0-119-1312.crc
│ │ │ ├── .files-0000_00000000000000.log.2_0-150-2527
│ │ │ ├── ..files-0000_00000000000000.log.2_0-150-2527.crc
│ │ │ ├── .files-0000_00000000000000.log.3_0-188-3804
│ │ │ ├── ..files-0000_00000000000000.log.3_0-188-3804.crc
│ │ │ ├── .hoodie_partition_metadata
│ │ │ └── ..hoodie_partition_metadata.crc
│ │ └── .hoodie
│ │ ├── 00000000000000.deltacommit
│ │ ├── .00000000000000.deltacommit.crc
│ │ ├── 00000000000000.deltacommit.inflight
│ │ ├── .00000000000000.deltacommit.inflight.crc
│ │ ├── 00000000000000.deltacommit.requested
│ │ ├── .00000000000000.deltacommit.requested.crc
│ │ ├── 20220710225229810.deltacommit
│ │ ├── .20220710225229810.deltacommit.crc
│ │ ├── 20220710225229810.deltacommit.inflight
│ │ ├── .20220710225229810.deltacommit.inflight.crc
│ │ ├── 20220710225229810.deltacommit.requested
│ │ ├── .20220710225229810.deltacommit.requested.crc
│ │ ├── 20220710230154422.deltacommit
│ │ ├── .20220710230154422.deltacommit.crc
│ │ ├── 20220710230154422.deltacommit.inflight
│ │ ├── .20220710230154422.deltacommit.inflight.crc
│ │ ├── 20220710230154422.deltacommit.requested
│ │ ├── .20220710230154422.deltacommit.requested.crc
│ │ ├── archived
│ │ ├── .aux
│ │ │ └── .bootstrap
│ │ │ ├── .fileids
│ │ │ └── .partitions
│ │ ├── .heartbeat
│ │ ├── hoodie.properties
│ │ ├── .hoodie.properties.crc
│ │ ├── .schema
│ │ └── .temp
│ ├── .schema
│ └── .temp
└── .idea
├── .gitignore
├── misc.xml
├── modules.xml
├── runConfigurations.xml
├── test_t2_hudi_cow.iml
├── vcs.xml
└── workspace.xml