Setting Up a Hive on Spark Environment -- Building Spark
https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark:+Getting+Started
Note that you must have a version of Spark which does not include the Hive jars. Meaning one which was not built with the Hive profile. If you will use Parquet tables, it's recommended to also enable the "parquet-provided" profile. Otherwise there could be conflicts in Parquet dependency. To remove Hive jars from the installation, simply use the following command under your Spark repository:
The wiki gives the following Spark build command:
Since Spark 2.3.0:
./dev/make-distribution.sh --name "hadoop2-without-hive" --tgz "-Pyarn,hadoop-provided,hadoop-2.7,parquet-provided,orc-provided"
-Pyarn: build with YARN support
hadoop-provided: do not bundle Hadoop jars in the distribution
hadoop-2.7: target Hadoop 2.7.x as the compatible Hadoop release line
parquet-provided: do not bundle Parquet jars
orc-provided: do not bundle ORC jars
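As an illustrative sketch of what the -provided profiles imply, the helper below flags jars that should no longer appear in the finished distribution's jars directory. The prefix mapping is my assumption for illustration, not an exhaustive list:

```python
# Illustrative mapping from "-provided" profiles to jar-name prefixes that the
# resulting distribution should NOT ship (assumed prefixes, not exhaustive).
PROVIDED_PREFIXES = {
    "hadoop-provided": ("hadoop-",),
    "parquet-provided": ("parquet-",),
    "orc-provided": ("orc-",),
}

def unexpected_jars(jar_names, active_profiles):
    """Return jars that the active -provided profiles should have excluded."""
    prefixes = tuple(p for prof in active_profiles
                     for p in PROVIDED_PREFIXES.get(prof, ()))
    if not prefixes:
        return []
    return [j for j in jar_names if j.startswith(prefixes)]

# A Spark jar is fine here; a Parquet jar should have been excluded
# by the parquet-provided profile:
print(unexpected_jars(
    ["spark-core_2.11-2.3.0.jar", "parquet-column-1.8.2.jar"],
    ["parquet-provided", "orc-provided"],
))
```

Note that Hive jars are excluded differently: not via a -provided profile, but simply by not activating the Hive profile at all.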
Version Compatibility
Hive on Spark is only tested with a specific version of Spark, so a given version of Hive is only guaranteed to work with a specific version of Spark. Other versions of Spark may work with a given version of Hive, but that is not guaranteed. Below is a list of Hive versions and their corresponding compatible Spark versions.
Hive Version    Spark Version
master          2.3.0
3.0.x           2.3.0
2.3.x           2.0.0
2.2.x           1.6.0
2.1.x           1.6.0
2.0.x           1.5.0
1.2.x           1.3.1
1.1.x           1.2.0
http://spark.apache.org/docs/2.3.0/building-spark.html
Building Apache Spark
Apache Maven
The Maven-based build is the build of reference for Apache Spark. Building Spark using Maven requires Maven 3.3.9 or newer and Java 8+. Note that support for Java 7 was removed as of Spark 2.2.0.
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------
System environment
OS: CentOS Linux release 7.7.1908 (Core)
[ghl@ghlhost etc]$ cat /etc/centos-release
CentOS Linux release 7.7.1908 (Core)
Java: 1.8.0_231
[ghl@ghlhost etc]$ java -version
java version "1.8.0_231"
Java(TM) SE Runtime Environment (build 1.8.0_231-b11)
Java HotSpot(TM) 64-Bit Server VM (build 25.231-b11, mixed mode)
Maven: apache-maven-3.6.3
[ghl@ghlhost etc]$ mvn -version
Apache Maven 3.6.3 (cecedd343002696d0abb50b32b541b8a6ba2883f)
Maven home: /u01/apache-maven-3.6.3
Java version: 1.8.0_231, vendor: Oracle Corporation, runtime: /u01/jdk1.8.0_231/jre
Default locale: en_US, platform encoding: UTF-8
OS name: "linux", version: "3.10.0-1062.9.1.el7.x86_64", arch: "amd64", family: "unix"
Spark: spark-2.3.0 (source)
Go to the Spark source root directory:
[ghl@ghlhost spark-2.3.0]$ ll
total 288
-rw-r--r-- 1 ghl ghl 2318 Feb 23 2018 appveyor.yml
drwxr-xr-x 3 ghl ghl 4096 Feb 23 2018 assembly
drwxr-xr-x 2 ghl ghl 4096 Feb 23 2018 bin
drwxr-xr-x 2 ghl ghl 4096 Feb 23 2018 build
drwxr-xr-x 9 ghl ghl 4096 Feb 23 2018 common
drwxr-xr-x 2 ghl ghl 4096 Feb 23 2018 conf
-rw-r--r-- 1 ghl ghl 995 Feb 23 2018 CONTRIBUTING.md
drwxr-xr-x 3 ghl ghl 4096 Feb 23 2018 core
drwxr-xr-x 5 ghl ghl 4096 Feb 23 2018 data
drwxr-xr-x 6 ghl ghl 4096 Feb 23 2018 dev
drwxr-xr-x 9 ghl ghl 4096 Feb 23 2018 docs
drwxr-xr-x 3 ghl ghl 4096 Feb 23 2018 examples
drwxr-xr-x 15 ghl ghl 4096 Feb 23 2018 external
drwxr-xr-x 3 ghl ghl 4096 Feb 23 2018 graphx
drwxr-xr-x 2 ghl ghl 4096 Feb 23 2018 hadoop-cloud
drwxr-xr-x 3 ghl ghl 4096 Feb 23 2018 launcher
-rw-r--r-- 1 ghl ghl 18045 Feb 23 2018 LICENSE
drwxr-xr-x 2 ghl ghl 4096 Feb 23 2018 licenses
drwxr-xr-x 3 ghl ghl 4096 Feb 23 2018 mllib
drwxr-xr-x 3 ghl ghl 4096 Feb 23 2018 mllib-local
-rw-r--r-- 1 ghl ghl 24913 Feb 23 2018 NOTICE
-rw-r--r-- 1 ghl ghl 101688 Feb 23 2018 pom.xml
drwxr-xr-x 2 ghl ghl 4096 Feb 23 2018 project
drwxr-xr-x 6 ghl ghl 4096 Feb 23 2018 python
drwxr-xr-x 3 ghl ghl 4096 Feb 23 2018 R
-rw-r--r-- 1 ghl ghl 3809 Feb 23 2018 README.md
drwxr-xr-x 5 ghl ghl 4096 Feb 23 2018 repl
drwxr-xr-x 5 ghl ghl 4096 Feb 23 2018 resource-managers
drwxr-xr-x 2 ghl ghl 4096 Feb 23 2018 sbin
-rw-r--r-- 1 ghl ghl 17624 Feb 23 2018 scalastyle-config.xml
drwxr-xr-x 29 ghl ghl 4096 Feb 23 2018 spark
drwxr-xr-x 6 ghl ghl 4096 Feb 23 2018 sql
drwxr-xr-x 3 ghl ghl 4096 Feb 23 2018 streaming
drwxr-xr-x 3 ghl ghl 4096 Feb 23 2018 tools
[ghl@ghlhost spark-2.3.0]$ pwd
/home/ghl/softwares/spark-2.3.0
Set MAVEN_OPTS:
[ghl@ghlhost spark-2.3.0]$ export MAVEN_OPTS="-Xmx2g -XX:ReservedCodeCacheSize=512m"
[ghl@ghlhost spark-2.3.0]$ echo $MAVEN_OPTS
-Xmx2g -XX:ReservedCodeCacheSize=512m
Edit pom.xml so that both the central repository and the central plugin repository point to the Aliyun mirror (the URL lines, 236 and 249 in this copy of the pom). The relevant sections become:

  <repositories>
    <repository>
      <id>central</id>
      <name>Maven Repository</name>
      <url>https://maven.aliyun.com/nexus/content/groups/public/</url>
      <releases>
        <enabled>true</enabled>
      </releases>
      <snapshots>
        <enabled>false</enabled>
      </snapshots>
    </repository>
  </repositories>
  <pluginRepositories>
    <pluginRepository>
      <id>central</id>
      <url>https://maven.aliyun.com/nexus/content/groups/public/</url>
      <releases>
        <enabled>true</enabled>
      </releases>
      <snapshots>
        <enabled>false</enabled>
      </snapshots>
    </pluginRepository>
  </pluginRepositories>
Start the build
./dev/make-distribution.sh --name "hadoop277-without-hive" --tgz "-Pyarn,hadoop-provided,hadoop-2.7,parquet-provided,orc-provided" -Dhadoop.version=2.7.7
Here the Hadoop version is pinned to 2.7.7 with "-Dhadoop.version=2.7.7".
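For reference, the pieces of this invocation can be assembled programmatically; this sketch just mirrors the flags used above and annotates what each part controls:

```python
def make_distribution_cmd(name, profiles, hadoop_version):
    """Assemble the make-distribution.sh invocation used in this walkthrough."""
    return [
        "./dev/make-distribution.sh",
        "--name", name,                        # becomes part of the tarball name
        "--tgz",                               # produce a .tgz archive in the source root
        "-P" + ",".join(profiles),             # Maven profiles, passed through to mvn
        f"-Dhadoop.version={hadoop_version}",  # override the profile's default Hadoop version
    ]

cmd = make_distribution_cmd(
    "hadoop277-without-hive",
    ["yarn", "hadoop-provided", "hadoop-2.7", "parquet-provided", "orc-provided"],
    "2.7.7",
)
print(" ".join(cmd))
```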
main:
[INFO] Executed tasks
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary for Spark Project Parent POM 2.3.0:
[INFO]
[INFO] Spark Project Parent POM ........................... SUCCESS [02:45 min]
[INFO] Spark Project Tags ................................. SUCCESS [ 21.019 s]
[INFO] Spark Project Sketch ............................... SUCCESS [ 11.515 s]
[INFO] Spark Project Local DB ............................. SUCCESS [ 17.450 s]
[INFO] Spark Project Networking ........................... SUCCESS [ 18.672 s]
[INFO] Spark Project Shuffle Streaming Service ............ SUCCESS [ 8.324 s]
[INFO] Spark Project Unsafe ............................... SUCCESS [ 18.598 s]
[INFO] Spark Project Launcher ............................. SUCCESS [ 44.480 s]
[INFO] Spark Project Core ................................. SUCCESS [07:05 min]
[INFO] Spark Project ML Local Library ..................... SUCCESS [ 40.373 s]
[INFO] Spark Project GraphX ............................... SUCCESS [ 31.986 s]
[INFO] Spark Project Streaming ............................ SUCCESS [01:10 min]
[INFO] Spark Project Catalyst ............................. SUCCESS [03:26 min]
[INFO] Spark Project SQL .................................. SUCCESS [05:16 min]
[INFO] Spark Project ML Library ........................... SUCCESS [03:22 min]
[INFO] Spark Project Tools ................................ SUCCESS [ 11.906 s]
[INFO] Spark Project Hive ................................. SUCCESS [01:33 min]
[INFO] Spark Project REPL ................................. SUCCESS [ 9.409 s]
[INFO] Spark Project YARN Shuffle Service ................. SUCCESS [ 12.235 s]
[INFO] Spark Project YARN ................................. SUCCESS [ 37.168 s]
[INFO] Spark Project Assembly ............................. SUCCESS [ 9.423 s]
[INFO] Spark Integration for Kafka 0.10 ................... SUCCESS [ 20.819 s]
[INFO] Kafka 0.10 Source for Structured Streaming ......... SUCCESS [ 18.046 s]
[INFO] Spark Project Examples ............................. SUCCESS [ 31.207 s]
[INFO] Spark Integration for Kafka 0.10 Assembly .......... SUCCESS [ 8.645 s]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 30:53 min
[INFO] Finished at: 2020-01-14T10:42:13+08:00
[INFO] ------------------------------------------------------------------------
After a successful build, the file spark-2.3.0-bin-hadoop277-without-hive.tgz appears in the source root (hadoop277-without-hive is the name passed via --name "hadoop277-without-hive"). This is the Spark distribution, free of Hive dependency jars, that we need.
[ghl@ghlhost spark-2.3.0]$ ll -t
total 131152
-rw-rw-r-- 1 ghl ghl 133992952 Jan 14 10:42 spark-2.3.0-bin-hadoop277-without-hive.tgz
drwxrwxr-x 11 ghl ghl 4096 Jan 14 10:42 dist
drwxr-xr-x 4 ghl ghl 4096 Jan 14 10:42 assembly
drwxr-xr-x 4 ghl ghl 4096 Jan 14 10:41 examples
drwxr-xr-x 6 ghl ghl 4096 Jan 14 10:41 repl
drwxr-xr-x 4 ghl ghl 4096 Jan 14 10:39 mllib
drwxr-xr-x 4 ghl ghl 4096 Jan 14 10:26 streaming
drwxr-xr-x 4 ghl ghl 4096 Jan 14 10:25 graphx
drwxr-xr-x 4 ghl ghl 4096 Jan 14 10:24 core
drwxr-xr-x 4 ghl ghl 4096 Jan 14 10:17 mllib-local
drwxr-xr-x 4 ghl ghl 4096 Jan 14 10:16 launcher
drwxr-xr-x 4 ghl ghl 4096 Jan 14 10:14 tools
drwxrwxr-x 8 ghl ghl 4096 Jan 14 10:14 target
drwxr-xr-x 4 ghl ghl 4096 Jan 14 10:10 build
-rw-r--r-- 1 ghl ghl 101845 Jan 14 09:31 pom.xml
drwxr-xr-x 29 ghl ghl 4096 Feb 23 2018 spark
drwxr-xr-x 6 ghl ghl 4096 Feb 23 2018 sql
-rw-r--r-- 1 ghl ghl 2318 Feb 23 2018 appveyor.yml
drwxr-xr-x 9 ghl ghl 4096 Feb 23 2018 common
drwxr-xr-x 2 ghl ghl 4096 Feb 23 2018 conf
-rw-r--r-- 1 ghl ghl 995 Feb 23 2018 CONTRIBUTING.md
drwxr-xr-x 5 ghl ghl 4096 Feb 23 2018 data
drwxr-xr-x 6 ghl ghl 4096 Feb 23 2018 dev
drwxr-xr-x 15 ghl ghl 4096 Feb 23 2018 external
drwxr-xr-x 2 ghl ghl 4096 Feb 23 2018 project
-rw-r--r-- 1 ghl ghl 17624 Feb 23 2018 scalastyle-config.xml
drwxr-xr-x 2 ghl ghl 4096 Feb 23 2018 bin
drwxr-xr-x 9 ghl ghl 4096 Feb 23 2018 docs
drwxr-xr-x 2 ghl ghl 4096 Feb 23 2018 hadoop-cloud
-rw-r--r-- 1 ghl ghl 18045 Feb 23 2018 LICENSE
drwxr-xr-x 2 ghl ghl 4096 Feb 23 2018 licenses
-rw-r--r-- 1 ghl ghl 24913 Feb 23 2018 NOTICE
drwxr-xr-x 6 ghl ghl 4096 Feb 23 2018 python
drwxr-xr-x 3 ghl ghl 4096 Feb 23 2018 R
-rw-r--r-- 1 ghl ghl 3809 Feb 23 2018 README.md
drwxr-xr-x 5 ghl ghl 4096 Feb 23 2018 resource-managers
drwxr-xr-x 2 ghl ghl 4096 Feb 23 2018 sbin
[ghl@ghlhost spark-2.3.0]$
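To sanity-check the artifact, one can scan the tarball's entries for Hive jars; for a build done without the Hive profile, nothing should match. A minimal sketch (the tarball name follows the --name flag used above):

```python
import tarfile

def hive_jar_names(member_names):
    """Filter archive member names down to Hive jars."""
    return [n for n in member_names
            if n.endswith(".jar") and "hive" in n.lower()]

def hive_jars_in(tgz_path):
    """Return any Hive jar entries inside a Spark distribution tarball."""
    with tarfile.open(tgz_path, "r:gz") as tar:
        return hive_jar_names(m.name for m in tar.getmembers())

# For a distribution built without the Hive profile this should print []:
# print(hive_jars_in("spark-2.3.0-bin-hadoop277-without-hive.tgz"))
```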
------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1. The first build attempt failed because it used the default Maven repository; for the second attempt the repository was switched to the Aliyun mirror in pom.xml.
2. The first few build steps are slow; be patient.
[ghl@ghlhost spark-2.3.0]$ ./dev/make-distribution.sh --name "hadoop277-without-hive" --tgz "-Pyarn,hadoop-provided,hadoop-2.7,parquet-provided,orc-provided" -Dhadoop.version=2.7.7
+++ dirname ./dev/make-distribution.sh
++ cd ./dev/..
++ pwd
+ SPARK_HOME=/home/ghl/softwares/spark-2.3.0
+ DISTDIR=/home/ghl/softwares/spark-2.3.0/dist
+ MAKE_TGZ=false
+ MAKE_PIP=false
+ MAKE_R=false
+ NAME=none
+ MVN=/home/ghl/softwares/spark-2.3.0/build/mvn
+ (( 5 ))
+ case $1 in
+ NAME=hadoop277-without-hive
+ shift
+ shift
+ (( 3 ))
+ case $1 in
+ MAKE_TGZ=true
+ shift
+ (( 2 ))
+ case $1 in
+ break
+ '[' -z /u01/jdk1.8.0_231 ']'
+ '[' -z /u01/jdk1.8.0_231 ']'
++ command -v git
+ '[' ']'
++ command -v /home/ghl/softwares/spark-2.3.0/build/mvn
+ '[' '!' /home/ghl/softwares/spark-2.3.0/build/mvn ']'
++ /home/ghl/softwares/spark-2.3.0/build/mvn help:evaluate -Dexpression=project.version -Pyarn,hadoop-provided,hadoop-2.7,parquet-provided,orc-provided -Dhadoop.version=2.7.7
++ grep -v INFO
++ tail -n 1
+ VERSION=2.3.0
++ /home/ghl/softwares/spark-2.3.0/build/mvn help:evaluate -Dexpression=scala.binary.version -Pyarn,hadoop-provided,hadoop-2.7,parquet-provided,orc-provided -Dhadoop.version=2.7.7
++ grep -v INFO
++ tail -n 1
+ SCALA_VERSION=2.11
++ /home/ghl/softwares/spark-2.3.0/build/mvn help:evaluate -Dexpression=hadoop.version -Pyarn,hadoop-provided,hadoop-2.7,parquet-provided,orc-provided -Dhadoop.version=2.7.7
++ grep -v INFO
++ tail -n 1
+ SPARK_HADOOP_VERSION=2.7.7
++ /home/ghl/softwares/spark-2.3.0/build/mvn help:evaluate -Dexpression=project.activeProfiles -pl sql/hive -Pyarn,hadoop-provided,hadoop-2.7,parquet-provided,orc-provided -Dhadoop.version=2.7.7
++ grep -v INFO
++ fgrep --count 'hive '
++ echo -n
3. The compiled Spark distribution has been uploaded to
https://download.csdn.net/download/ghl0451/12101209
(still under review at the time of writing).