Building Spark 2.3.0 without Hive

Setting up a Hive on Spark environment -- building Spark

https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark:+Getting+Started

  • According to the Hive wiki above, a Hive on Spark environment requires a Spark build that does not include the Hive-related jars.

Note that you must have a version of Spark which does not include the Hive jars. Meaning one which was not built with the Hive profile. If you will use Parquet tables, it's recommended to also enable the "parquet-provided" profile. Otherwise there could be conflicts in Parquet dependency. To remove Hive jars from the installation, simply use the following command under your Spark repository:

The wiki gives the following Spark build command:

Since Spark 2.3.0:

./dev/make-distribution.sh --name "hadoop2-without-hive" --tgz "-Pyarn,hadoop-provided,hadoop-2.7,parquet-provided,orc-provided"

  1. yarn builds in YARN support (the leading -P is Maven's profile switch, so -Pyarn activates the yarn profile)
  2. hadoop-provided excludes the Hadoop-related jars from the build
  3. hadoop-2.7 targets Hadoop 2.7.x as the compatible Hadoop line
  4. parquet-provided excludes the Parquet-related jars from the build
  5. orc-provided excludes the ORC-related jars from the build
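For reference, make-distribution.sh forwards these arguments on to Maven. A rough sketch of the equivalent direct Maven invocation follows; the extra flags shown (e.g. -DskipTests) are assumptions based on common usage, not read out of the script:

```shell
# The profile list exactly as passed to make-distribution.sh.
PROFILES="yarn,hadoop-provided,hadoop-2.7,parquet-provided,orc-provided"

# Roughly the Maven command the script ends up running (echoed, not executed):
echo "./build/mvn clean package -DskipTests -P${PROFILES}"
```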
  • The latest Hive release at the time of writing is hive-3.1.2. According to the wiki's compatibility table, the Spark version to pair with it is Spark 2.3.0.

Version Compatibility

Hive on Spark is only tested with a specific version of Spark, so a given version of Hive is only guaranteed to work with a specific version of Spark. Other versions of Spark may work with a given version of Hive, but that is not guaranteed. Below is a list of Hive versions and their corresponding compatible Spark versions.

Hive Version    Spark Version
master          2.3.0
3.0.x           2.3.0
2.3.x           2.0.0
2.2.x           1.6.0
2.1.x           1.6.0
2.0.x           1.5.0
1.2.x           1.3.1
1.1.x           1.2.0
  • According to the official Spark documentation, building Spark requires Maven 3.3.9 or newer and Java 8+.

http://spark.apache.org/docs/2.3.0/building-spark.html

Building Apache Spark

Apache Maven

The Maven-based build is the build of reference for Apache Spark. Building Spark using Maven requires Maven 3.3.9 or newer and Java 8+. Note that support for Java 7 was removed as of Spark 2.2.0.
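A quick way to sanity-check the "Maven 3.3.9 or newer" requirement is a version-sort comparison. A minimal sketch, with the installed version hard-coded for illustration (on a real machine you would take it from `mvn -version`):

```shell
required="3.3.9"
current="3.6.3"   # e.g. the version reported by `mvn -version`

# sort -V orders version strings numerically; if the required version
# sorts first, the current one is new enough.
lowest=$(printf '%s\n%s\n' "$required" "$current" | sort -V | head -n1)
if [ "$lowest" = "$required" ]; then
  echo "Maven $current meets the minimum $required"
else
  echo "Maven $current is older than $required"
fi
```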

  • The Spark 2.3.0 build steps follow.

-------------------------------------------------------------------------------------------------------------------------------------------------------------------------

System environment

Operating system: CentOS Linux release 7.7.1908 (Core)

[ghl@ghlhost etc]$ cat /etc/centos-release
CentOS Linux release 7.7.1908 (Core)

Java: java version "1.8.0_231"

[ghl@ghlhost etc]$ java -version
java version "1.8.0_231"
Java(TM) SE Runtime Environment (build 1.8.0_231-b11)
Java HotSpot(TM) 64-Bit Server VM (build 25.231-b11, mixed mode)

Maven: apache-maven-3.6.3

[ghl@ghlhost etc]$ mvn -version
Apache Maven 3.6.3 (cecedd343002696d0abb50b32b541b8a6ba2883f)
Maven home: /u01/apache-maven-3.6.3
Java version: 1.8.0_231, vendor: Oracle Corporation, runtime: /u01/jdk1.8.0_231/jre
Default locale: en_US, platform encoding: UTF-8
OS name: "linux", version: "3.10.0-1062.9.1.el7.x86_64", arch: "amd64", family: "unix"

Spark: spark-2.3.0

Change into the Spark source root directory

[ghl@ghlhost spark-2.3.0]$ ll
total 288
-rw-r--r--  1 ghl ghl   2318 Feb 23  2018 appveyor.yml
drwxr-xr-x  3 ghl ghl   4096 Feb 23  2018 assembly
drwxr-xr-x  2 ghl ghl   4096 Feb 23  2018 bin
drwxr-xr-x  2 ghl ghl   4096 Feb 23  2018 build
drwxr-xr-x  9 ghl ghl   4096 Feb 23  2018 common
drwxr-xr-x  2 ghl ghl   4096 Feb 23  2018 conf
-rw-r--r--  1 ghl ghl    995 Feb 23  2018 CONTRIBUTING.md
drwxr-xr-x  3 ghl ghl   4096 Feb 23  2018 core
drwxr-xr-x  5 ghl ghl   4096 Feb 23  2018 data
drwxr-xr-x  6 ghl ghl   4096 Feb 23  2018 dev
drwxr-xr-x  9 ghl ghl   4096 Feb 23  2018 docs
drwxr-xr-x  3 ghl ghl   4096 Feb 23  2018 examples
drwxr-xr-x 15 ghl ghl   4096 Feb 23  2018 external
drwxr-xr-x  3 ghl ghl   4096 Feb 23  2018 graphx
drwxr-xr-x  2 ghl ghl   4096 Feb 23  2018 hadoop-cloud
drwxr-xr-x  3 ghl ghl   4096 Feb 23  2018 launcher
-rw-r--r--  1 ghl ghl  18045 Feb 23  2018 LICENSE
drwxr-xr-x  2 ghl ghl   4096 Feb 23  2018 licenses
drwxr-xr-x  3 ghl ghl   4096 Feb 23  2018 mllib
drwxr-xr-x  3 ghl ghl   4096 Feb 23  2018 mllib-local
-rw-r--r--  1 ghl ghl  24913 Feb 23  2018 NOTICE
-rw-r--r--  1 ghl ghl 101688 Feb 23  2018 pom.xml
drwxr-xr-x  2 ghl ghl   4096 Feb 23  2018 project
drwxr-xr-x  6 ghl ghl   4096 Feb 23  2018 python
drwxr-xr-x  3 ghl ghl   4096 Feb 23  2018 R
-rw-r--r--  1 ghl ghl   3809 Feb 23  2018 README.md
drwxr-xr-x  5 ghl ghl   4096 Feb 23  2018 repl
drwxr-xr-x  5 ghl ghl   4096 Feb 23  2018 resource-managers
drwxr-xr-x  2 ghl ghl   4096 Feb 23  2018 sbin
-rw-r--r--  1 ghl ghl  17624 Feb 23  2018 scalastyle-config.xml
drwxr-xr-x 29 ghl ghl   4096 Feb 23  2018 spark
drwxr-xr-x  6 ghl ghl   4096 Feb 23  2018 sql
drwxr-xr-x  3 ghl ghl   4096 Feb 23  2018 streaming
drwxr-xr-x  3 ghl ghl   4096 Feb 23  2018 tools
[ghl@ghlhost spark-2.3.0]$ pwd
/home/ghl/softwares/spark-2.3.0

Set MAVEN_OPTS

[ghl@ghlhost spark-2.3.0]$ export MAVEN_OPTS="-Xmx2g -XX:ReservedCodeCacheSize=512m"
[ghl@ghlhost spark-2.3.0]$ echo $MAVEN_OPTS
-Xmx2g -XX:ReservedCodeCacheSize=512m
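The export above only lasts for the current shell session. To keep it across sessions, the same line can be appended to a shell rc file; a temp file stands in for ~/.bashrc here so the sketch is self-contained:

```shell
# Append the MAVEN_OPTS export to an rc file (a temp file for illustration;
# in practice this would be ~/.bashrc or similar).
rcfile=$(mktemp)
echo 'export MAVEN_OPTS="-Xmx2g -XX:ReservedCodeCacheSize=512m"' >> "$rcfile"
grep -c 'MAVEN_OPTS' "$rcfile"   # prints 1
```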

Modify pom.xml, pointing the central repository and plugin repository at the Aliyun mirror (the URLs sit on pom.xml lines 235-236 and 248-249):

   230	  <repositories>
   231	    <repository>
   232	      <id>central</id>
   233	      <!-- Maven tries the central repo first, so dependency resolution is faster -->
   234	      <name>Maven Repository</name>
   235	      <!-- changed to the Aliyun mirror -->
   236	      <url>https://maven.aliyun.com/nexus/content/groups/public/</url>
   237	      <releases>
   238	        <enabled>true</enabled>
   239	      </releases>
   240	      <snapshots>
   241	        <enabled>false</enabled>
   242	      </snapshots>
   243	    </repository>
   244	  </repositories>
   245	  <pluginRepositories>
   246	    <pluginRepository>
   247	      <id>central</id>
   248	      <!-- changed to the Aliyun mirror -->
   249	      <url>https://maven.aliyun.com/nexus/content/groups/public/</url>
   250	      <releases>
   251	        <enabled>true</enabled>
   252	      </releases>
   253	      <snapshots>
   254	        <enabled>false</enabled>
   255	      </snapshots>
   256	    </pluginRepository>
   257	  </pluginRepositories>
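As an alternative to editing Spark's pom.xml, the same mirror can be configured once in Maven's own settings.xml via a `<mirror>` entry. This is a standard Maven mechanism rather than what the steps above use, so treat it as an optional variation; the sketch writes under a temp dir, while the real location is $HOME/.m2/settings.xml (take care not to clobber an existing one):

```shell
# Write a minimal settings.xml that mirrors Maven Central to Aliyun.
# A temp dir stands in for ~/.m2 so this sketch is non-destructive.
m2dir=$(mktemp -d)
cat > "$m2dir/settings.xml" <<'EOF'
<settings>
  <mirrors>
    <mirror>
      <id>aliyun-central</id>
      <mirrorOf>central</mirrorOf>
      <url>https://maven.aliyun.com/nexus/content/groups/public/</url>
    </mirror>
  </mirrors>
</settings>
EOF
grep -c 'maven.aliyun.com' "$m2dir/settings.xml"   # prints 1
```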

 

Start the build

./dev/make-distribution.sh --name "hadoop277-without-hive" --tgz "-Pyarn,hadoop-provided,hadoop-2.7,parquet-provided,orc-provided" -Dhadoop.version=2.7.7

The Hadoop version is pinned to 2.7.7 here via "-Dhadoop.version=2.7.7".

main:
[INFO] Executed tasks
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary for Spark Project Parent POM 2.3.0:
[INFO] 
[INFO] Spark Project Parent POM ........................... SUCCESS [02:45 min]
[INFO] Spark Project Tags ................................. SUCCESS [ 21.019 s]
[INFO] Spark Project Sketch ............................... SUCCESS [ 11.515 s]
[INFO] Spark Project Local DB ............................. SUCCESS [ 17.450 s]
[INFO] Spark Project Networking ........................... SUCCESS [ 18.672 s]
[INFO] Spark Project Shuffle Streaming Service ............ SUCCESS [  8.324 s]
[INFO] Spark Project Unsafe ............................... SUCCESS [ 18.598 s]
[INFO] Spark Project Launcher ............................. SUCCESS [ 44.480 s]
[INFO] Spark Project Core ................................. SUCCESS [07:05 min]
[INFO] Spark Project ML Local Library ..................... SUCCESS [ 40.373 s]
[INFO] Spark Project GraphX ............................... SUCCESS [ 31.986 s]
[INFO] Spark Project Streaming ............................ SUCCESS [01:10 min]
[INFO] Spark Project Catalyst ............................. SUCCESS [03:26 min]
[INFO] Spark Project SQL .................................. SUCCESS [05:16 min]
[INFO] Spark Project ML Library ........................... SUCCESS [03:22 min]
[INFO] Spark Project Tools ................................ SUCCESS [ 11.906 s]
[INFO] Spark Project Hive ................................. SUCCESS [01:33 min]
[INFO] Spark Project REPL ................................. SUCCESS [  9.409 s]
[INFO] Spark Project YARN Shuffle Service ................. SUCCESS [ 12.235 s]
[INFO] Spark Project YARN ................................. SUCCESS [ 37.168 s]
[INFO] Spark Project Assembly ............................. SUCCESS [  9.423 s]
[INFO] Spark Integration for Kafka 0.10 ................... SUCCESS [ 20.819 s]
[INFO] Kafka 0.10 Source for Structured Streaming ......... SUCCESS [ 18.046 s]
[INFO] Spark Project Examples ............................. SUCCESS [ 31.207 s]
[INFO] Spark Integration for Kafka 0.10 Assembly .......... SUCCESS [  8.645 s]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  30:53 min
[INFO] Finished at: 2020-01-14T10:42:13+08:00
[INFO] ------------------------------------------------------------------------

After a successful build, a file named spark-2.3.0-bin-hadoop277-without-hive.tgz appears in the source root (hadoop277-without-hive being the name given by --name "hadoop277-without-hive"). This is the Spark distribution without Hive dependency jars that we need.

[ghl@ghlhost spark-2.3.0]$ ll -t
total 131152
-rw-rw-r--  1 ghl ghl 133992952 Jan 14 10:42 spark-2.3.0-bin-hadoop277-without-hive.tgz
drwxrwxr-x 11 ghl ghl      4096 Jan 14 10:42 dist
drwxr-xr-x  4 ghl ghl      4096 Jan 14 10:42 assembly
drwxr-xr-x  4 ghl ghl      4096 Jan 14 10:41 examples
drwxr-xr-x  6 ghl ghl      4096 Jan 14 10:41 repl
drwxr-xr-x  4 ghl ghl      4096 Jan 14 10:39 mllib
drwxr-xr-x  4 ghl ghl      4096 Jan 14 10:26 streaming
drwxr-xr-x  4 ghl ghl      4096 Jan 14 10:25 graphx
drwxr-xr-x  4 ghl ghl      4096 Jan 14 10:24 core
drwxr-xr-x  4 ghl ghl      4096 Jan 14 10:17 mllib-local
drwxr-xr-x  4 ghl ghl      4096 Jan 14 10:16 launcher
drwxr-xr-x  4 ghl ghl      4096 Jan 14 10:14 tools
drwxrwxr-x  8 ghl ghl      4096 Jan 14 10:14 target
drwxr-xr-x  4 ghl ghl      4096 Jan 14 10:10 build
-rw-r--r--  1 ghl ghl    101845 Jan 14 09:31 pom.xml
drwxr-xr-x 29 ghl ghl      4096 Feb 23  2018 spark
drwxr-xr-x  6 ghl ghl      4096 Feb 23  2018 sql
-rw-r--r--  1 ghl ghl      2318 Feb 23  2018 appveyor.yml
drwxr-xr-x  9 ghl ghl      4096 Feb 23  2018 common
drwxr-xr-x  2 ghl ghl      4096 Feb 23  2018 conf
-rw-r--r--  1 ghl ghl       995 Feb 23  2018 CONTRIBUTING.md
drwxr-xr-x  5 ghl ghl      4096 Feb 23  2018 data
drwxr-xr-x  6 ghl ghl      4096 Feb 23  2018 dev
drwxr-xr-x 15 ghl ghl      4096 Feb 23  2018 external
drwxr-xr-x  2 ghl ghl      4096 Feb 23  2018 project
-rw-r--r--  1 ghl ghl     17624 Feb 23  2018 scalastyle-config.xml
drwxr-xr-x  2 ghl ghl      4096 Feb 23  2018 bin
drwxr-xr-x  9 ghl ghl      4096 Feb 23  2018 docs
drwxr-xr-x  2 ghl ghl      4096 Feb 23  2018 hadoop-cloud
-rw-r--r--  1 ghl ghl     18045 Feb 23  2018 LICENSE
drwxr-xr-x  2 ghl ghl      4096 Feb 23  2018 licenses
-rw-r--r--  1 ghl ghl     24913 Feb 23  2018 NOTICE
drwxr-xr-x  6 ghl ghl      4096 Feb 23  2018 python
drwxr-xr-x  3 ghl ghl      4096 Feb 23  2018 R
-rw-r--r--  1 ghl ghl      3809 Feb 23  2018 README.md
drwxr-xr-x  5 ghl ghl      4096 Feb 23  2018 resource-managers
drwxr-xr-x  2 ghl ghl      4096 Feb 23  2018 sbin
[ghl@ghlhost spark-2.3.0]$
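One way to confirm the distribution really excludes the Hive jars is to list the tarball and grep the file names. Note the top-level directory itself contains the word "hive" ("without-hive"), so only basenames are checked. The sketch below is self-contained and uses a tiny stand-in archive; on the real build you would point tar at spark-2.3.0-bin-hadoop277-without-hive.tgz instead:

```shell
# Build a tiny stand-in archive mimicking the layout of the real tarball.
mkdir -p spark-demo-without-hive/jars
touch spark-demo-without-hive/jars/spark-core_2.11-2.3.0.jar
tar -czf spark-demo-without-hive.tgz spark-demo-without-hive

# Count entries whose *file name* mentions hive. The basename filter
# (awk -F/ '{print $NF}') keeps the "without-hive" directory name from matching.
hive_jars=$(tar -tzf spark-demo-without-hive.tgz | awk -F/ '{print $NF}' | grep -ci 'hive' || true)
echo "hive-named files in archive: $hive_jars"   # expect 0 for a without-hive build
```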

 

------------------------------------------------------------------------------------------------------------------------------------------------------------------------

1. The first build attempt failed because it used the default Maven repository, https://repo.maven.apache.org/maven2.

After switching to https://maven.aliyun.com/nexus/content/groups/public/, the second build finished in roughly half an hour.
2. The first few build stages are quite slow; be patient.

[ghl@ghlhost spark-2.3.0]$ ./dev/make-distribution.sh --name "hadoop277-without-hive" --tgz "-Pyarn,hadoop-provided,hadoop-2.7,parquet-provided,orc-provided" -Dhadoop.version=2.7.7
+++ dirname ./dev/make-distribution.sh
++ cd ./dev/..
++ pwd
+ SPARK_HOME=/home/ghl/softwares/spark-2.3.0
+ DISTDIR=/home/ghl/softwares/spark-2.3.0/dist
+ MAKE_TGZ=false
+ MAKE_PIP=false
+ MAKE_R=false
+ NAME=none
+ MVN=/home/ghl/softwares/spark-2.3.0/build/mvn
+ ((  5  ))
+ case $1 in
+ NAME=hadoop277-without-hive
+ shift
+ shift
+ ((  3  ))
+ case $1 in
+ MAKE_TGZ=true
+ shift
+ ((  2  ))
+ case $1 in
+ break
+ '[' -z /u01/jdk1.8.0_231 ']'
+ '[' -z /u01/jdk1.8.0_231 ']'
++ command -v git
+ '[' ']'
++ command -v /home/ghl/softwares/spark-2.3.0/build/mvn
+ '[' '!' /home/ghl/softwares/spark-2.3.0/build/mvn ']'
++ /home/ghl/softwares/spark-2.3.0/build/mvn help:evaluate -Dexpression=project.version -Pyarn,hadoop-provided,hadoop-2.7,parquet-provided,orc-provided -Dhadoop.version=2.7.7
++ grep -v INFO
++ tail -n 1
+ VERSION=2.3.0
++ /home/ghl/softwares/spark-2.3.0/build/mvn help:evaluate -Dexpression=scala.binary.version -Pyarn,hadoop-provided,hadoop-2.7,parquet-provided,orc-provided -Dhadoop.version=2.7.7
++ grep -v INFO
++ tail -n 1
+ SCALA_VERSION=2.11
++ /home/ghl/softwares/spark-2.3.0/build/mvn help:evaluate -Dexpression=hadoop.version -Pyarn,hadoop-provided,hadoop-2.7,parquet-provided,orc-provided -Dhadoop.version=2.7.7
++ grep -v INFO
++ tail -n 1
+ SPARK_HADOOP_VERSION=2.7.7
++ /home/ghl/softwares/spark-2.3.0/build/mvn help:evaluate -Dexpression=project.activeProfiles -pl sql/hive -Pyarn,hadoop-provided,hadoop-2.7,parquet-provided,orc-provided -Dhadoop.version=2.7.7
++ grep -v INFO
++ fgrep --count 'hive'
++ echo -n

3. The compiled Spark distribution has been uploaded:

https://download.csdn.net/download/ghl0451/12101209

It is still under review at the moment.
