1. Download the Spark source
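The Spark website offers prebuilt Spark packages, but to get a Spark build with Hive support compiled against your own Hadoop version, you need to download the Spark source and build it yourself with the Hive profiles enabled.
I downloaded Spark 2.3.1, compiled it with Maven 3.5.4, and then packaged the compiled Spark for cluster deployment. The Hadoop version is 2.7.6, the Hive version is 1.2.2, and the Scala version is 2.11.8. The operating system is macOS; the procedure is the same on Ubuntu. The official build requirements for Spark 2.3.1 state that building Spark with Maven requires Maven 3.3.9 or newer and Java 8+.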
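To make the Maven requirement concrete, here is a small shell sketch that compares a dotted version string against the 3.3.9 minimum. The helper name version_ge is made up for illustration, and the parsing assumes that the first line of mvn -version has the form "Apache Maven <version> ...". Treat it as a rough check under those assumptions, not as an official tool.

```shell
#!/bin/sh
# Hypothetical helper: return 0 if dotted version $1 >= dotted version $2.
# Only purely numeric fields are supported (for example 3.5.4, not 1.8.0_73).
version_ge() {
  awk -v a="$1" -v b="$2" 'BEGIN {
    n = split(a, x, "."); m = split(b, y, ".")
    len = (n > m) ? n : m
    for (i = 1; i <= len; i++) {
      p = (i <= n) ? x[i] + 0 : 0   # missing fields count as 0
      q = (i <= m) ? y[i] + 0 : 0
      if (p > q) exit 0
      if (p < q) exit 1
    }
    exit 0                          # all fields equal
  }'
}

# Assumption: the first line of the output looks like "Apache Maven 3.5.4 (...)".
if command -v mvn >/dev/null 2>&1; then
  mvn_version=$(mvn -version 2>/dev/null | awk '/^Apache Maven/ {print $3; exit}')
  if version_ge "$mvn_version" 3.3.9; then
    echo "Maven $mvn_version satisfies the 3.3.9 minimum"
  else
    echo "Maven '$mvn_version' is older than 3.3.9 or could not be parsed" >&2
  fi
else
  echo "mvn not found on PATH" >&2
fi
```

The comparison is done field by field as numbers, so 3.10 is correctly treated as newer than 3.9.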
2. Install and configure Maven
QiColindeMacBook-Air:~ Colin$ mvn -version
Apache Maven 3.5.4 (1edded0938998edf8bf061f1ceb3cfdeccf443fe; 2018-06-18T02:33:14+08:00)
Maven home: /usr/local/Cellar/maven/3.5.4/libexec
Java version: 1.8.0_73, vendor: Oracle Corporation, runtime: /Library/Java/JavaVirtualMachines/jdk1.8.0_73.jdk/Contents/Home/jre
Default locale: zh_CN, platform encoding: UTF-8
OS name: "mac os x", version: "10.13.5", arch: "x86_64", family: "mac"
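Open ~/.bash_profile in an editor:
vim ~/.bash_profile
Set MAVEN_HOME, PATH, and MAVEN_OPTS as shown below, then run source ~/.bash_profile so that the environment variables take effect.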
export MAVEN_HOME=/usr/local/Cellar/maven/3.5.4
export PATH=$PATH:$MAVEN_HOME/bin
export MAVEN_OPTS="-Xmx2g -XX:ReservedCodeCacheSize=512m"
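Each time the profile is sourced, a plain PATH=$PATH:... line appends the same directory again. A minimal sketch of a guard against that is shown here, assuming the Homebrew install path used above. The helper path_contains is a hypothetical name, not something Maven or the shell provides.

```shell
#!/bin/sh
# Hypothetical helper: return 0 if directory $1 is already an entry
# of the colon-separated list $2.
path_contains() {
  case ":$2:" in
    *":$1:"*) return 0 ;;
    *) return 1 ;;
  esac
}

# Assumption: Maven was installed by Homebrew at the path shown in this article.
MAVEN_HOME=${MAVEN_HOME:-/usr/local/Cellar/maven/3.5.4}

# Append the Maven bin directory only if it is not on PATH yet.
if ! path_contains "$MAVEN_HOME/bin" "$PATH"; then
  PATH=$PATH:$MAVEN_HOME/bin
fi

MAVEN_OPTS="-Xmx2g -XX:ReservedCodeCacheSize=512m"
export MAVEN_HOME PATH MAVEN_OPTS
```

Wrapping the list in colons before matching avoids false positives such as /a matching /a/bin.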
3. Compile the Spark source
From the root of the Spark 2.3.1 source tree, run the change-scala-version.sh script located in the dev folder, passing 2.11 as the argument.
QiColindeMacBook-Air:~ Colin$ cd /Users/Colin/Downloads/spark-2.3.1
QiColindeMacBook-Air:spark-2.3.1 Colin$ ./dev/change-scala-version.sh 2.11
./dev/../resource-managers/yarn/pom.xml
./dev/../resource-managers/mesos/pom.xml
./dev/../resource-managers/kubernetes/core/pom.xml
./dev/../launcher/pom.xml
./dev/../tools/pom.xml
./dev/../core/pom.xml
./dev/../hadoop-cloud/pom.xml
./dev/../assembly/pom.xml
./dev/../graphx/pom.xml
./dev/../mllib/pom.xml
./dev/../repl/pom.xml
./dev/../pom.xml
./dev/../streaming/pom.xml
./dev/../mllib-local/pom.xml
./dev/../common/network-yarn/pom.xml
./dev/../common/network-common/pom.xml
./dev/../common/network-shuffle/pom.xml
./dev/../common/tags/pom.xml
./dev/../common/unsafe/pom.xml
./dev/../common/kvstore/pom.xml
./dev/../common/sketch/pom.xml
./dev/../examples/pom.xml
./dev/../external/kafka-0-10-assembly/pom.xml
./dev/../external/kinesis-asl-assembly/pom.xml
./dev/../external/flume/pom.xml
./dev/../external/flume-sink/pom.xml
./dev/../external/kafka-0-10-sql/pom.xml
./dev/../external/kafka-0-10/pom.xml
./dev/../external/kinesis-asl/pom.xml
./dev/../external/flume-assembly/pom.xml
./dev/../external/docker-integration-tests/pom.xml
./dev/../external/spark-ganglia-lgpl/pom.xml
./dev/../external/kafka-0-8-assembly/pom.xml
./dev/../external/kafka-0-8/pom.xml
./dev/../sql/core/pom.xml
./dev/../sql/catalyst/pom.xml
./dev/../sql/hive/pom.xml
./dev/../sql/hive-thriftserver/pom.xml
./dev/../docs/_plugins/copy_api_dirs.rb
QiColindeMacBook-Air:spark-2.3.1 Colin$
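Run the following command to start compiling the Spark source:
mvn -Pyarn -Phadoop-2.7 -Dhadoop.version=2.7.6 -Phive -Phive-thriftserver -DskipTests clean package
The profiles make the resulting build support YARN, Hadoop 2.7, and Hive (including the Hive Thrift server). The jars are generated inside the source tree. The Hadoop version must match the Hadoop version installed on the cluster. Expect the build to take half an hour or more (about 47 minutes in the run shown here). Many warnings appear during the build; they can be ignored.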
QiColindeMacBook-Air:spark-2.3.1 Colin$ mvn -Pyarn -Phadoop-2.7 -Dhadoop.version=2.7.6 -Phive -Phive-thriftserver -DskipTests clean package
[INFO] Scanning for projects...
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Build Order:
[INFO]
[INFO] Spark Project Parent POM [pom]
[INFO] Spark Project Tags [jar]
[INFO] Spark Project Sketch [jar]
[INFO] Spark Project Local DB [jar]
[INFO] Spark Project Networking [jar]
[INFO] Spark Project Shuffle Streaming Service [jar]
[INFO] Spark Project Unsafe [jar]
[INFO] Spark Project Launcher [jar]
[INFO] Spark Project Core [jar]
[INFO] Spark Project ML Local Library [jar]
[INFO] Spark Project GraphX [jar]
[INFO] Spark Project Streaming [jar]
[INFO] Spark Project Catalyst [jar]
[INFO] Spark Project SQL [jar]
[INFO] Spark Project ML Library [jar]
[INFO] Spark Project Tools [jar]
[INFO] Spark Project Hive [jar]
[INFO] Spark Project REPL [jar]
[INFO] Spark Project YARN Shuffle Service [jar]
[INFO] Spark Project YARN [jar]
[INFO] Spark Project Hive Thrift Server [jar]
[INFO] Spark Project Assembly [pom]
[INFO] Spark Integration for Kafka 0.10 [jar]
[INFO] Kafka 0.10 Source for Structured Streaming [jar]
[INFO] Spark Project Examples [jar]
[INFO] Spark Integration for Kafka 0.10 Assembly [jar]
[INFO]
[INFO] -----------------< org.apache.spark:spark-parent_2.11 >-----------------
[INFO] Building Spark Project Parent POM 2.3.1 [1/26]
[INFO] --------------------------------[ pom ]---------------------------------
When the build completes, Maven prints a summary showing that it succeeded.
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary:
[INFO]
[INFO] Spark Project Parent POM 2.3.1 ..................... SUCCESS [ 10.405 s]
[INFO] Spark Project Tags ................................. SUCCESS [ 18.758 s]
[INFO] Spark Project Sketch ............................... SUCCESS [ 15.612 s]
[INFO] Spark Project Local DB ............................. SUCCESS [ 7.689 s]
[INFO] Spark Project Networking ........................... SUCCESS [ 14.848 s]
[INFO] Spark Project Shuffle Streaming Service ............ SUCCESS [ 7.291 s]
[INFO] Spark Project Unsafe ............................... SUCCESS [ 23.968 s]
[INFO] Spark Project Launcher ............................. SUCCESS [ 15.745 s]
[INFO] Spark Project Core ................................. SUCCESS [05:25 min]
[INFO] Spark Project ML Local Library ..................... SUCCESS [ 46.166 s]
[INFO] Spark Project GraphX ............................... SUCCESS [01:05 min]
[INFO] Spark Project Streaming ............................ SUCCESS [01:59 min]
[INFO] Spark Project Catalyst ............................. SUCCESS [03:51 min]
[INFO] Spark Project SQL .................................. SUCCESS [07:46 min]
[INFO] Spark Project ML Library ........................... SUCCESS [07:09 min]
[INFO] Spark Project Tools ................................ SUCCESS [ 17.955 s]
[INFO] Spark Project Hive ................................. SUCCESS [05:05 min]
[INFO] Spark Project REPL ................................. SUCCESS [01:04 min]
[INFO] Spark Project YARN Shuffle Service ................. SUCCESS [ 33.794 s]
[INFO] Spark Project YARN ................................. SUCCESS [02:29 min]
[INFO] Spark Project Hive Thrift Server ................... SUCCESS [02:16 min]
[INFO] Spark Project Assembly ............................. SUCCESS [ 12.661 s]
[INFO] Spark Integration for Kafka 0.10 ................... SUCCESS [01:20 min]
[INFO] Kafka 0.10 Source for Structured Streaming ......... SUCCESS [01:41 min]
[INFO] Spark Project Examples ............................. SUCCESS [01:33 min]
[INFO] Spark Integration for Kafka 0.10 Assembly 2.3.1 .... SUCCESS [ 9.186 s]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 46:42 min
[INFO] Finished at: 2018-07-14T09:51:39+08:00
[INFO] ------------------------------------------------------------------------
QiColindeMacBook-Air:spark-2.3.1 Colin$
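Because the Hadoop version passed to Maven has to match the cluster, it can help to derive the build arguments from the installed Hadoop instead of typing them by hand. The sketch below assumes that hadoop version prints "Hadoop <version>" on its first line. The helpers hadoop_profile and spark_build_args are hypothetical names. The derived profile (for example hadoop-2.7) is only usable if the Spark source tree actually defines a profile with that name, so check pom.xml before relying on it.

```shell
#!/bin/sh
# Hypothetical helper: map a full Hadoop version such as 2.7.6
# to a Maven profile name such as hadoop-2.7.
hadoop_profile() {
  echo "hadoop-$(echo "$1" | cut -d. -f1,2)"
}

# Hypothetical helper: assemble the Maven arguments used in this article
# for a given Hadoop version.
spark_build_args() {
  echo "-Pyarn -P$(hadoop_profile "$1") -Dhadoop.version=$1 -Phive -Phive-thriftserver -DskipTests clean package"
}

# Assumption: the first line of `hadoop version` is "Hadoop <version>".
if command -v hadoop >/dev/null 2>&1; then
  cluster_version=$(hadoop version 2>/dev/null | awk 'NR == 1 {print $2}')
else
  cluster_version=2.7.6   # fall back to the version used in this article
fi

# Print the command instead of running it, so it can be reviewed first.
echo "mvn $(spark_build_args "$cluster_version")"
```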
4. Package and deploy
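After the Maven build succeeds, run the make-distribution.sh script in the dev folder of the source tree:
./dev/make-distribution.sh --name 2.7.6hive --tgz -Pyarn -Phadoop-2.7 -Dhadoop.version=2.7.6 -Phive -Phive-thriftserver -DskipTests
This packages the build output and produces spark-2.3.1-bin-2.7.6hive.tgz in the root of the source tree, ready to deploy. As the trace shows, the script runs mvn clean package again with the same profiles, so this step also takes about half an hour.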
QiColindeMacBook-Air:spark-2.3.1 Colin$ ./dev/./make-distribution.sh --name 2.7.6hive --tgz -Pyarn -Phadoop-2.7 -Dhadoop.version=2.7.6 -Phive -Phive-thriftserver -DskipTests
+++ dirname ./dev/./make-distribution.sh
++ cd ./dev/./..
++ pwd
+ SPARK_HOME=/Users/Colin/Downloads/spark-2.3.1
+ DISTDIR=/Users/Colin/Downloads/spark-2.3.1/dist
+ MAKE_TGZ=false
+ MAKE_PIP=false
+ MAKE_R=false
+ NAME=none
+ MVN=/Users/Colin/Downloads/spark-2.3.1/build/mvn
+ (( 9 ))
+ case $1 in
+ NAME=2.7.6hive
+ shift
+ shift
+ (( 7 ))
+ case $1 in
+ MAKE_TGZ=true
+ shift
+ (( 6 ))
+ case $1 in
+ break
+ '[' -z /Library/Java/JavaVirtualMachines/jdk1.8.0_73.jdk/Contents/Home ']'
+ '[' -z /Library/Java/JavaVirtualMachines/jdk1.8.0_73.jdk/Contents/Home ']'
++ command -v git
+ '[' /usr/local/bin/git ']'
++ git rev-parse --short HEAD
++ :
+ GITREV=
+ '[' '!' -z '' ']'
+ unset GITREV
++ command -v /Users/Colin/Downloads/spark-2.3.1/build/mvn
+ '[' '!' /Users/Colin/Downloads/spark-2.3.1/build/mvn ']'
++ /Users/Colin/Downloads/spark-2.3.1/build/mvn help:evaluate -Dexpression=project.version -Pyarn -Phadoop-2.7 -Dhadoop.version=2.7.6 -Phive -Phive-thriftserver -DskipTests
++ grep -v INFO
++ tail -n 1
+ VERSION=2.3.1
++ /Users/Colin/Downloads/spark-2.3.1/build/mvn help:evaluate -Dexpression=scala.binary.version -Pyarn -Phadoop-2.7 -Dhadoop.version=2.7.6 -Phive -Phive-thriftserver -DskipTests
++ grep -v INFO
++ tail -n 1
+ SCALA_VERSION=2.11
++ /Users/Colin/Downloads/spark-2.3.1/build/mvn help:evaluate -Dexpression=hadoop.version -Pyarn -Phadoop-2.7 -Dhadoop.version=2.7.6 -Phive -Phive-thriftserver -DskipTests
++ grep -v INFO
++ tail -n 1
+ SPARK_HADOOP_VERSION=2.7.6
++ /Users/Colin/Downloads/spark-2.3.1/build/mvn help:evaluate -Dexpression=project.activeProfiles -pl sql/hive -Pyarn -Phadoop-2.7 -Dhadoop.version=2.7.6 -Phive -Phive-thriftserver -DskipTests
++ grep -v INFO
++ fgrep --count 'hive '
++ echo -n
+ SPARK_HIVE=1
+ '[' 2.7.6hive == none ']'
+ echo 'Spark version is 2.3.1'
Spark version is 2.3.1
+ '[' true == true ']'
+ echo 'Making spark-2.3.1-bin-2.7.6hive.tgz'
Making spark-2.3.1-bin-2.7.6hive.tgz
+ cd /Users/Colin/Downloads/spark-2.3.1
+ export 'MAVEN_OPTS=-Xmx2g -XX:ReservedCodeCacheSize=512m'
+ MAVEN_OPTS='-Xmx2g -XX:ReservedCodeCacheSize=512m'
+ BUILD_COMMAND=("$MVN" -T 1C clean package -DskipTests $@)
+ echo -e '\nBuilding with...'
Building with...
+ echo -e '$ /Users/Colin/Downloads/spark-2.3.1/build/mvn' -T 1C clean package -DskipTests -Pyarn -Phadoop-2.7 -Dhadoop.version=2.7.6 -Phive -Phive-thriftserver '-DskipTests\n'
$ /Users/Colin/Downloads/spark-2.3.1/build/mvn -T 1C clean package -DskipTests -Pyarn -Phadoop-2.7 -Dhadoop.version=2.7.6 -Phive -Phive-thriftserver -DskipTests
+ /Users/Colin/Downloads/spark-2.3.1/build/mvn -T 1C clean package -DskipTests -Pyarn -Phadoop-2.7 -Dhadoop.version=2.7.6 -Phive -Phive-thriftserver -DskipTests
Using `mvn` from path: /usr/local/bin/mvn
[INFO] Scanning for projects...
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Build Order:
[INFO]
[INFO] Spark Project Parent POM [pom]
[INFO] Spark Project Tags [jar]
[INFO] Spark Project Sketch [jar]
[INFO] Spark Project Local DB [jar]
[INFO] Spark Project Networking [jar]
[INFO] Spark Project Shuffle Streaming Service [jar]
[INFO] Spark Project Unsafe [jar]
[INFO] Spark Project Launcher [jar]
[INFO] Spark Project Core [jar]
[INFO] Spark Project ML Local Library [jar]
[INFO] Spark Project GraphX [jar]
[INFO] Spark Project Streaming [jar]
[INFO] Spark Project Catalyst [jar]
[INFO] Spark Project SQL [jar]
[INFO] Spark Project ML Library [jar]
[INFO] Spark Project Tools [jar]
[INFO] Spark Project Hive [jar]
[INFO] Spark Project REPL [jar]
[INFO] Spark Project YARN Shuffle Service [jar]
[INFO] Spark Project YARN [jar]
[INFO] Spark Project Hive Thrift Server [jar]
[INFO] Spark Project Assembly [pom]
[INFO] Spark Integration for Kafka 0.10 [jar]
[INFO] Kafka 0.10 Source for Structured Streaming [jar]
[INFO] Spark Project Examples [jar]
[INFO] Spark Integration for Kafka 0.10 Assembly [jar]
[INFO]
[INFO] Using the MultiThreadedBuilder implementation with a thread count of 4
[INFO]
[INFO] -----------------< org.apache.spark:spark-parent_2.11 >-----------------
[INFO] Building Spark Project Parent POM 2.3.1 [1/26]
[INFO] --------------------------------[ pom ]---------------------------------
Compilation and packaging succeeded.
+ for f in '"$DISTDIR"/examples/jars/*'
++ basename /Users/Colin/Downloads/spark-2.3.1/dist/examples/jars/aircompressor-0.8.jar
+ name=aircompressor-0.8.jar
+ '[' -f /Users/Colin/Downloads/spark-2.3.1/dist/jars/aircompressor-0.8.jar ']'
+ rm /Users/Colin/Downloads/spark-2.3.1/dist/examples/jars/aircompressor-0.8.jar
+ for f in '"$DISTDIR"/examples/jars/*'
++ basename /Users/Colin/Downloads/spark-2.3.1/dist/examples/jars/commons-codec-1.10.jar
+ name=commons-codec-1.10.jar
+ '[' -f /Users/Colin/Downloads/spark-2.3.1/dist/jars/commons-codec-1.10.jar ']'
+ rm /Users/Colin/Downloads/spark-2.3.1/dist/examples/jars/commons-codec-1.10.jar
+ for f in '"$DISTDIR"/examples/jars/*'
++ basename /Users/Colin/Downloads/spark-2.3.1/dist/examples/jars/commons-lang-2.6.jar
+ name=commons-lang-2.6.jar
+ '[' -f /Users/Colin/Downloads/spark-2.3.1/dist/jars/commons-lang-2.6.jar ']'
+ rm /Users/Colin/Downloads/spark-2.3.1/dist/examples/jars/commons-lang-2.6.jar
+ for f in '"$DISTDIR"/examples/jars/*'
++ basename /Users/Colin/Downloads/spark-2.3.1/dist/examples/jars/kryo-shaded-3.0.3.jar
+ name=kryo-shaded-3.0.3.jar
+ '[' -f /Users/Colin/Downloads/spark-2.3.1/dist/jars/kryo-shaded-3.0.3.jar ']'
+ rm /Users/Colin/Downloads/spark-2.3.1/dist/examples/jars/kryo-shaded-3.0.3.jar
+ for f in '"$DISTDIR"/examples/jars/*'
++ basename /Users/Colin/Downloads/spark-2.3.1/dist/examples/jars/minlog-1.3.0.jar
+ name=minlog-1.3.0.jar
+ '[' -f /Users/Colin/Downloads/spark-2.3.1/dist/jars/minlog-1.3.0.jar ']'
+ rm /Users/Colin/Downloads/spark-2.3.1/dist/examples/jars/minlog-1.3.0.jar
+ for f in '"$DISTDIR"/examples/jars/*'
++ basename /Users/Colin/Downloads/spark-2.3.1/dist/examples/jars/objenesis-2.1.jar
+ name=objenesis-2.1.jar
+ '[' -f /Users/Colin/Downloads/spark-2.3.1/dist/jars/objenesis-2.1.jar ']'
+ rm /Users/Colin/Downloads/spark-2.3.1/dist/examples/jars/objenesis-2.1.jar
+ for f in '"$DISTDIR"/examples/jars/*'
++ basename /Users/Colin/Downloads/spark-2.3.1/dist/examples/jars/orc-core-1.4.4-nohive.jar
+ name=orc-core-1.4.4-nohive.jar
+ '[' -f /Users/Colin/Downloads/spark-2.3.1/dist/jars/orc-core-1.4.4-nohive.jar ']'
+ rm /Users/Colin/Downloads/spark-2.3.1/dist/examples/jars/orc-core-1.4.4-nohive.jar
+ for f in '"$DISTDIR"/examples/jars/*'
++ basename /Users/Colin/Downloads/spark-2.3.1/dist/examples/jars/orc-mapreduce-1.4.4-nohive.jar
+ name=orc-mapreduce-1.4.4-nohive.jar
+ '[' -f /Users/Colin/Downloads/spark-2.3.1/dist/jars/orc-mapreduce-1.4.4-nohive.jar ']'
+ rm /Users/Colin/Downloads/spark-2.3.1/dist/examples/jars/orc-mapreduce-1.4.4-nohive.jar
+ for f in '"$DISTDIR"/examples/jars/*'
++ basename /Users/Colin/Downloads/spark-2.3.1/dist/examples/jars/scopt_2.11-3.7.0.jar
+ name=scopt_2.11-3.7.0.jar
+ '[' -f /Users/Colin/Downloads/spark-2.3.1/dist/jars/scopt_2.11-3.7.0.jar ']'
+ for f in '"$DISTDIR"/examples/jars/*'
++ basename /Users/Colin/Downloads/spark-2.3.1/dist/examples/jars/spark-examples_2.11-2.3.1.jar
+ name=spark-examples_2.11-2.3.1.jar
+ '[' -f /Users/Colin/Downloads/spark-2.3.1/dist/jars/spark-examples_2.11-2.3.1.jar ']'
+ mkdir -p /Users/Colin/Downloads/spark-2.3.1/dist/examples/src/main
+ cp -r /Users/Colin/Downloads/spark-2.3.1/examples/src/main /Users/Colin/Downloads/spark-2.3.1/dist/examples/src/
+ cp /Users/Colin/Downloads/spark-2.3.1/LICENSE /Users/Colin/Downloads/spark-2.3.1/dist
+ cp -r /Users/Colin/Downloads/spark-2.3.1/licenses /Users/Colin/Downloads/spark-2.3.1/dist
+ cp /Users/Colin/Downloads/spark-2.3.1/NOTICE /Users/Colin/Downloads/spark-2.3.1/dist
+ '[' -e /Users/Colin/Downloads/spark-2.3.1/CHANGES.txt ']'
+ cp -r /Users/Colin/Downloads/spark-2.3.1/data /Users/Colin/Downloads/spark-2.3.1/dist
+ '[' false == true ']'
+ echo 'Skipping building python distribution package'
Skipping building python distribution package
+ '[' false == true ']'
+ echo 'Skipping building R source package'
Skipping building R source package
+ mkdir /Users/Colin/Downloads/spark-2.3.1/dist/conf
+ cp /Users/Colin/Downloads/spark-2.3.1/conf/docker.properties.template /Users/Colin/Downloads/spark-2.3.1/conf/fairscheduler.xml.template /Users/Colin/Downloads/spark-2.3.1/conf/log4j.properties.template /Users/Colin/Downloads/spark-2.3.1/conf/metrics.properties.template /Users/Colin/Downloads/spark-2.3.1/conf/slaves.template /Users/Colin/Downloads/spark-2.3.1/conf/spark-defaults.conf.template /Users/Colin/Downloads/spark-2.3.1/conf/spark-env.sh.template /Users/Colin/Downloads/spark-2.3.1/dist/conf
+ cp /Users/Colin/Downloads/spark-2.3.1/README.md /Users/Colin/Downloads/spark-2.3.1/dist
+ cp -r /Users/Colin/Downloads/spark-2.3.1/bin /Users/Colin/Downloads/spark-2.3.1/dist
+ cp -r /Users/Colin/Downloads/spark-2.3.1/python /Users/Colin/Downloads/spark-2.3.1/dist
+ '[' false == true ']'
+ cp -r /Users/Colin/Downloads/spark-2.3.1/sbin /Users/Colin/Downloads/spark-2.3.1/dist
+ '[' -d /Users/Colin/Downloads/spark-2.3.1/R/lib/SparkR ']'
+ '[' true == true ']'
+ TARDIR_NAME=spark-2.3.1-bin-2.7.6hive
+ TARDIR=/Users/Colin/Downloads/spark-2.3.1/spark-2.3.1-bin-2.7.6hive
+ rm -rf /Users/Colin/Downloads/spark-2.3.1/spark-2.3.1-bin-2.7.6hive
+ cp -r /Users/Colin/Downloads/spark-2.3.1/dist /Users/Colin/Downloads/spark-2.3.1/spark-2.3.1-bin-2.7.6hive
+ tar czf spark-2.3.1-bin-2.7.6hive.tgz -C /Users/Colin/Downloads/spark-2.3.1 spark-2.3.1-bin-2.7.6hive
+ rm -rf /Users/Colin/Downloads/spark-2.3.1/spark-2.3.1-bin-2.7.6hive
QiColindeMacBook-Air:spark-2.3.1 Colin$
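The trace above shows that the archive is named spark-<version>-bin-<name>.tgz and contains a single top-level directory with the same base name. A short sketch of unpacking it for deployment follows. The helper dist_tarball_name and the INSTALL_DIR location are assumptions made for illustration; adjust them to your environment.

```shell
#!/bin/sh
# Hypothetical helper: build the archive name from the Spark version ($1)
# and the value passed to --name ($2), following the pattern in the trace.
dist_tarball_name() {
  echo "spark-$1-bin-$2.tgz"
}

tarball=$(dist_tarball_name 2.3.1 2.7.6hive)

# Hypothetical install location; override it by setting INSTALL_DIR.
install_dir=${INSTALL_DIR:-$HOME/opt}

# Run this from the root of the Spark source tree, where the archive is written.
if [ -f "$tarball" ]; then
  mkdir -p "$install_dir"
  tar xzf "$tarball" -C "$install_dir"
  echo "unpacked to $install_dir/${tarball%.tgz}"
else
  echo "$tarball not found in $(pwd)" >&2
fi
```

SPARK_HOME can then point at the unpacked directory.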
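Install Spark from the package you just built and configure the environment variables. Then go to the $SPARK_HOME/conf directory and edit spark-env.sh (create it from spark-env.sh.template if it does not exist yet), adding the path to the MySQL JDBC driver jar together with the settings below.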
export CLASSPATH=$CLASSPATH:/usr/local/Cellar/hive/lib
export HIVE_CONF_DIR=/usr/local/Cellar/hive/conf
export SPARK_CLASSPATH=$SPARK_CLASSPATH:/usr/local/Cellar/hive/lib/mysql-connector-java-5.1.46-bin.jar
export SPARK_DIST_CLASSPATH=$(/usr/local/Cellar/hadoop/2.7.6/bin/hadoop classpath)
export SPARK_MASTER_IP=localhost
export SPARK_WORKER_CORES=1
export SPARK_WORKER_INSTANCES=1
export SPARK_WORKER_MEMORY=1G
export SPARK_WORKER_PORT=8888
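Copy Hive's configuration file hive-site.xml into the $SPARK_HOME/conf directory. This completes the Spark configuration. Start Hadoop, Hive, and Spark, then open spark-shell. If import org.apache.spark.sql.hive.HiveContext succeeds, this Spark build can integrate with Hive. (HiveContext has been deprecated since Spark 2.0 in favor of SparkSession with enableHiveSupport(), but the import is still a quick way to confirm that the Hive classes are on the classpath.)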
scala> import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.hive.HiveContext
scala>
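The same thing can be checked without starting spark-shell by looking for the Hive module jar in the distribution. The sketch below assumes that a build made with -Phive places a jar named spark-hive_<scala version>-<spark version>.jar under $SPARK_HOME/jars. The helper has_hive_jar is a hypothetical name.

```shell
#!/bin/sh
# Hypothetical helper: return 0 if directory $1 contains at least one
# spark-hive_*.jar file, otherwise return 1.
has_hive_jar() {
  for jar in "$1"/spark-hive_*.jar; do
    # If nothing matches, the pattern stays literal and -e fails.
    [ -e "$jar" ] && return 0
  done
  return 1
}

# Falls back to the current directory when SPARK_HOME is not set.
jars_dir="${SPARK_HOME:-.}/jars"
if has_hive_jar "$jars_dir"; then
  echo "spark-hive jar found in $jars_dir"
else
  echo "no spark-hive jar found in $jars_dir" >&2
fi
```

This only confirms that the jar was packaged; a query against a real Hive table is still the definitive test that the metastore configuration works.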