Building Spark 2.4.0 from Source


Downloading the Source

Download the latest Spark source from GitHub (at the time of writing, the master branch builds as version 3.0.0-SNAPSHOT):
https://github.com/apache/spark

Apache Maven (Building with Maven)

The Maven-based build has the following version requirements:
Maven: 3.5.4 or newer
Java: 8 or newer

Set the memory available to Maven:

export MAVEN_OPTS="-Xmx2g -XX:ReservedCodeCacheSize=512m"

If these options are not set, the build may fail with an error like:

[INFO] Compiling 203 Scala sources and 9 Java sources to /Users/me/Development/spark/core/target/scala-2.11/classes...
[ERROR] Java heap space -> [Help 1]

build/mvn

Spark ships with a self-contained build script that automatically downloads and installs the Maven, Scala, and Zinc versions the build requires.
Build command:

./build/mvn -DskipTests clean package

On macOS, if you switched your shell from bash to zsh and never configured the JAVA_HOME environment variable in .zshrc, the build may fail with:

Cannot run program "/Library/Java/JavaVirtualMachines/jdk1.8.0_181.jdk/Contents/Home/jre/bin/javac": error=2, No such file or directory

Setting JAVA_HOME in the ~/.zshrc configuration file fixes this.
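
For example, a minimal ~/.zshrc snippet (using macOS's /usr/libexec/java_home helper to locate the installed JDK; persisting the MAVEN_OPTS setting from above here is optional but convenient):

# ~/.zshrc
# Resolve the installed JDK 8 via macOS's java_home helper
export JAVA_HOME=$(/usr/libexec/java_home -v 1.8)
# Optionally persist the Maven memory settings as well
export MAVEN_OPTS="-Xmx2g -XX:ReservedCodeCacheSize=512m"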

building...

stefan@localhost  ~/Documents/workspace/code/spark   master  ./build/mvn -DskipTests clean package
Using `mvn` from path: /Users/stefan/Documents/workspace/code/spark/build/apache-maven-3.6.0/bin/mvn
[INFO] Scanning for projects...
...
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary for Spark Project Parent POM 3.0.0-SNAPSHOT:
[INFO]
[INFO] Spark Project Parent POM ........................... SUCCESS [  4.010 s]
[INFO] Spark Project Tags ................................. SUCCESS [  7.204 s]
[INFO] Spark Project Sketch ............................... SUCCESS [  6.099 s]
[INFO] Spark Project Local DB ............................. SUCCESS [  3.870 s]
[INFO] Spark Project Networking ........................... SUCCESS [  8.308 s]
[INFO] Spark Project Shuffle Streaming Service ............ SUCCESS [  3.860 s]
[INFO] Spark Project Unsafe ............................... SUCCESS [  6.418 s]
[INFO] Spark Project Launcher ............................. SUCCESS [  5.159 s]
[INFO] Spark Project Core ................................. SUCCESS [02:01 min]
[INFO] Spark Project ML Local Library ..................... SUCCESS [  5.823 s]
[INFO] Spark Project GraphX ............................... SUCCESS [  8.543 s]
[INFO] Spark Project Streaming ............................ SUCCESS [ 21.891 s]
[INFO] Spark Project Catalyst ............................. SUCCESS [01:15 min]
[INFO] Spark Project SQL .................................. SUCCESS [02:28 min]
[INFO] Spark Project ML Library ........................... SUCCESS [01:13 min]
[INFO] Spark Project Tools ................................ SUCCESS [  1.534 s]
[INFO] Spark Project Hive ................................. SUCCESS [ 56.505 s]
[INFO] Spark Project REPL ................................. SUCCESS [  5.497 s]
[INFO] Spark Project Assembly ............................. SUCCESS [  4.034 s]
[INFO] Spark Integration for Kafka 0.10 ................... SUCCESS [  6.713 s]
[INFO] Kafka 0.10+ Token Provider for Streaming ........... SUCCESS [  2.156 s]
[INFO] Kafka 0.10+ Source for Structured Streaming ........ SUCCESS [  9.314 s]
[INFO] Spark Project Examples ............................. SUCCESS [ 14.136 s]
[INFO] Spark Integration for Kafka 0.10 Assembly .......... SUCCESS [  3.357 s]
[INFO] Spark Avro ......................................... SUCCESS [  5.773 s]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  10:09 min
[INFO] Finished at: 2019-03-17T11:11:29+08:00
[INFO] ------------------------------------------------------------------------

Building a Runnable Distribution

Spark provides a script for building a runnable binary distribution: ./dev/make-distribution.sh

The meaning of each of the script's options can be viewed with:

./dev/make-distribution.sh --help

✘ stefan@localhost  ~/Documents/workspace/code/spark   master  ./dev/make-distribution.sh --help
+++ dirname ./dev/make-distribution.sh
++ cd ./dev/..
++ pwd
+ SPARK_HOME=/Users/didi/Documents/workspace/code/spark
+ DISTDIR=/Users/didi/Documents/workspace/code/spark/dist
+ MAKE_TGZ=false
+ MAKE_PIP=false
+ MAKE_R=false
+ NAME=none
+ MVN=/Users/didi/Documents/workspace/code/spark/build/mvn
+ ((  1  ))
+ case $1 in
+ exit_with_usage
+ echo 'make-distribution.sh - tool for making binary distributions of Spark'
make-distribution.sh - tool for making binary distributions of Spark
+ echo ''

+ echo usage:
usage:
+ cl_options='[--name] [--tgz] [--pip] [--r] [--mvn ]'
+ echo 'make-distribution.sh [--name] [--tgz] [--pip] [--r] [--mvn ] '
make-distribution.sh [--name] [--tgz] [--pip] [--r] [--mvn ] 
+ echo 'See Spark'\''s "Building Spark" doc for correct Maven options.'
See Spark's "Building Spark" doc for correct Maven options.
+ echo ''

+ exit 1

Build command:

./dev/make-distribution.sh --name custom-spark --pip --r --tgz -Psparkr -Phadoop-2.7 -Phive -Phive-thriftserver -Pmesos -Pyarn -Pkubernetes

The command above builds a Spark distribution together with the Python pip and R packages. Before running it, make sure R is installed locally.
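
If the build succeeds, the staged distribution lands in dist/ and, because of --tgz, a tarball named spark-<version>-bin-<name>.tgz is written to the project root. A quick way to check (the exact file name below is illustrative for this 3.0.0-SNAPSHOT build):

ls dist/                                        # the staged distribution
ls spark-3.0.0-SNAPSHOT-bin-custom-spark.tgz    # the --tgz output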

Specifying the Hadoop Version and Enabling YARN

The Hadoop version to compile against can be set through the hadoop.version property; if unspecified, Spark builds against Hadoop 2.6.X by default. To enable YARN support, add the yarn profile (-Pyarn).
Build commands:

# Apache Hadoop 2.6.X
./build/mvn -Pyarn -DskipTests clean package

# Apache Hadoop 2.7.X and later
./build/mvn -Pyarn -Phadoop-2.7 -Dhadoop.version=2.7.7 -DskipTests clean package

building...

✘ stefan@localhost  ~/Documents/workspace/code/spark   master  ./build/mvn -Pyarn -Phadoop-2.7 -Dhadoop.version=2.7.7 -DskipTests clean package
Using `mvn` from path: /Users/didi/Documents/workspace/code/spark/build/apache-maven-3.6.0/bin/mvn
[WARNING]
[WARNING] Some problems were encountered while building the effective toolchains
[WARNING] expected START_TAG or END_TAG not TEXT (position: TEXT seen ...\n   \n  -->z\n\n
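
One way to sanity-check which Hadoop version was actually bundled is to list the jars collected under the assembly module (the scala-2.12 directory name assumes the Scala 2.12 build used throughout this walkthrough):

# hadoop-* jars should carry the requested 2.7.7 version suffix
ls assembly/target/scala-2.12/jars | grep hadoop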

Building With Hive and JDBC Support

To enable Hive integration for Spark SQL along with the JDBC server, add the -Phive and -Phive-thriftserver profiles; if no Hive version is specified, Spark builds against Hive 1.2.1 by default.

Build command:

# With Hive 1.2.1 support
./build/mvn -Pyarn -Phive -Phive-thriftserver -DskipTests clean package

building...

 stefan@localhost  ~/Documents/workspace/code/spark   master  ./build/mvn -Pyarn -Phive -Phive-thriftserver -DskipTests clean package
Using `mvn` from path: /Users/didi/Documents/workspace/code/spark/build/apache-maven-3.6.0/bin/mvn
[WARNING]
[WARNING] Some problems were encountered while building the effective toolchains
[WARNING] expected START_TAG or END_TAG not TEXT (position: TEXT seen ...\n   \n  -->z\n\n
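
With -Phive and -Phive-thriftserver, the Hive integration jars are added to the assembly and the Spark SQL CLI becomes usable. A quick check, under the same assumed assembly path as above:

# Hive integration jars should now appear in the assembly output
ls assembly/target/scala-2.12/jars | grep hive
# the Spark SQL CLI only works in builds that include -Phive
./bin/spark-sql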

Packaging without Hadoop Dependencies for YARN

Building with the hadoop-provided profile excludes the Hadoop dependencies from the resulting package, on the assumption that the runtime environment (e.g. YARN) provides them.
Build command:

./build/mvn -Dhadoop-provided -DskipTests clean package

building...

stefan@localhost  ~/Documents/workspace/code/spark   master  ./build/mvn -Dhadoop-provided -DskipTests clean package
Using `mvn` from path: /Users/didi/Documents/workspace/code/spark/build/apache-maven-3.6.0/bin/mvn
[WARNING]
[WARNING] Some problems were encountered while building the effective toolchains
[WARNING] expected START_TAG or END_TAG not TEXT (position: TEXT seen ...\n   \n  -->z\n\n
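
Note that a "Hadoop free" build like this cannot find the Hadoop classes on its own at runtime. Per Spark's documentation, point it at the cluster's Hadoop installation, for example in conf/spark-env.sh:

# conf/spark-env.sh
# Make the cluster-provided Hadoop jars visible to Spark at runtime
export SPARK_DIST_CLASSPATH=$(hadoop classpath)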

This covers several of the most common ways to build Spark.

Verifying the Build

Start spark-shell:

✘ didi@localhost  ~/Documents/workspace/code/spark   master  ./bin/spark-shell
19/03/17 13:58:02 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://bogon:4040
Spark context available as 'sc' (master = local[*], app id = local-1552805589600).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.0.0-SNAPSHOT
      /_/

Using Scala version 2.12.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_181)
Type in expressions to have them evaluated.
Type :help for more information.

scala>

The spark-shell REPL opens successfully, which confirms the build works.
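
As a final smoke test, a trivial job should run end to end (the result line below is the expected output, not captured from this session):

scala> spark.range(100).count()
res0: Long = 100
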
In a later post, we will look at how to develop and test the Spark source code locally.
Reference: http://spark.apache.org/docs/latest/building-spark.html
