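The prerequisite steps are straightforward, so I won't describe them in detail. My local environment is JDK 1.8, Scala 2.12.4, Maven 3.3.9 and Git 2.17.1.
First, clone the source into a local repository; this can take quite a while.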
git clone git@github.com:apache/spark.git -b branch-1.2
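Once the clone finishes, start compiling. Maven's -T 1C option turns on its parallel build support with one thread per CPU core, which speeds up the build; the -e option prints the full error details when something fails.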
mvn -T 1C -DskipTests -e clean install
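To make the -T 1C notation concrete, here is a minimal Java sketch of how such a value maps to a thread count. ThreadSpec and resolveThreads are hypothetical names made up for illustration, not Maven's own code, and the truncation and the minimum of one thread are assumptions of the sketch.

```java
/** Hypothetical helper (not Maven's own code): resolves a -T value to a thread count. */
final class ThreadSpec {

    /**
     * "4"    -> 4 threads, regardless of the core count
     * "1C"   -> 1 thread per core
     * "1.5C" -> 1.5 threads per core, truncated to an int (assumption of this sketch)
     */
    static int resolveThreads(String spec, int cores) {
        if (spec.endsWith("C")) {
            // Strip the trailing "C" and treat the rest as a per-core multiplier.
            double perCore = Double.parseDouble(spec.substring(0, spec.length() - 1));
            // Never go below one thread (assumption of this sketch).
            return Math.max(1, (int) (perCore * cores));
        }
        // A plain number is an absolute thread count.
        return Integer.parseInt(spec);
    }

    public static void main(String[] args) {
        int cores = Runtime.getRuntime().availableProcessors();
        System.out.println("-T 1C on this machine -> " + resolveThreads("1C", cores) + " threads");
    }
}
```

On an 8-core machine, resolveThreads("1C", 8) gives 8, while a plain value such as "4" is used as is.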
The first build failed with the following error:
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary:
[INFO]
[INFO] Spark Project Parent POM ........................... SUCCESS [ 2.824 s]
[INFO] Spark Project Networking ........................... SUCCESS [ 13.824 s]
[INFO] Spark Project Shuffle Streaming Service ............ SUCCESS [ 10.192 s]
[INFO] Spark Project Core ................................. SUCCESS [02:38 min]
[INFO] Spark Project Bagel ................................ SUCCESS [ 27.409 s]
[INFO] Spark Project GraphX ............................... SUCCESS [01:06 min]
[INFO] Spark Project Streaming ............................ SUCCESS [01:40 min]
[INFO] Spark Project Catalyst ............................. SKIPPED
[INFO] Spark Project SQL .................................. SKIPPED
[INFO] Spark Project ML Library ........................... SKIPPED
[INFO] Spark Project Tools ................................ SKIPPED
[INFO] Spark Project Hive ................................. SKIPPED
[INFO] Spark Project REPL ................................. SKIPPED
[INFO] Spark Project Assembly ............................. SKIPPED
[INFO] Spark Project External Twitter ..................... SKIPPED
[INFO] Spark Project External Flume Sink .................. SUCCESS [ 18.871 s]
[INFO] Spark Project External Flume ....................... FAILURE [ 10.060 s]
[INFO] Spark Project External MQTT ........................ SKIPPED
[INFO] Spark Project External ZeroMQ ...................... SKIPPED
[INFO] Spark Project External Kafka ....................... SKIPPED
[INFO] Spark Project Examples ............................. SKIPPED
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal net.alchim31.maven:scala-maven-plugin:3.2.0:compile (scala-compile-first) on project spark-streaming-flume_2.10: Execution scala-compile-first of goal net.alchim31.maven:scala-maven-plugin:3.2.0:compile failed. CompileFailed -> [Help 1]
org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute goal net.alchim31.maven:scala-maven-plugin:3.2.0:compile (scala-compile-first) on project spark-streaming-flume_2.10: Execution scala-compile-first of goal net.alchim31.maven:scala-maven-plugin:3.2.0:compile failed.
...
[ERROR] After correcting the problems, you can resume the build with the command
[ERROR] mvn <goals> -rf :spark-streaming-flume_2.10
After some investigation, the cause turned out to be a dependency missing from the project:
<dependency>
    <groupId>net.alchim31.maven</groupId>
    <artifactId>scala-maven-plugin</artifactId>
    <version>3.2.0</version>
</dependency>
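After adding the dependency above to the project's root pom.xml, I ran the build again. I hit a pitfall here, though: according to the hint at the end of Maven's error output, once the problem is fixed the build can be resumed from where it stopped with the following command:
mvn <goals> -rf :spark-streaming-flume_2.10
So, following the hint, I re-entered the build command as follows: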
mvn -T 1C -DskipTests -e clean install -rf :spark-streaming-flume_2.10
This time the build failed with a different error:
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary:
[INFO]
[INFO] Spark Project External Flume ....................... SUCCESS [ 25.568 s]
[INFO] Spark Project External MQTT ........................ SUCCESS [03:45 min]
[INFO] Spark Project External ZeroMQ ...................... SUCCESS [ 48.209 s]
[INFO] Spark Project External Kafka ....................... SUCCESS [01:49 min]
[INFO] Spark Project Examples ............................. FAILURE [06:12 min]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 09:58 min (Wall Clock)
···
[ERROR] Failed to execute goal on project spark-examples_2.10: Could not resolve dependencies for project org.apache.spark:spark-examples_2.10:jar:1.2.3-SNAPSHOT: The following artifacts could not be resolved: org.apache.spark:spark-mllib_2.10:jar:1.2.3-SNAPSHOT, org.apache.spark:spark-hive_2.10:jar:1.2.3-SNAPSHOT, org.apache.spark:spark-streaming-twitter_2.10:jar:1.2.3-SNAPSHOT: Could not find artifact org.apache.spark:spark-mllib_2.10:jar:1.2.3-SNAPSHOT in apache.snapshots (http://repository.apache.org/snapshots) -> [Help 1]
···
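The output shows that, when the build is resumed this way, Maven simply continues from the failed module onwards and does not rebuild the modules it had skipped earlier that later modules depend on, such as Spark Project ML Library, as shown in the figure below:
So at this point my only option was to clean and rebuild from scratch: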
mvn -T 1C -DskipTests -e clean install
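The resume behaviour that caused the second failure can be modelled in a few lines of Java. This is only a simplified sketch of what the two logs above show, not Maven's implementation; ReactorResume, resumeFrom and neverBuilt are hypothetical names, and the module lists are copied from the first Reactor Summary with the "Spark Project" prefix dropped.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Simplified model of resuming a reactor build; not Maven's own code. */
final class ReactorResume {

    // Module order copied from the Reactor Summary of the first build.
    static final List<String> REACTOR = Arrays.asList(
        "Parent POM", "Networking", "Shuffle Streaming Service", "Core", "Bagel",
        "GraphX", "Streaming", "Catalyst", "SQL", "ML Library", "Tools", "Hive",
        "REPL", "Assembly", "External Twitter", "External Flume Sink",
        "External Flume", "External MQTT", "External ZeroMQ", "External Kafka", "Examples");

    // Modules reported as SKIPPED ahead of the failed module in that build.
    static final List<String> SKIPPED_EARLIER = Arrays.asList(
        "Catalyst", "SQL", "ML Library", "Tools", "Hive", "REPL", "Assembly", "External Twitter");

    /** Modules that run when resuming from the given module: it and everything after it. */
    static List<String> resumeFrom(String module) {
        int start = REACTOR.indexOf(module);
        if (start < 0) {
            throw new IllegalArgumentException("unknown module: " + module);
        }
        return new ArrayList<>(REACTOR.subList(start, REACTOR.size()));
    }

    /** Previously skipped modules that the resumed build will not touch. */
    static List<String> neverBuilt(String module) {
        List<String> missing = new ArrayList<>(SKIPPED_EARLIER);
        missing.removeAll(resumeFrom(module));
        return missing;
    }

    public static void main(String[] args) {
        System.out.println("resumed build runs: " + resumeFrom("External Flume"));
        System.out.println("still never built:  " + neverBuilt("External Flume"));
    }
}
```

In this model, resuming from External Flume leaves ML Library, Hive and External Twitter unbuilt, which matches the three artifacts the Examples module could not resolve. Resuming from Catalyst would leave nothing behind in the model, but whether that works in the real build was not tried here.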
After all that, the rebuild went smoothly and took about 14 minutes in total:
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary:
[INFO]
[INFO] Spark Project Parent POM ........................... SUCCESS [ 3.703 s]
[INFO] Spark Project Networking ........................... SUCCESS [ 16.933 s]
[INFO] Spark Project Shuffle Streaming Service ............ SUCCESS [ 11.269 s]
[INFO] Spark Project Core ................................. SUCCESS [02:44 min]
[INFO] Spark Project Bagel ................................ SUCCESS [ 28.633 s]
[INFO] Spark Project GraphX ............................... SUCCESS [01:12 min]
[INFO] Spark Project Streaming ............................ SUCCESS [01:41 min]
[INFO] Spark Project Catalyst ............................. SUCCESS [02:09 min]
[INFO] Spark Project SQL .................................. SUCCESS [01:38 min]
[INFO] Spark Project ML Library ........................... SUCCESS [05:52 min]
[INFO] Spark Project Tools ................................ SUCCESS [ 32.681 s]
[INFO] Spark Project Hive ................................. SUCCESS [05:33 min]
[INFO] Spark Project REPL ................................. SUCCESS [ 46.894 s]
[INFO] Spark Project Assembly ............................. SUCCESS [ 27.147 s]
[INFO] Spark Project External Twitter ..................... SUCCESS [ 57.706 s]
[INFO] Spark Project External Flume Sink .................. SUCCESS [ 20.473 s]
[INFO] Spark Project External Flume ....................... SUCCESS [01:09 min]
[INFO] Spark Project External MQTT ........................ SUCCESS [ 57.697 s]
[INFO] Spark Project External ZeroMQ ...................... SUCCESS [ 58.706 s]
[INFO] Spark Project External Kafka ....................... SUCCESS [01:12 min]
[INFO] Spark Project Examples ............................. SUCCESS [01:08 min]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 14:10 min (Wall Clock)
···
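Because the 1.2 source is fairly old, I will use the newer 2.2 version for further study, reading it side by side with the book. Switch directly to the 2.2 branch and build it: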
git checkout branch-2.2
After the switch succeeds, start the build with the same command:
mvn -T 1C -DskipTests -e clean install
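CPU utilization during the build is shown in the figure below:
The whole build took about 20 minutes and completed without errors. The build summary is as follows: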
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary:
[INFO]
[INFO] Spark Project Parent POM ........................... SUCCESS [ 23.283 s]
[INFO] Spark Project Tags ................................. SUCCESS [ 12.210 s]
[INFO] Spark Project Sketch ............................... SUCCESS [ 21.720 s]
[INFO] Spark Project Networking ........................... SUCCESS [ 44.170 s]
[INFO] Spark Project Shuffle Streaming Service ............ SUCCESS [ 16.677 s]
[INFO] Spark Project Unsafe ............................... SUCCESS [ 37.268 s]
[INFO] Spark Project Launcher ............................. SUCCESS [ 50.450 s]
[INFO] Spark Project Core ................................. SUCCESS [05:35 min]
[INFO] Spark Project ML Local Library ..................... SUCCESS [01:00 min]
[INFO] Spark Project GraphX ............................... SUCCESS [01:00 min]
[INFO] Spark Project Streaming ............................ SUCCESS [01:53 min]
[INFO] Spark Project Catalyst ............................. SUCCESS [03:52 min]
[INFO] Spark Project SQL .................................. SUCCESS [04:55 min]
[INFO] Spark Project ML Library ........................... SUCCESS [03:36 min]
[INFO] Spark Project Tools ................................ SUCCESS [ 13.582 s]
[INFO] Spark Project Hive ................................. SUCCESS [02:49 min]
[INFO] Spark Project REPL ................................. SUCCESS [ 23.556 s]
[INFO] Spark Project Assembly ............................. SUCCESS [ 3.623 s]
[INFO] Spark Project External Flume Sink .................. SUCCESS [ 44.048 s]
[INFO] Spark Project External Flume ....................... SUCCESS [ 42.474 s]
[INFO] Spark Project External Flume Assembly .............. SUCCESS [ 2.933 s]
[INFO] Spark Integration for Kafka 0.8 .................... SUCCESS [01:09 min]
[INFO] Kafka 0.10 Source for Structured Streaming ......... SUCCESS [ 50.854 s]
[INFO] Spark Project Examples ............................. SUCCESS [ 55.074 s]
[INFO] Spark Project External Kafka Assembly .............. SUCCESS [ 3.453 s]
[INFO] Spark Integration for Kafka 0.10 ................... SUCCESS [ 59.934 s]
[INFO] Spark Integration for Kafka 0.10 Assembly .......... SUCCESS [ 8.501 s]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 20:32 min (Wall Clock)
···
With the build done, run a simple test. First locate JavaWordCount.java in the examples module; part of its code looks like this:
if (args.length < 1) {
    System.err.println("Usage: JavaWordCount <file>");
    System.exit(1);
}

SparkSession spark = SparkSession
    .builder()
    .appName("JavaWordCount")
    .getOrCreate();

JavaRDD<String> lines = spark.read().textFile(args[0]).javaRDD();
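Two things need to be configured here: pass the path of the input file as the first program argument (args[0]), and set the master to local on the SparkSession builder: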
SparkSession spark = SparkSession
    .builder()
    .appName("JavaWordCount")
    .master("local")
    .getOrCreate();
With that configured, run JavaWordCount. The output is as follows:
MLlib: 1
["Building: 1
contributing: 1
shell:: 2
instance:: 1
Scala,: 1
and: 9
command,: 2
package.): 1
./dev/run-tests: 1
sample: 1
19/07/22 14:15:20 INFO AbstractConnector: Stopped Spark@61f05988{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}
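For readers who want to see where lines such as "shell:: 2" come from, the counting logic can be approximated without Spark. The following is a plain-Java sketch of the same idea (split each line on spaces, count each token, print "token: count"), not the code of the Spark example; PlainWordCount and count are hypothetical names, and the sorted TreeMap is a choice made here for stable output.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.stream.Collectors;

/** Stdlib-only approximation of a word count; not the Spark example itself. */
final class PlainWordCount {

    /** Splits each line on single spaces and counts how often each token occurs. */
    static Map<String, Long> count(String... lines) {
        return Arrays.stream(lines)
            .flatMap(line -> Arrays.stream(line.split(" ")))
            // TreeMap keeps the tokens sorted; punctuation stays attached to the token.
            .collect(Collectors.groupingBy(Function.identity(), TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        Map<String, Long> counts = count("run the shell: now", "the Scala shell: again");
        counts.forEach((word, n) -> System.out.println(word + ": " + n));
    }
}
```

Because tokens are split on spaces only, a token like "shell:" keeps its colon, which is why the output above contains entries with a doubled colon.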
With that, the first example runs, and I can now happily read and debug the source.