Building Spark from Source and Setting Up a Local Debugging Environment


    • Prerequisites
    • Building the Spark source
      • Building Spark-1.2.3 from source
      • Building Spark-2.2.4 from source
    • Setting up the source-reading environment

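I recently started reading 《深入理解Spark·核心思想与源码分析》. The book was published in 2016 and is based on Spark 1.2. At first the Spark version seemed rather dated, but after three chapters I found the author's reasoning clear and well worth learning from.
To read and debug the source comfortably on a local machine, the project first has to build cleanly in a local IDE. Below is how I compiled the Spark source and set up a local debugging environment while following the book. The author uses Eclipse as the IDE and SBT as the build tool; I use IntelliJ IDEA and Maven.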

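Prerequisites

The prerequisites below are straightforward, so I will not describe them in detail. My local environment is JDK-1.8, Scala-2.12.4, Maven-3.3.9, and Git-2.17.1.

  • Install the JDK and configure its environment variables
  • Install Scala and configure its environment variables
  • Install Maven and configure its environment variables
  • Install a Git client and configure SSH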

Building the Spark source

Building Spark-1.2.3 from source

First, clone the source into a local repository. This step can take a long time.

git clone git@github.com:apache/spark.git -b branch-1.2

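Once the clone finishes, start the build. The Maven option -T 1C turns on Maven's parallel build support with one thread per CPU core, which speeds up compilation. The -e option prints detailed error information when an exception occurs.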

mvn -T 1C -DskipTests -e clean install
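To make the meaning of the -T value concrete, here is a minimal sketch of how a value such as 1C or 4 translates into a thread count. It is an illustration only, not Maven's actual code; the class name MavenThreadCount and the method threadsFor are hypothetical.

```java
// Illustrative sketch only: shows how a -T value such as "1C" or "4" maps to a thread count.
class MavenThreadCount {
    /** Returns the thread count implied by a -T argument for a machine with the given core count. */
    static int threadsFor(String tArg, int cores) {
        if (tArg.endsWith("C")) {
            // "1C" means one thread per core, "1.5C" means 1.5 threads per core, and so on.
            float perCore = Float.parseFloat(tArg.substring(0, tArg.length() - 1));
            return (int) (perCore * cores);
        }
        // A plain number is taken as an absolute thread count.
        return Integer.parseInt(tArg);
    }

    public static void main(String[] args) {
        int cores = Runtime.getRuntime().availableProcessors();
        System.out.println("-T 1C on this machine -> " + threadsFor("1C", cores) + " threads");
        System.out.println("-T 4  on this machine -> " + threadsFor("4", cores) + " threads");
    }
}
```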

The first build failed with the following error:

[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary:
[INFO]
[INFO] Spark Project Parent POM ........................... SUCCESS [  2.824 s]
[INFO] Spark Project Networking ........................... SUCCESS [ 13.824 s]
[INFO] Spark Project Shuffle Streaming Service ............ SUCCESS [ 10.192 s]
[INFO] Spark Project Core ................................. SUCCESS [02:38 min]
[INFO] Spark Project Bagel ................................ SUCCESS [ 27.409 s]
[INFO] Spark Project GraphX ............................... SUCCESS [01:06 min]
[INFO] Spark Project Streaming ............................ SUCCESS [01:40 min]
[INFO] Spark Project Catalyst ............................. SKIPPED
[INFO] Spark Project SQL .................................. SKIPPED
[INFO] Spark Project ML Library ........................... SKIPPED
[INFO] Spark Project Tools ................................ SKIPPED
[INFO] Spark Project Hive ................................. SKIPPED
[INFO] Spark Project REPL ................................. SKIPPED
[INFO] Spark Project Assembly ............................. SKIPPED
[INFO] Spark Project External Twitter ..................... SKIPPED
[INFO] Spark Project External Flume Sink .................. SUCCESS [ 18.871 s]
[INFO] Spark Project External Flume ....................... FAILURE [ 10.060 s]
[INFO] Spark Project External MQTT ........................ SKIPPED
[INFO] Spark Project External ZeroMQ ...................... SKIPPED
[INFO] Spark Project External Kafka ....................... SKIPPED
[INFO] Spark Project Examples ............................. SKIPPED
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------

[ERROR] Failed to execute goal net.alchim31.maven:scala-maven-plugin:3.2.0:compile (scala-compile-first) on project spark-streaming-flume_2.10: Execution scala-compile-first of goal net.alchim31.maven:scala-maven-plugin:3.2.0:compile failed. CompileFailed -> [Help 1]
org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute goal net.alchim31.maven:scala-maven-plugin:3.2.0:compile (scala-compile-first) on project spark-streaming-flume_2.10: Execution scala-compile-first of goal net.alchim31.maven:scala-maven-plugin:3.2.0:compile failed.
...
[ERROR] After correcting the problems, you can resume the build with the command
[ERROR]   mvn  -rf :spark-streaming-flume_2.10

After some investigation, I traced the error to a dependency missing from the project:

<dependency>
    <groupId>net.alchim31.maven</groupId>
    <artifactId>scala-maven-plugin</artifactId>
    <version>3.2.0</version>
</dependency>

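After adding this dependency to the project's root pom.xml, I ran the build again. This is where I hit a pitfall. According to the last line of Maven's error output, once the problem is fixed the build can be resumed from its current state with: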

mvn -rf :spark-streaming-flume_2.10

So I followed the hint and entered the build command again as follows:

mvn -T 1C -DskipTests -e clean install -rf :spark-streaming-flume_2.10

The build then failed again, this time with the following error:

[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary:
[INFO]
[INFO] Spark Project External Flume ....................... SUCCESS [ 25.568 s]
[INFO] Spark Project External MQTT ........................ SUCCESS [03:45 min]
[INFO] Spark Project External ZeroMQ ...................... SUCCESS [ 48.209 s]
[INFO] Spark Project External Kafka ....................... SUCCESS [01:49 min]
[INFO] Spark Project Examples ............................. FAILURE [06:12 min]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 09:58 min (Wall Clock)
···
[ERROR] Failed to execute goal on project spark-examples_2.10: Could not resolve dependencies for project org.apache.spark:spark-examples_2.10:jar:1.2.3-SNAPSHOT: The following artifacts could not be resolved: org.apache.spark:spark-mllib_2.10:jar:1.2.3-SNAPSHOT, org.apache.spark:spark-hive_2.10:jar:1.2.3-SNAPSHOT, org.apache.spark:spark-streaming-twitter_2.10:jar:1.2.3-SNAPSHOT: Could not find artifact org.apache.spark:spark-mllib_2.10:jar:1.2.3-SNAPSHOT in apache.snapshots (http://repository.apache.org/snapshots) -> [Help 1]
···

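The output shows that when the build is resumed with this command, Maven simply continues from the failed module onward. It does not rebuild the modules that were skipped earlier, such as Spark Project ML Library, even though later modules depend on them (see the figure below):
[Figure 1]
So at this point all I could do was clean and rebuild from scratch: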

mvn -T 1C -DskipTests -e clean install
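The resume behaviour described above can be modelled with a small toy simulation. This is only a sketch of the idea, not Maven's implementation: the class ReactorResumeSketch and its methods are hypothetical, the module list is abbreviated from the build log, and the only dependencies modelled are the three named in the error message.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy model of "mvn -rf <module>": not Maven's code, just the behaviour seen in the log.
class ReactorResumeSketch {
    // Reactor order, abbreviated from the Reactor Summary in the build log.
    static final List<String> REACTOR = Arrays.asList(
        "core", "streaming", "catalyst", "sql", "mllib", "hive",
        "twitter", "flume-sink", "flume", "mqtt", "zeromq", "kafka", "examples");

    /** Returns the dependencies that cannot be resolved when resuming from the given module. */
    static List<String> missingAfterResume(Set<String> installed, String resumeFrom,
                                           Map<String, List<String>> deps) {
        // Resuming drops every module listed before resumeFrom, built or skipped.
        List<String> toBuild = REACTOR.subList(REACTOR.indexOf(resumeFrom), REACTOR.size());
        Set<String> available = new HashSet<>(installed);
        List<String> missing = new ArrayList<>();
        for (String module : toBuild) {
            boolean resolvable = true;
            for (String dep : deps.getOrDefault(module, Collections.emptyList())) {
                if (!available.contains(dep)) {
                    missing.add(dep);
                    resolvable = false;
                }
            }
            if (resolvable) {
                available.add(module); // the module builds and is installed
            }
        }
        return missing;
    }

    /** Replays the situation from the log: only a few modules were installed before the failure. */
    static List<String> demo() {
        Set<String> installed = new HashSet<>(Arrays.asList("core", "streaming", "flume-sink"));
        Map<String, List<String>> deps = new HashMap<>();
        // Dependencies limited to the artifacts named in the error message.
        deps.put("examples", Arrays.asList("mllib", "hive", "twitter"));
        return missingAfterResume(installed, "flume", deps);
    }

    public static void main(String[] args) {
        System.out.println("Missing after resume: " + demo());
        // prints: Missing after resume: [mllib, hive, twitter]
    }
}
```

Maven also offers the -pl option combined with -am, which builds a module together with the modules it depends on. That may be an alternative to a full rebuild, though it was not tried here.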

After all that, the full rebuild went smoothly and took about 14 minutes in total:

[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary:
[INFO]
[INFO] Spark Project Parent POM ........................... SUCCESS [  3.703 s]
[INFO] Spark Project Networking ........................... SUCCESS [ 16.933 s]
[INFO] Spark Project Shuffle Streaming Service ............ SUCCESS [ 11.269 s]
[INFO] Spark Project Core ................................. SUCCESS [02:44 min]
[INFO] Spark Project Bagel ................................ SUCCESS [ 28.633 s]
[INFO] Spark Project GraphX ............................... SUCCESS [01:12 min]
[INFO] Spark Project Streaming ............................ SUCCESS [01:41 min]
[INFO] Spark Project Catalyst ............................. SUCCESS [02:09 min]
[INFO] Spark Project SQL .................................. SUCCESS [01:38 min]
[INFO] Spark Project ML Library ........................... SUCCESS [05:52 min]
[INFO] Spark Project Tools ................................ SUCCESS [ 32.681 s]
[INFO] Spark Project Hive ................................. SUCCESS [05:33 min]
[INFO] Spark Project REPL ................................. SUCCESS [ 46.894 s]
[INFO] Spark Project Assembly ............................. SUCCESS [ 27.147 s]
[INFO] Spark Project External Twitter ..................... SUCCESS [ 57.706 s]
[INFO] Spark Project External Flume Sink .................. SUCCESS [ 20.473 s]
[INFO] Spark Project External Flume ....................... SUCCESS [01:09 min]
[INFO] Spark Project External MQTT ........................ SUCCESS [ 57.697 s]
[INFO] Spark Project External ZeroMQ ...................... SUCCESS [ 58.706 s]
[INFO] Spark Project External Kafka ....................... SUCCESS [01:12 min]
[INFO] Spark Project Examples ............................. SUCCESS [01:08 min]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 14:10 min (Wall Clock)
···

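Building Spark-2.2.4 from source

Because the 1.2 source is fairly old, I will use the newer 2.2 version for later study and read it side by side with the book. Here I simply switch to the 2.2 branch and build it: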

git checkout branch-2.2

After the checkout succeeds, start the build with the same command:

mvn -T 1C -DskipTests -e clean install

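CPU utilization during the build is shown in the figure below:
[Figure 2: CPU utilization during the build]
The whole build took about 20 minutes and finished without errors. The build summary is as follows: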

[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary:
[INFO]
[INFO] Spark Project Parent POM ........................... SUCCESS [ 23.283 s]
[INFO] Spark Project Tags ................................. SUCCESS [ 12.210 s]
[INFO] Spark Project Sketch ............................... SUCCESS [ 21.720 s]
[INFO] Spark Project Networking ........................... SUCCESS [ 44.170 s]
[INFO] Spark Project Shuffle Streaming Service ............ SUCCESS [ 16.677 s]
[INFO] Spark Project Unsafe ............................... SUCCESS [ 37.268 s]
[INFO] Spark Project Launcher ............................. SUCCESS [ 50.450 s]
[INFO] Spark Project Core ................................. SUCCESS [05:35 min]
[INFO] Spark Project ML Local Library ..................... SUCCESS [01:00 min]
[INFO] Spark Project GraphX ............................... SUCCESS [01:00 min]
[INFO] Spark Project Streaming ............................ SUCCESS [01:53 min]
[INFO] Spark Project Catalyst ............................. SUCCESS [03:52 min]
[INFO] Spark Project SQL .................................. SUCCESS [04:55 min]
[INFO] Spark Project ML Library ........................... SUCCESS [03:36 min]
[INFO] Spark Project Tools ................................ SUCCESS [ 13.582 s]
[INFO] Spark Project Hive ................................. SUCCESS [02:49 min]
[INFO] Spark Project REPL ................................. SUCCESS [ 23.556 s]
[INFO] Spark Project Assembly ............................. SUCCESS [  3.623 s]
[INFO] Spark Project External Flume Sink .................. SUCCESS [ 44.048 s]
[INFO] Spark Project External Flume ....................... SUCCESS [ 42.474 s]
[INFO] Spark Project External Flume Assembly .............. SUCCESS [  2.933 s]
[INFO] Spark Integration for Kafka 0.8 .................... SUCCESS [01:09 min]
[INFO] Kafka 0.10 Source for Structured Streaming ......... SUCCESS [ 50.854 s]
[INFO] Spark Project Examples ............................. SUCCESS [ 55.074 s]
[INFO] Spark Project External Kafka Assembly .............. SUCCESS [  3.453 s]
[INFO] Spark Integration for Kafka 0.10 ................... SUCCESS [ 59.934 s]
[INFO] Spark Integration for Kafka 0.10 Assembly .......... SUCCESS [  8.501 s]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 20:32 min (Wall Clock)
···

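Setting up the source-reading environment

With the build complete, run a simple test. First open JavaWordCount.java in the examples module. Part of the code looks like this: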

if (args.length < 1) {
  System.err.println("Usage: JavaWordCount <file>");
  System.exit(1);
}

SparkSession spark = SparkSession
  .builder()
  .appName("JavaWordCount")
  .getOrCreate();

JavaRDD<String> lines = spark.read().textFile(args[0]).javaRDD();

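Two things need to be configured here:

  1. The source shows that this example takes one external argument, args[0], which is the name of the file to count words in. I edited the application's run configuration directly and added the argument "E:\WorkSpace\IntelliJ IDEA\apache-spark\spark\README.md", as shown in the figure below.
    [Figure 3: run configuration]
  2. Spark's run mode has to be changed to local mode: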
SparkSession spark = SparkSession
  .builder()
  .appName("JavaWordCount").master("local")
  .getOrCreate();
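If you would rather not hard-code the master, one option is to read it from a JVM system property and fall back to local. The helper below is a hypothetical sketch in plain Java, not part of the Spark example; the class name MasterFromProperty and the method resolveMaster are assumptions for illustration.

```java
// Hypothetical helper: picks the Spark master from -Dspark.master, falling back to a default.
class MasterFromProperty {
    /** Returns the value of the spark.master system property if set, otherwise the fallback. */
    static String resolveMaster(String fallback) {
        return System.getProperty("spark.master", fallback);
    }

    public static void main(String[] args) {
        // In JavaWordCount the result would be passed to the builder, for example:
        //   SparkSession.builder().appName("JavaWordCount")
        //       .master(resolveMaster("local")).getOrCreate();
        System.out.println("master = " + resolveMaster("local"));
    }
}
```

With this in place, adding -Dspark.master=local[2] to the VM options of the run configuration would switch the master without another code change.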

With the configuration done, run JavaWordCount. The output looks like this:

MLlib: 1
["Building: 1
contributing: 1
shell:: 2
instance:: 1
Scala,: 1
and: 9
command,: 2
package.): 1
./dev/run-tests: 1
sample: 1
19/07/22 14:15:20 INFO AbstractConnector: Stopped Spark@61f05988{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}
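Tokens such as "shell::" and "command,:" in the output keep their punctuation, which suggests the example splits each line on single spaces only. The plain-Java sketch below reproduces that counting logic without Spark, under that assumption; the class PlainWordCountSketch is hypothetical and is not the code from the examples module.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

// Plain-Java sketch of word counting that splits on single spaces (assumed, see lead-in).
class PlainWordCountSketch {
    private static final Pattern SPACE = Pattern.compile(" ");

    /** Counts how often each space-separated token occurs across the given lines. */
    static Map<String, Integer> count(List<String> lines) {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : lines) {
            for (String token : SPACE.split(line)) {
                // Punctuation stays attached, so "shell:" and "shell" are different keys.
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = count(Arrays.asList("run the shell:", "the shell: works"));
        // Same "token: count" layout as the Spark example's output.
        counts.forEach((token, n) -> System.out.println(token + ": " + n));
    }
}
```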

With that, the first example runs end to end, and I can now get on with reading and debugging the source.
