Compiling the Spark Source Code and Solving Build Problems


     Compiling from source may feel a bit like self-torture, but it helps you understand the internals and lays the groundwork for going deeper and for solving configuration problems later; without it you may be helpless when something breaks. This article walks through the Spark build process [originally from http://www.iteblog.com/archives/1038]. Keep in mind that open-source software evolves quickly: Spark has since reached 1.5 and Hadoop 2.6, so adjust the steps below to your actual versions.

      Spark has now been updated to 1.0.0, and the post "Spark 1.0.0 officially released on May 30" on this blog already covered its new features; Spark 1.0.0 brings many welcome improvements. This article shows how to build the Spark 1.0.0 source with Maven. The main steps are as follows:

1. Download the source code from the Spark website

# wget http://d3kbcqa49mib13.cloudfront.net/spark-1.0.0.tgz
# tar -zxf spark-1.0.0.tgz
2. Set the MAVEN_OPTS parameter

  Maven needs a lot of memory to compile Spark; otherwise you will see errors like the following:

Exception in thread "main" java.lang.OutOfMemoryError: PermGen space
    at org.apache.maven.cli.MavenCli.execute(MavenCli.java:545)
    at org.apache.maven.cli.MavenCli.doMain(MavenCli.java:196)
    at org.apache.maven.cli.MavenCli.main(MavenCli.java:141)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced(Launcher.java:290)
    at org.codehaus.plexus.classworlds.launcher.Launcher.launch(Launcher.java:230)
    at org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode(Launcher.java:409)
    at org.codehaus.plexus.classworlds.launcher.Launcher.main(Launcher.java:352)

The fix is:

export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m"
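Note that PermGen (and the MaxPermSize flag) only exists on JDK 7 and earlier; on JDK 8+ the JVM prints a warning and ignores it. Since Maven reads MAVEN_OPTS from the environment of the shell that launches it, a quick sanity check before running mvn looks like this (a minimal sketch):

```shell
# Set the memory options Maven needs for the Spark build
export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m"

# Confirm the variable is actually exported in the current shell,
# so the mvn child process will inherit it
echo "$MAVEN_OPTS" | grep -q "MaxPermSize" && echo "MAVEN_OPTS is set"
```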
3. Cannot run program "javac": java.io.IOException

  If the following error appears during compilation, set up your Java path.

[ERROR] Failed to execute goal net.alchim31.maven:scala-maven-plugin:3.1.6:
compile (scala-compile-first) on project spark-core_2.10: wrap:
java.io.IOException: Cannot run program "javac": java.io.IOException:
 error=2, No such file or directory -> [Help 1]
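The error means the scala-maven-plugin cannot find the javac binary, so you need a JDK (not a plain JRE, which has no javac) with its bin directory on the PATH. The install path below is a hypothetical example; adjust it to where your JDK actually lives:

```shell
# Hypothetical JDK install location -- adjust to your machine
export JAVA_HOME=/usr/lib/jvm/java-1.7.0
export PATH="$JAVA_HOME/bin:$PATH"

# The build only needs javac to be resolvable from the PATH
echo "$PATH" | grep -q "$JAVA_HOME/bin" && echo "PATH includes JAVA_HOME/bin"
```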
4. Please set the SCALA_HOME

  This error clearly means SCALA_HOME is not set; download a Scala release and set the variable.

[ERROR] Failed to execute goal org.apache.maven.plugins:
maven-antrun-plugin:1.7:run (default) on project spark-core_2.10:
 An Ant BuildException has occured: Please set the SCALA_HOME
 (or SCALA_LIBRARY_PATH if scala is on the path) environment
variables and retry.
[ERROR] around Ant part ...... @ 6:126 in spark-1.0.0/core/target/antrun/build-main.xml
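Spark 1.0.0 builds against Scala 2.10, so a 2.10.x release is the one to install. The download URL and install directory below are examples, not requirements of the build:

```shell
# Download and unpack Scala 2.10.4 (reference commands, shown commented out):
#   wget http://www.scala-lang.org/files/archive/scala-2.10.4.tgz
#   tar -zxf scala-2.10.4.tgz -C /usr/local

# Hypothetical install directory -- adjust to your machine
export SCALA_HOME=/usr/local/scala-2.10.4
export PATH="$SCALA_HOME/bin:$PATH"
echo "SCALA_HOME=$SCALA_HOME"
```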
5. Choose the matching Hadoop and YARN versions

  Different versions of HDFS are not protocol-compatible, so if you want your Spark build to read data from HDFS, you must compile Spark against the matching HDFS version. This is selected at build time via hadoop.version; by default, Spark builds against Hadoop 1.0.4.

Hadoop version    Profile required
0.23.x            hadoop-0.23
1.x to 2.1.x      (none)
2.2.x             hadoop-2.2
2.3.x             hadoop-2.3
2.4.x             hadoop-2.4

  (1) Apache Hadoop 1.x and Cloudera CDH's MR1 distributions have no YARN, so we can build Spark with commands like these:

# Apache Hadoop 1.2.1
mvn -Dhadoop.version=1.2.1 -DskipTests clean package

# Cloudera CDH 4.2.0 with MapReduce v1
mvn -Dhadoop.version=2.0.0-mr1-cdh4.2.0 -DskipTests clean package

# Apache Hadoop 0.23.x
mvn -Phadoop-0.23 -Dhadoop.version=0.23.7 -DskipTests clean package

  (2) Apache Hadoop 2.x, 0.23.x, Cloudera CDH, and some other Hadoop distributions ship with YARN, so you can enable the "yarn-alpha" or "yarn" profile and set the YARN version via yarn.version. The choices are:

YARN version       Profile required
0.23.x to 2.1.x    yarn-alpha
2.2.x and later    yarn

  We can then build Spark with the following commands:

# Apache Hadoop 2.0.5-alpha
mvn -Pyarn-alpha -Dhadoop.version=2.0.5-alpha -DskipTests clean package

# Cloudera CDH 4.2.0
mvn -Pyarn-alpha -Dhadoop.version=2.0.0-cdh4.2.0 -DskipTests clean package

# Apache Hadoop 0.23.x
mvn -Pyarn-alpha -Phadoop-0.23 -Dhadoop.version=0.23.7 -DskipTests clean package

# Apache Hadoop 2.2.X
mvn -Pyarn -Phadoop-2.2 -Dhadoop.version=2.2.0 -DskipTests clean package

# Apache Hadoop 2.3.X
mvn -Pyarn -Phadoop-2.3 -Dhadoop.version=2.3.0 -DskipTests clean package

# Apache Hadoop 2.4.X
mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean package

# Different versions of HDFS and YARN.
mvn -Pyarn-alpha -Phadoop-2.3 -Dhadoop.version=2.3.0 -Dyarn.version=0.23.7 \
    -DskipTests clean package

  A few closing notes. (1) You can also build Spark with sbt; the post "Spark 0.9.1 源码编译" on this blog covers that in detail.
  (2) Building Spark yourself teaches you a lot, but you can just as well download a prebuilt Spark; that choice is entirely yours.
  (3) The original article is 《用Maven编译Spark 1.0.0源码以错误解决》: http://www.iteblog.com/archives/1038
  (4) The top level of the downloaded Spark source tree contains a make-distribution.sh script that packages a Spark release tarball. It simply calls Maven to do the compilation, and can be run like this:

./make-distribution.sh --tgz -Phadoop-2.2 -Pyarn -DskipTests -Dhadoop.version=2.2.0
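Since make-distribution.sh forwards these profiles and properties straight to Maven before assembling the tarball, the build it triggers is roughly equivalent to the mvn invocation printed below (a sketch assembled from the flags above, not the script's exact command line):

```shell
# The profile/property flags make-distribution.sh passes through to Maven
MVN_FLAGS="-Phadoop-2.2 -Pyarn -DskipTests -Dhadoop.version=2.2.0"

# Dry-run style: print the equivalent Maven invocation instead of running it
echo "mvn $MVN_FLAGS clean package"
```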


If you see output like the following, then congratulations, the build succeeded!

[WARNING] See http://docs.codehaus.org/display/MAVENUSER/Shade+Plugin
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary:
[INFO]
[INFO] Spark Project Parent POM .......................... SUCCESS [2.172s]
[INFO] Spark Project Core ................................ SUCCESS [3:14.405s]
[INFO] Spark Project Bagel ............................... SUCCESS [22.606s]
[INFO] Spark Project GraphX .............................. SUCCESS [56.679s]
[INFO] Spark Project Streaming ........................... SUCCESS [1:14.616s]
[INFO] Spark Project ML Library .......................... SUCCESS [1:31.366s]
[INFO] Spark Project Tools ............................... SUCCESS [15.484s]
[INFO] Spark Project Catalyst ............................ SUCCESS [1:13.788s]
[INFO] Spark Project SQL ................................. SUCCESS [1:22.578s]
[INFO] Spark Project Hive ................................ SUCCESS [1:10.762s]
[INFO] Spark Project REPL ................................ SUCCESS [36.957s]
[INFO] Spark Project YARN Parent POM ..................... SUCCESS [2.290s]
[INFO] Spark Project YARN Stable API ..................... SUCCESS [38.067s]
[INFO] Spark Project Assembly ............................ SUCCESS [23.663s]
[INFO] Spark Project External Twitter .................... SUCCESS [19.490s]
[INFO] Spark Project External Kafka ...................... SUCCESS [24.782s]
[INFO] Spark Project External Flume Sink ................. SUCCESS [24.539s]
[INFO] Spark Project External Flume ...................... SUCCESS [27.308s]
[INFO] Spark Project External ZeroMQ ..................... SUCCESS [21.148s]
[INFO] Spark Project External MQTT ....................... SUCCESS [2:00.741s]
[INFO] Spark Project Examples ............................ SUCCESS [54.435s]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 17:58.481s
[INFO] Finished at: Tue Sep 16 19:20:10 CST 2014
[INFO] Final Memory: 76M/1509M
[INFO] ------------------------------------------------------------------------



Reposted from: https://my.oschina.net/u/2306127/blog/546531
