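1. Download the Spark 1.4.1 source package and extract it
Download it from the Spark website: http://spark.apache.org/downloads.html
or from my Baidu Cloud share: link: http://pan.baidu.com/s/17I67O password: 3cwf
I extracted it to /home/hadoop/softwares/: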
tar -zxvf spark-1.4.1.tgz -C /home/hadoop/softwares/
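2. Install Maven
For details, see my Maven installation post.
3. Install Scala 2.10.4
Download: link: http://pan.baidu.com/s/1sj46kn7 password: fv4z
Installation is just a matter of extracting the archive and setting the environment variables: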
export SCALA_HOME=/home/hadoop/softwares/scala-2.10.4
export PATH=${PATH}:$SCALA_HOME/bin
Once Scala is set up on Linux, run scala -version; if it prints the Scala version information, the installation works.
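Before moving on, it can help to confirm that the tools are actually visible on PATH. A minimal sketch, assuming a POSIX shell; the function name missing_tools is made up for illustration and is not part of Spark, Scala or Maven:

```shell
# Print every tool from the argument list that is not found on PATH.
# Returns 0 when all tools are present, 1 otherwise.
missing_tools() {
    rc=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool"
            rc=1
        fi
    done
    return "$rc"
}

# Example: check the three tools this build relies on.
missing_tools scala mvn java || echo "fix PATH before building"
```

This only checks that a command with that name can be found; it does not check which version is installed.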
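4. Install Oracle JDK 7
Although I managed to build with OpenJDK 1.7, for now I still recommend Oracle JDK 7. For downloading and installing JDK 1.7, see my Java configuration post.
Note: unlike many guides online, I did not uninstall OpenJDK. I only put the Oracle JDK ahead of the existing entries in $PATH (the shell searches PATH from front to back and stops at the first match), as follows: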
export JAVA_HOME=/home/hadoop/softwares/jdk1.7.0_71
export CLASSPATH=.:$JAVA_HOME/jre/lib/rt.jar:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
export PATH=$JAVA_HOME/bin:$PATH
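The front-to-back search can be checked without touching the real JDKs. A small sketch, assuming a POSIX shell with mktemp available; the two directories and the fake java scripts are stand-ins created only for this demonstration:

```shell
demo=$(mktemp -d)
mkdir -p "$demo/openjdk/bin" "$demo/oracle/bin"
printf '#!/bin/sh\necho openjdk\n' > "$demo/openjdk/bin/java"
printf '#!/bin/sh\necho oracle\n' > "$demo/oracle/bin/java"
chmod +x "$demo/openjdk/bin/java" "$demo/oracle/bin/java"

# Resolve java in a subshell so the real PATH is left untouched.
# The first assignment stands in for the system OpenJDK entry, the second
# prepends the Oracle directory the same way the export above does.
first_java=$(
    PATH="$demo/openjdk/bin:$PATH"
    PATH="$demo/oracle/bin:$PATH"
    command -v java
)
echo "$first_java"   # the path ends in oracle/bin/java
rm -rf "$demo"
```

Because the Oracle directory is prepended last, it sits first in PATH and wins the lookup, while the other java stays installed.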
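Note: this post builds with Maven. For other build methods, see 《Spark1.0.0 源码编译和部署包生成》 (Building Spark 1.0.0 from source and generating a deployment package): http://blog.csdn.net/u011414200/article/details/49422941
Change into the Spark 1.4.1 source directory, then build: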
mvn -Dhadoop.version=2.5.0-cdh5.3.2 -Pyarn -Phive -Phive-thriftserver -DskipTests clean package
I wanted the output shown on screen and also saved to a file, so the command I actually used was:
mvn -Dhadoop.version=2.5.0-cdh5.3.2 -Pyarn -Phive -Phive-thriftserver -DskipTests clean package | tee building.txt
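One caveat with a plain pipe into tee: the shell reports tee's exit status rather than the build's, and only stdout reaches the file. A minimal sketch of a wrapper that keeps both streams and the real exit status, assuming a POSIX shell with mktemp; run_logged is a hypothetical helper name:

```shell
# Run a command, copy its stdout and stderr to a log file, and return the
# exit status of the command itself instead of the status of tee.
# Usage: run_logged LOGFILE COMMAND [ARGS...]
run_logged() {
    log=$1
    shift
    rc_file=$(mktemp)
    # The braces run the command and record its status before tee sees EOF.
    { "$@" 2>&1; echo "$?" > "$rc_file"; } | tee "$log"
    rc=$(cat "$rc_file")
    rm -f "$rc_file"
    return "$rc"
}
```

It would be used as run_logged building.txt mvn followed by the same options as above, so that a failed build also fails the surrounding script.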
As an aside, for the CDH version the following build command alone might be enough (I have not verified this):
mvn -Pyarn -Phive -Phive-thriftserver -DskipTests clean package
Make sure the Hadoop version and the Scala version are set to the versions you actually use.
Note: after the build, the generated jars can be found in these directories:
ls /home/hadoop/softwares/spark-1.4.1/network/yarn/target
ls /home/hadoop/softwares/spark-1.4.1/assembly/target/scala-2.10
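I copied the resulting jar to Windows and opened it with an archive tool to look inside. The org folder and the files under it were all there, which shows that the build succeeded.
Next, the make-distribution.sh script in the source directory can be used to build a binary distribution package:
Note: on running this command I felt rather foolish. There seems to be no need to run mvn first, because ./make-distribution.sh runs the Maven build itself.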
./make-distribution.sh --name custom-spark --skip-java-test --tgz -Pyarn -Dhadoop.version=2.5.0-cdh5.3.2 -Dscala-2.10.4 -Phive -Phive-thriftserver
In the command above, "--name custom-spark" is debatable; the name should apparently reflect the Hadoop version.
The command I actually used:
./make-distribution.sh --name cdh5.3.2 --skip-java-test --tgz -Pyarn -Dhadoop.version=2.5.0-cdh5.3.2 -Dscala-2.10.4 -Phive -Phive-thriftserver | tee building_distribution.txt
At the end it asked for confirmation (Y/N). I cautiously chose Y, and the long build began...
After working through various problems, the build finally succeeded.
The build generated spark-1.4.1-bin-cdh5.3.2.tgz in the source directory, 322 MB in size (postscript: initial testing shows that it works). That concludes the build itself.
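The package can be given a quick sanity check without unpacking it. A minimal sketch, assuming a tar with gzip support; tgz_has_entry is a hypothetical helper name, and bin/spark-submit is used only as an example of a file one would expect in the package:

```shell
# Succeed if the gzip-compressed tarball lists an entry whose path
# contains the given string.
# Usage: tgz_has_entry ARCHIVE STRING
tgz_has_entry() {
    tar -tzf "$1" | grep -F -- "$2" >/dev/null
}

# Example:
# tgz_has_entry spark-1.4.1-bin-cdh5.3.2.tgz bin/spark-submit && echo "looks complete"
```

This only inspects the file listing; it does not prove that the binaries run on the cluster.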
Q1: warning: [options] bootstrap class path not set in conjunction with -source 1.6
Cause:
This is not Ant but the JDK’s javac emitting the warning.
If you use Java 7’s javac and -source for anything smaller than 7 javac warns you you should also set the bootstrap classpath to point to an older rt.jar - because this is the only way to ensure the result is usable on an older VM.
https://blogs.oracle.com/darcy/entry/bootclasspath_older_source
This is only a warning, so you could ignore it and even suppress it with
<compilerarg value="-Xlint:-options"/>
Alternatively you really install an older JVM and adapt your bootclasspath accordingly (you need to include rt.jar, not the bin folder)
Source: Ant javac task errs out: [javac] warning: [options] bootstrap class path not set in conjunction with -source 1.6
Fix: just ignore it.
Q2: The build aborts with a failure (compile failed. CompileFailed)
Failed to execute goal net.alchim31.maven:scala-maven-plugin:3.2.0:compile (scala-compile-first) on project spark-sql_2.10: Execution scala-compile-first of goal net.alchim31.maven:scala-maven-plugin:3.2.0:compile failed. CompileFailed -> [Help 1]
Failed to execute goal net.alchim31.maven:scala-maven-plugin:3.2.0:testCompile (scala-test-compile-first) on project spark-sql_2.10: Execution scala-test-compile-first of goal net.alchim31.maven:scala-maven-plugin:3.2.0:testCompile failed. CompileFailed -> [Help 1]
Failed to execute goal net.alchim31.maven:scala-maven-plugin:3.2.0:compile (scala-compile-first) on project spark-core_2.10: Execution scala-compile-first of goal net.alchim31.maven:scala-maven-plugin:3.2.0:compile failed. CompileFailed -> [Help 1]
[WARNING] The requested profile “hive-” could not be activated because it does not exist.
[ERROR] Failed to execute goal net.alchim31.maven:scala-maven-plugin:3.2.0:compile (scala-compile-first) on project spark-mllib_2.10: Execution scala-compile-first of goal net.alchim31.maven:scala-maven-plugin:3.2.0:compile failed. CompileFailed -> [Help 1]
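Cause:
Fix: resume the build from the module that failed by adding Maven's -rf (resume from) option to the build command: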
mvn -rf :spark-sql_2.10
// resume from the module that failed (for example spark-sql_2.10)
./make-distribution.sh --name cdh5.3.2 --skip-java-test --tgz -Pyarn -Dhadoop.version=2.5.0-cdh5.3.2 -Dscala-2.10.4 -Phive -Phive-thriftserver -rf :spark-sql_2.10
<dependency>
<groupId>net.alchim31.maven</groupId>
<artifactId>scala-maven-plugin</artifactId>
<version>3.2.0</version>
</dependency>
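When the build output has been saved to a file, the name of the module to resume from can be pulled out of the log instead of being copied by hand. A minimal sketch, assuming the log contains Maven's "Failed to execute goal ... on project NAME:" line as in the errors quoted above; failed_module is a hypothetical helper name:

```shell
# Print the first module named in a "Failed to execute goal ... on project"
# line of a Maven build log. Prints nothing if no such line exists.
# Usage: failed_module LOGFILE
failed_module() {
    sed -n 's/.*Failed to execute goal .* on project \([^:]*\):.*/\1/p' "$1" | head -n 1
}

# Example: build the -rf argument from the saved log.
# echo "-rf :$(failed_module building_distribution.txt)"
```

The printed value still has to be appended to the original build command; the helper does not rerun anything by itself.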
Q3: MissingRequirementError in spark-repl_2.10
[ERROR] error while loading <root>, error in opening zip file
[ERROR] Failed to execute goal net.alchim31.maven:scala-maven-plugin:3.2.0:compile (scala-compile-first) on project spark-repl_2.10: wrap: scala.reflect.internal.MissingRequirementError: object scala.runtime in compiler mirror not found. -> [Help 1]
org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute goal net.alchim31.maven:scala-maven-plugin:3.2.0:compile (scala-compile-first) on project spark-repl_2.10: Execution scala-compile-first of goal net.alchim31.maven:scala-maven-plugin:3.2.0:compile failed.
Possible causes found via Google:
Answer 1
This error is actually an error from scalac, not a compile error from the code. It sort of sounds like it has not been able to download scala dependencies. Check or maybe recreate your environment.
Answer 2
This error is very misleading, it actually has nothing to do with scala.runtime or the compiler mirror: this is the error you get when you have a faulty JAR file on your classpath.
Sadly, there is no way from the error (even with -Ydebug) to tell exactly which file. You can run scala with -Ylog-classpath, it will output a lot of classpath stuff, including the exact classpath used (look for “[init] [search path for class files:”). Then I guess you will have to go through them to check if they are valid or not.
I recently tried to improve that (SI-5463), at least to get a clear error message, but couldn’t find a satisfyingly clean way to do this…
Answer 3
I have checked to ensure that in my class path that ALL jars from SCALA_HOME/lib/ are included
As we figured out at #scala, the documentation was missing the fact that one needs to provide the -Dscala.usejavacp=true argument to the JVM command that invokes scalac. After that everything worked fine, and I updated the docs: http://docs.scala-lang.org/overviews/macros/overview.html#debugging_macros.
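Following the second answer, one way to go through the jars is to run a zip integrity test on each of them. A minimal sketch, assuming unzip is installed; find_bad_jars is a hypothetical helper name, and pointing it at $HOME/.m2/repository assumes Maven's default local repository location:

```shell
# Print every .jar under the given directory that fails a zip integrity test.
# Returns 2 without scanning if unzip is not available.
# Usage: find_bad_jars DIRECTORY
find_bad_jars() {
    if ! command -v unzip >/dev/null 2>&1; then
        echo "unzip not found" >&2
        return 2
    fi
    find "$1" -type f -name '*.jar' | while IFS= read -r jar; do
        if ! unzip -tq "$jar" >/dev/null 2>&1; then
            echo "$jar"
        fi
    done
}

# Example:
# find_bad_jars "$HOME/.m2/repository"
```

A jar reported here is typically a truncated download; deleting it so that Maven fetches it again on the next run is a common remedy, though whether it resolves this particular error is not something the original post confirms.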
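Q4: Other potential problems
Spark (1.4.1) and Hadoop (2.5.0) may use different Protocol Buffers versions, which can prevent HDFS files from being read correctly, so pom.xml needs to be modified accordingly: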
<protobuf.version>2.5.0</protobuf.version>
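To see which value a POM currently declares before editing it, the property can be read straight from the file. A minimal sketch, assuming the property sits on a single line as shown above; pom_property is a hypothetical helper name, and the approach is plain text matching rather than real XML parsing:

```shell
# Print the value of the first single-line <NAME>value</NAME> element
# found in a POM file. Prints nothing if the element is absent.
# Usage: pom_property POM_FILE NAME
pom_property() {
    sed -n "s|.*<$2>\(.*\)</$2>.*|\1|p" "$1" | head -n 1
}

# Example:
# pom_property pom.xml protobuf.version
```

Since a POM can set the same property again inside a profile, the first match is not necessarily the value that applies to a given build.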
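《Spark1.0.0 源码编译和部署包生成》 (Building Spark 1.0.0 from source and generating a deployment package): http://blog.csdn.net/u011414200/article/details/49422941
《spark1.4.0基于yarn的安装心得体会》 (Notes on installing Spark 1.4.0 on YARN): http://blog.csdn.net/xiao_jun_0820/article/details/46561097
Production currently runs the Spark 1.2.0 bundled with cdh5.3.2, which has quite a few bugs, especially in Spark SQL. Spark 1.4.0 adds SparkR support, and colleagues who use R are eager to try it out, so the plan is to upgrade Spark.
《CDH5.1.0编译spark-assembly包来支持hive》 (Building the spark-assembly package on CDH 5.1.0 with Hive support): http://blog.csdn.net/aaa1117a8w5s6d/article/details/44307207
Adds a private repository address to the Maven configuration file apache-maven-3.2.5/conf/settings.xml and provides test code.
《用Maven编译Spark 1.0.0源码以错误解决》 (Building the Spark 1.0.0 source with Maven, with fixes for errors): http://www.iteblog.com/archives/1038
《Open JDK导致的Maven Build出错》 (Maven build errors caused by OpenJDK): http://blog.csdn.net/mydeman/article/details/6234750
Using OpenJDK is not recommended.
《用SBT编译Spark的WordCount程序》 (Building a Spark WordCount program with SBT): http://www.aboutyun.com/thread-8587-1-1.html
Not tried yet; saved for when there is time...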