Around 8 PM Beijing time on May 30, Spark 1.0.0 was released. See http://spark.apache.org/releases/spark-release-1-0-0.html for details.
The official release is built against Hadoop 2.2 by default, not the latest Hadoop 2.4.0.
As an early adopter, I will try compiling Spark myself with Hadoop 2.4.0 support.
1. Source build environment
Software | Version | URL
CentOS | 6.4 x86_64 | http://mirrors.163.com/centos/6.4/isos/x86_64/CentOS-6.4-x86_64-bin-DVD1.iso
Maven | 3.2.1 | http://mirrors.hust.edu.cn/apache/maven/maven-3/3.2.1/binaries/apache-maven-3.2.1-bin.tar.gz
Java | 1.7 | http://www.oracle.com/technetwork/java/javase/downloads/index.html
Hadoop | 2.4.0 | http://mirror.bit.edu.cn/apache/hadoop/common/hadoop-2.4.0/
Spark | 1.0.0 | http://mirrors.cnnic.cn/apache/spark/spark-1.0.0/spark-1.0.0.tgz
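For reference, a minimal fetch-and-unpack sketch for the Maven and Spark tarballs listed above (the /opt prefix is an assumption, chosen to match the paths used in the next section; mirror URLs may change over time):
# download the Maven binary and Spark source tarballs from the mirrors in the table
wget http://mirrors.hust.edu.cn/apache/maven/maven-3/3.2.1/binaries/apache-maven-3.2.1-bin.tar.gz
wget http://mirrors.cnnic.cn/apache/spark/spark-1.0.0/spark-1.0.0.tgz
# unpack both under /opt (assumed install prefix)
tar -xzf apache-maven-3.2.1-bin.tar.gz -C /opt
tar -xzf spark-1.0.0.tgz -C /opt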
2. Environment setup
Append the following to /etc/profile:
#set java environment
JAVA_HOME=/opt/jdk1.7.0_55
PATH=$JAVA_HOME/bin:$PATH
CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
export JAVA_HOME CLASSPATH PATH
#set hadoop
export HADOOP_HOME=/opt/hadoop-2.4.0
export PATH=$PATH:$HADOOP_HOME/bin
#set MAVEN
export MAVEN_HOME=/opt/apache-maven-3.2.1
export PATH=${PATH}:${MAVEN_HOME}/bin
export MAVEN_CMD=$MAVEN_HOME/bin/mvn
#set SCALA
export SCALA_HOME=/opt/scala-2.10.4
export PATH=$PATH:$SCALA_HOME/bin
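After reloading the profile, a quick sanity check confirms that every tool is on the PATH (a minimal sketch; the expected versions are the ones installed above):
source /etc/profile
java -version     # expect 1.7.x
mvn -version      # expect Apache Maven 3.2.1
hadoop version    # expect Hadoop 2.4.0
scala -version    # expect Scala 2.10.4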
3. Build
Unpack the source and run the following command from the root of the source tree:
./make-distribution.sh --hadoop 2.4.0 --with-yarn --tgz --with-hive
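One caveat before kicking off the build: if Maven aborts with OutOfMemoryError or PermGen space errors, the Spark build documentation recommends raising Maven's memory limits first:
export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m"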
A few important flags:
--hadoop: specifies the Hadoop version.
--with-yarn: YARN support, which is a must.
--with-hive: support for reading Hive data, also a must. Frankly, I dislike Shark; going forward, developers can wrap their own SQL & HQL clients on top of Spark, which is a decent option too.
The script's own usage notes document these flags (an equivalent direct Maven invocation is sketched after them):
# --tgz: Additionally creates spark-$VERSION-bin.tar.gz
# --hadoop VERSION: Builds against specified version of Hadoop.
# --with-yarn: Enables support for Hadoop YARN.
# --with-hive: Enable support for reading Hive tables.
# --name: A moniker for the release target. Defaults to the Hadoop version.
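Under the hood, make-distribution.sh drives a Maven build. Per the Spark 1.0 build documentation, a roughly equivalent direct invocation (a sketch: it builds the jars but does not assemble the distribution tarball) is:
mvn -Pyarn -Phive -Dhadoop.version=2.4.0 -Dyarn.version=2.4.0 -DskipTests clean package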
After a long wait, a tgz archive is generated in the source root directory.
The build succeeded:
[WARNING] See http://docs.codehaus.org/display/MAVENUSER/Shade+Plugin
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary:
[INFO]
[INFO] Spark Project Parent POM .......................... SUCCESS [ 1.505 s]
[INFO] Spark Project Core ................................ SUCCESS [01:52 min]
[INFO] Spark Project Bagel ............................... SUCCESS [ 13.727 s]
[INFO] Spark Project GraphX .............................. SUCCESS [03:42 min]
[INFO] Spark Project ML Library .......................... SUCCESS [06:31 min]
[INFO] Spark Project Streaming ........................... SUCCESS [ 47.049 s]
[INFO] Spark Project Tools ............................... SUCCESS [ 7.437 s]
[INFO] Spark Project Catalyst ............................ SUCCESS [ 35.608 s]
[INFO] Spark Project SQL ................................. SUCCESS [01:08 min]
[INFO] Spark Project Hive ................................ SUCCESS [04:23 min]
[INFO] Spark Project REPL ................................ SUCCESS [ 31.167 s]
[INFO] Spark Project YARN Parent POM ..................... SUCCESS [ 34.463 s]
[INFO] Spark Project YARN Stable API ..................... SUCCESS [ 18.475 s]
[INFO] Spark Project Assembly ............................ SUCCESS [01:06 min]
[INFO] Spark Project External Twitter .................... SUCCESS [ 14.859 s]
[INFO] Spark Project External Kafka ...................... SUCCESS [01:28 min]
[INFO] Spark Project External Flume ...................... SUCCESS [ 19.153 s]
[INFO] Spark Project External ZeroMQ ..................... SUCCESS [ 24.138 s]
[INFO] Spark Project External MQTT ....................... SUCCESS [ 22.316 s]
[INFO] Spark Project Examples ............................ SUCCESS [03:14 min]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 27:58 min
[INFO] Finished at: 2014-06-08T13:07:47+08:00
[INFO] Final Memory: 100M/914M
[INFO] ------------------------------------------------------------------------
Let's take a look:
[root@NameNode spark-1.0.0]# ll
total 183784
drwxrwxr-x 4 1000 1000 4096 Jun 8 13:00 assembly
drwxrwxr-x 4 1000 1000 4096 Jun 8 12:41 bagel
drwxrwxr-x 2 1000 1000 4096 May 26 14:47 bin
-rw-rw-r-- 1 1000 1000 281471 May 26 14:47 CHANGES.txt
drwxrwxr-x 2 1000 1000 4096 May 26 14:47 conf
drwxrwxr-x 4 1000 1000 4096 Jun 8 12:39 core
drwxrwxr-x 3 1000 1000 4096 May 26 14:47 data
drwxrwxr-x 4 1000 1000 4096 May 26 14:47 dev
drwxr-xr-x 9 root root 4096 Jun 8 13:07 dist
drwxrwxr-x 3 1000 1000 4096 May 26 14:47 docker
drwxrwxr-x 7 1000 1000 4096 May 26 14:47 docs
drwxrwxr-x 4 1000 1000 4096 May 26 14:47 ec2
drwxrwxr-x 4 1000 1000 4096 Jun 8 13:07 examples
drwxrwxr-x 7 1000 1000 4096 May 26 14:47 external
drwxrwxr-x 4 1000 1000 4096 May 26 14:47 extras
drwxrwxr-x 5 1000 1000 4096 Jun 8 12:45 graphx
drwxr-xr-x 3 root root 4096 Jun 8 12:59 lib_managed
-rw-rw-r-- 1 1000 1000 29983 May 26 14:47 LICENSE
-rwxrwxr-x 1 1000 1000 8126 May 26 14:47 make-distribution.sh
drwxrwxr-x 5 1000 1000 4096 Jun 8 12:51 mllib
-rw-rw-r-- 1 1000 1000 22559 May 26 14:47 NOTICE
-rw-rw-r-- 1 1000 1000 35121 May 26 14:47 pom.xml
drwxrwxr-x 4 1000 1000 4096 May 26 14:47 project
drwxrwxr-x 6 1000 1000 4096 Jun 8 12:08 python
-rw-rw-r-- 1 1000 1000 4221 May 26 14:47 README.md
drwxrwxr-x 4 1000 1000 4096 Jun 8 12:59 repl
drwxrwxr-x 2 1000 1000 4096 May 26 14:47 sbin
drwxrwxr-x 2 1000 1000 4096 May 26 14:47 sbt
-rw-rw-r-- 1 1000 1000 7703 May 26 14:47 scalastyle-config.xml
-rw-r--r-- 1 root root 187677812 Jun 8 13:07 spark-1.0.0-bin-2.4.0.tgz
drwxrwxr-x 5 1000 1000 4096 May 26 14:47 sql
drwxrwxr-x 4 1000 1000 4096 Jun 8 12:52 streaming
drwxr-xr-x 5 root root 4096 Jun 8 12:39 target
drwxrwxr-x 4 1000 1000 4096 Jun 8 12:52 tools
-rw-rw-r-- 1 1000 1000 805 May 26 14:47 tox.ini
drwxrwxr-x 6 1000 1000 4096 Jun 8 13:00 yarn
Copy the spark-1.0.0-bin-2.4.0.tgz package to the directory you want to deploy to and unpack it.
Note: you only need to copy the unpacked build onto a single machine in the YARN cluster. One node is enough; there is no need to deploy it on every node, unless you want multiple client nodes submitting Spark jobs.
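As a quick smoke test after unpacking, you can submit the bundled SparkPi example to the YARN cluster (a sketch: the unpacked directory name and the examples jar filename are assumptions based on this build's version numbers):
tar -xzf spark-1.0.0-bin-2.4.0.tgz -C /opt
cd /opt/spark-1.0.0-bin-2.4.0                          # assumed directory name inside the tarball
export HADOOP_CONF_DIR=/opt/hadoop-2.4.0/etc/hadoop    # point Spark at the cluster config
./bin/spark-submit --master yarn-cluster \
  --class org.apache.spark.examples.SparkPi \
  lib/spark-examples-1.0.0-hadoop2.4.0.jar 10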
For convenience, a pre-built package is available here:
http://pan.baidu.com/s/1dD9udET