Environment
Preparing the environment for Spark on YARN:
Spark: 0.9.1 release. Be sure to pick a release build (not an incubating one); you will hit far fewer pitfalls. Download page: http://spark.apache.org/downloads.html
Hadoop: 2.0.0-cdh4.2.1, MRv2 (YARN)
Shell environment: cygwin (a Git console also works)
Build machine: plenty of memory (say, 2 GB or more)
If your bandwidth is good, you are 30% of the way there. If you also have an SSD, you are 80% of the way there.
Preparation
After downloading, unpack Spark, then make a few changes to SparkBuild.scala. SparkBuild.scala is the configuration file of sbt, the build system Spark uses; it plays the same role as Maven's pom file.
Edit project/SparkBuild.scala:
1) Force the protobuf dependency: 2.4.0a is the version used by hadoop 2.0.0-cdh4.2.1, and leaving the version unpinned causes problems (see the quick check after this list).
libraryDependencies ++= Seq(
  "com.google.protobuf" % "protobuf-java" % "2.4.0a" force(),
  // ... the existing dependencies stay here ...
)
2) Adjust the Maven repos. Search for resolvers ++= Seq; if you have a local Maven repo, point the resolvers at it.
3) Set the local Maven repo path. If you have used Maven before, you can reuse its local repo:
resolvers ++= Seq(Resolver.file("Local Maven Repo", file("D:/repository"))),
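Before pinning, it can be worth confirming which protobuf-java your Hadoop distribution actually ships. A minimal check, assuming hadoop lives under /usr/lib/hadoop (adjust the path to your install):

# Locate the protobuf jar bundled with the hadoop install
find /usr/lib/hadoop* -name 'protobuf-java-*.jar'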
Build
From the Spark home directory, run:
SPARK_HADOOP_VERSION=2.0.0-cdh4.2.1 SPARK_YARN=true sbt/sbt assembly
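When the build finishes, a quick sanity check that the assembly jar was actually produced (the file name follows from the SPARK_HADOOP_VERSION above):

# The YARN-enabled assembly jar should now exist
ls -lh assembly/target/scala-2.10/spark-assembly-0.9.1-hadoop2.0.0-cdh4.2.1.jar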
The build is slow, about an hour, and longer still if dependencies have to be downloaded.
Tip: if downloads fail, try going through a proxy: export HTTP_PROXY=http://127.0.0.1:port
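If your network routes https separately (an assumption about your environment, not a build requirement), setting both variables can help:

# Route both http and https downloads through the local proxy
export HTTP_PROXY=http://127.0.0.1:port
export HTTPS_PROXY=http://127.0.0.1:port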
Copy
Copy what you need onto the hadoop cluster, e.g. a spark directory with the contents below (a staging sketch follows the list):
Everything under bin and conf:
./bin/run-example
./bin/spark-shell
./bin/pyspark
./bin/compute-classpath.sh
./bin/spark-class
./conf/fair-scheduler.xml
./conf/metrics.properties.template
./conf/spark-env.sh
./conf/fairscheduler.xml.template
./conf/log4j.properties
./conf/slaves
和spark & example:
./assembly/target/scala-2.10/spark-assembly-0.9.1-hadoop2.0.0-cdh4.2.1.jar
./examples/target/scala-2.10/spark-examples-assembly-0.9.1.jar
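One way to stage exactly these files into ~/spark, the directory the distribution script below expects; a sketch to run from the Spark home:

# Stage binaries, configs and assembly jars for distribution
mkdir -p ~/spark/bin ~/spark/conf
cp bin/* ~/spark/bin/
cp conf/* ~/spark/conf/
cp assembly/target/scala-2.10/spark-assembly-0.9.1-hadoop2.0.0-cdh4.2.1.jar ~/spark/
cp examples/target/scala-2.10/spark-examples-assembly-0.9.1.jar ~/spark/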
Distribute
You can distribute the staged directory to every node with a script along these lines:
#!/bin/sh
# Author: Meng Zang <[email protected]>
# Date: 2014-04-11

NODES_ADDRESS=~/spark/nodes
TMP_SPARK=~/spark
SPARK=/usr/local/spark

sync() {
    # sudo mkdir ${SPARK}
    # sudo cp $TMP_SPARK/* $SPARK
    for node in $(cat $NODES_ADDRESS)
    do
        ssh -p 1022 $node "mkdir -p $TMP_SPARK"
        ssh -p 1022 $node "sudo mkdir -p $SPARK"
        rsync -vaz --delete -e 'ssh -p 1022' $TMP_SPARK/ $node:$TMP_SPARK
        # -r so the bin/ and conf/ subdirectories are copied too
        ssh -p 1022 $node "sudo cp -r $TMP_SPARK/* $SPARK"
    done
}

case "$1" in
'sync')
    sync
    ;;
*)
    echo "Usage: $0 {sync}"
    exit 1
esac

exit 0
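To use it, save the script as e.g. sync.sh, list the target hosts one per line in ~/spark/nodes, and run:

./sync.sh sync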
Run the example
# step into spark dir
export HADOOP_CONF_DIR=/etc/hadoop/conf
export YARN_CONF_DIR=/etc/hadoop/conf

SPARK_JAR=./assembly/target/scala-2.10/spark-assembly-0.9.1-hadoop2.0.0-cdh4.2.1.jar \
./bin/spark-class org.apache.spark.deploy.yarn.Client \
    --jar examples/target/scala-2.10/spark-examples-assembly-0.9.1.jar \
    --class org.apache.spark.examples.SparkPi \
    --args yarn-standalone \
    --num-workers 3 \
    --master-memory 4g \
    --worker-memory 2g \
    --worker-cores 1
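Once submitted, you can also check on the application from the shell (a hedged example; it assumes the yarn CLI from your Hadoop install is on the PATH):

# List YARN applications; the SparkPi job should show up as RUNNING
yarn application -list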
While it runs, you can watch progress in the MRv2 web console; the client console also prints periodic application reports:
14/04/11 16:43:44 INFO Client: Application report from ASM:
     application identifier: application_1397119957695_0037
     appId: 37
     clientToken: null
     appDiagnostics:
     appMasterHost: SVR2370HP360.hadoop.test.sh.ctriptravel.com
     appQueue: default
     appMasterRpcPort: 0
     appStartTime: 1397205809221
     yarnAppState: RUNNING
     distributedFinalState: UNDEFINED
     appTrackingUrl: SVR2368HP360.hadoop.test.sh.ctriptravel.com:8088/proxy/application_1397119957695_0037/
     appUser: op1
...
14/04/11 16:43:46 INFO Client: Application report from ASM:
     application identifier: application_1397119957695_0037
     appId: 37
     clientToken: null
     appDiagnostics:
     appMasterHost: SVR2370HP360.hadoop.test.sh.ctriptravel.com
     appQueue: default
     appMasterRpcPort: 0
     appStartTime: 1397205809221
     yarnAppState: FINISHED
     distributedFinalState: SUCCEEDED
     appTrackingUrl:
     appUser: op1
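Once the report shows FINISHED / SUCCEEDED, SparkPi's output (the "Pi is roughly ..." line) ends up in the application logs rather than the client console. If log aggregation is enabled on the cluster (an assumption about your setup), the logs can be fetched with the applicationId from the report above:

# Fetch the aggregated logs of the finished application
yarn logs -applicationId application_1397119957695_0037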