Spark on YARN: Building, Deploying, and Testing

 

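Environment

Prerequisites for running Spark on YARN:

Spark: 0.9.1 release. Be sure to pick the release version (not the incubating one); you will hit fewer pitfalls. Download page: http://spark.apache.org/downloads.html

Hadoop: 2.0.0-cdh4.2.1, with MRv2 (YARN)

Shell: Cygwin (the Git console also works)

Build machine: enough memory (2 GB or more)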

 

With enough bandwidth you are 30% of the way there. With an SSD you are 80% of the way there.

 

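Preparation

After downloading, unpack Spark and make a few changes to SparkBuild.scala. SparkBuild.scala is the configuration file for sbt, the build system Spark uses. It plays the same role as the pom file in Maven.

 

Edit project/SparkBuild.scala:

1) Force the protobuf dependency: this is the version that hadoop 2.0.0-cdh4.2.1 uses. Leaving the version unpinned causes problems.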

    libraryDependencies ++= Seq(

   "com.google.protobuf"      % "protobuf-java"    % "2.4.0a" force(),

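2) Change the Maven repositories. Search for resolvers ++= Seq; if you have a local Maven repo, point the resolvers at it.

3) Set the path of the local Maven repo. If you have used Maven before, you can reuse its local repo: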

        resolvers ++= Seq(Resolver.file("Local Maven Repo", file("D:/repository"))),
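
Before starting a build that takes about an hour, it can help to confirm that both edits actually landed in the file. The following is a minimal sketch in shell. The function name check_build_edits is hypothetical, and it only greps for the strings shown above, so it does not replace sbt's own validation.

```shell
# Hypothetical helper: check that SparkBuild.scala contains the two edits described above.
# Usage: check_build_edits project/SparkBuild.scala
check_build_edits() {
    build_file=$1
    status=0
    # Edit 1: the forced protobuf version used by hadoop 2.0.0-cdh4.2.1
    if ! grep -q '"protobuf-java"[[:space:]]*%[[:space:]]*"2.4.0a"[[:space:]]*force()' "$build_file"; then
        echo "missing: forced protobuf-java 2.4.0a dependency"
        status=1
    fi
    # Edit 3: the resolver that points at the local Maven repo
    if ! grep -q 'Resolver.file("Local Maven Repo"' "$build_file"; then
        echo "missing: local Maven repo resolver"
        status=1
    fi
    return $status
}
```

Calling check_build_edits project/SparkBuild.scala from the Spark home prints nothing and returns 0 when both edits are present.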

 

Build

Go to the Spark home directory and run:

SPARK_HADOOP_VERSION=2.0.0-cdh4.2.1 SPARK_YARN=true sbt/sbt assembly

 

The build is slow, about an hour. If dependencies still have to be downloaded, there is no telling how long it will take.
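
Once the build finishes, a quick check that the assembly jar exists avoids discovering a failed build only at submit time. This is a minimal sketch, assuming the output layout matches the jar paths listed in the copy step of this post. find_assembly_jar is a hypothetical helper name.

```shell
# Hypothetical helper: print the path of the Spark assembly jar under a Spark home,
# or fail if the build did not produce one.
# Usage: find_assembly_jar /path/to/spark
find_assembly_jar() {
    spark_home=$1
    for jar in "$spark_home"/assembly/target/scala-*/spark-assembly-*.jar; do
        # An unmatched glob stays literal, so test that the file really exists
        if [ -f "$jar" ]; then
            echo "$jar"
            return 0
        fi
    done
    echo "no assembly jar found under $spark_home; did the build succeed?" >&2
    return 1
}
```

From the Spark home, find_assembly_jar . prints the jar path, which can then be used as the SPARK_JAR value when submitting.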

 

Tip: if downloads fail, try going through a proxy: export HTTP_PROXY=http://127.0.0.1:port
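
If the JVM that runs the build does not pick up HTTP_PROXY, the standard Java networking properties http.proxyHost and http.proxyPort are another option. The sketch below only converts a proxy URL into those two flags. The helper name proxy_to_jvm_opts is hypothetical, and how the flags reach the JVM started by sbt/sbt (for example through an options environment variable) depends on the launcher script and is an assumption to verify.

```shell
# Hypothetical helper: turn a proxy URL such as http://127.0.0.1:8080 into
# the standard Java proxy system properties.
# Assumes the URL carries an explicit port.
proxy_to_jvm_opts() {
    url=$1
    hostport=${url#*://}      # strip the scheme, if any
    hostport=${hostport%%/*}  # strip any trailing path
    host=${hostport%%:*}      # part before the colon
    port=${hostport##*:}      # part after the colon
    echo "-Dhttp.proxyHost=$host -Dhttp.proxyPort=$port"
}
```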

 

Copy

Copy what is needed to the Hadoop cluster, for example as a spark directory. The directory contains the following:

All files under bin and conf:

./bin/run-example

./bin/spark-shell

./bin/pyspark

./bin/compute-classpath.sh

./bin/spark-class

./conf/fair-scheduler.xml

./conf/metrics.properties.template

./conf/spark-env.sh

./conf/fairscheduler.xml.template

./conf/log4j.properties

./conf/slaves

 

And the Spark and examples assembly jars:

./assembly/target/scala-2.10/spark-assembly-0.9.1-hadoop2.0.0-cdh4.2.1.jar

./examples/target/scala-2.10/spark-examples-assembly-0.9.1.jar
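
The file list above can be collected into a staging directory with a short function. This is a sketch only: stage_spark is a hypothetical name, and it assumes the relative paths should be preserved so that the jar paths used later in the run command still resolve.

```shell
# Hypothetical helper: copy bin/, conf/ and the two assembly jars from a Spark
# build tree into a staging directory, preserving relative paths.
# Usage: stage_spark /path/to/spark-build /path/to/staging
stage_spark() {
    src=$1
    dest=$2
    mkdir -p "$dest" || return 1
    cp -r "$src/bin" "$src/conf" "$dest/" || return 1
    for jar in "$src"/assembly/target/scala-*/spark-assembly-*.jar \
               "$src"/examples/target/scala-*/spark-examples-assembly-*.jar; do
        # An unmatched glob stays literal, so a missing jar is caught here
        [ -f "$jar" ] || { echo "missing jar: $jar" >&2; return 1; }
        rel=${jar#"$src"/}                        # path relative to the build tree
        mkdir -p "$dest/$(dirname "$rel")" || return 1
        cp "$jar" "$dest/$rel" || return 1
    done
}
```

For example, stage_spark . ~/spark run from the Spark home fills ~/spark with the layout shown above.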

 

Distribute

A script like the following can be used to distribute the files:

#!/bin/sh
#Author: Meng Zang <[email protected]>
#Date: 2014-04-11

NODES_ADDRESS=~/spark/nodes

TMP_SPARK=~/spark

SPARK=/usr/local/spark

sync() 
{
#    sudo mkdir ${SPARK}
#    sudo cp -r $TMP_SPARK/* $SPARK

    # Push the staged directory to every node listed in $NODES_ADDRESS
    for node in $(cat $NODES_ADDRESS) 
    do
        ssh -p 1022 $node "mkdir -p $TMP_SPARK"
        ssh -p 1022 $node "sudo mkdir -p $SPARK"
        rsync -vaz --delete -e 'ssh -p 1022' $TMP_SPARK/ $node:$TMP_SPARK
        # -r is required: the content lives in subdirectories (bin, conf, ...)
        ssh -p 1022 $node "sudo cp -r $TMP_SPARK/* $SPARK"
    done  
}

case "$1" in
   'sync')
     sync
     ;;
   *)
     echo "Usage: $0 {sync}"
     exit 1
esac

exit 0
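
The script reads the node list with cat, so any comment line in the nodes file would be treated as a host. A small filter avoids that. This is a sketch under an assumption: the format of the nodes file is not shown in this post, so one hostname per line is assumed, and list_nodes is a hypothetical helper name.

```shell
# Hypothetical helper: print the hosts from a nodes file, one per line,
# skipping blank lines and lines that start with '#'.
# Assumes one hostname per line (the file format is not shown in the post).
# Usage: list_nodes ~/spark/nodes
list_nodes() {
    sed -e 's/[[:space:]]*$//' -e '/^$/d' -e '/^#/d' "$1"
}
```

In the script above, the loop header could then read: for node in $(list_nodes "$NODES_ADDRESS")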

 

Run the example

 

# step into spark dir

export HADOOP_CONF_DIR=/etc/hadoop/conf 
export YARN_CONF_DIR=/etc/hadoop/conf 

SPARK_JAR=./assembly/target/scala-2.10/spark-assembly-0.9.1-hadoop2.0.0-cdh4.2.1.jar \
    ./bin/spark-class org.apache.spark.deploy.yarn.Client \
    --jar examples/target/scala-2.10/spark-examples-assembly-0.9.1.jar \
     --class org.apache.spark.examples.SparkPi \
     --args yarn-standalone \
     --num-workers 3 \
     --master-memory 4g \
     --worker-memory 2g \
     --worker-cores 1

 

While the job runs, you can follow it in the MRv2 (YARN) web console. The client also prints application reports to the terminal:

 

 

14/04/11 16:43:44 INFO Client: Application report from ASM: 
	 application identifier: application_1397119957695_0037
	 appId: 37
	 clientToken: null
	 appDiagnostics: 
	 appMasterHost: SVR2370HP360.hadoop.test.sh.ctriptravel.com
	 appQueue: default
	 appMasterRpcPort: 0
	 appStartTime: 1397205809221
	 yarnAppState: RUNNING
	 distributedFinalState: UNDEFINED
	 appTrackingUrl: SVR2368HP360.hadoop.test.sh.ctriptravel.com:8088/proxy/application_1397119957695_0037/
	 appUser: op1
14/04/11 16:43:45 INFO Client: Application report from ASM: 
	 application identifier: application_1397119957695_0037
	 appId: 37
	 clientToken: null
	 appDiagnostics: 
	 appMasterHost: SVR2370HP360.hadoop.test.sh.ctriptravel.com
	 appQueue: default
	 appMasterRpcPort: 0
	 appStartTime: 1397205809221
	 yarnAppState: RUNNING
	 distributedFinalState: UNDEFINED
	 appTrackingUrl: SVR2368HP360.hadoop.test.sh.ctriptravel.com:8088/proxy/application_1397119957695_0037/
	 appUser: op1
14/04/11 16:43:46 INFO Client: Application report from ASM: 
	 application identifier: application_1397119957695_0037
	 appId: 37
	 clientToken: null
	 appDiagnostics: 
	 appMasterHost: SVR2370HP360.hadoop.test.sh.ctriptravel.com
	 appQueue: default
	 appMasterRpcPort: 0
	 appStartTime: 1397205809221
	 yarnAppState: FINISHED
	 distributedFinalState: SUCCEEDED
	 appTrackingUrl: 
	 appUser: op1
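
When the submit command runs from a script, the final state can be pulled out of these reports instead of being read by eye. The following is a minimal sketch that assumes the report format shown above. last_app_state is a hypothetical helper name.

```shell
# Hypothetical helper: read the client output on stdin and print the last
# reported yarnAppState and distributedFinalState on one line.
last_app_state() {
    awk '
        $1 == "yarnAppState:"          { state = $2 }
        $1 == "distributedFinalState:" { final = $2 }
        END { print state, final }
    '
}
```

Piping the saved client output through last_app_state prints the two values from the most recent report, so a wrapper script can compare them against FINISHED and SUCCEEDED.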

 

 

 

 

 

 
