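For setting up the Hadoop 2.6.0 distributed cluster, see: http://kevin12.iteye.com/blog/2273532
1. Extract the Spark package with tar -zxvf spark-1.6.0-bin-hadoop2.6.tgz and move the result under /usr/local/spark;
Add the Spark environment variables to ~/.bashrc, save and exit, then run source ~/.bashrc to apply them;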
export JAVA_HOME=/usr/local/jdk/jdk1.8.0_60
export JRE_HOME=${JAVA_HOME}/jre
export CLASS_PATH=.:${JAVA_HOME}/lib:${JRE_HOME}/lib
export SCALA_HOME=/usr/local/scala/scala-2.10.4
export HADOOP_HOME=/usr/local/hadoop/hadoop-2.6.0
export HADOOP_CONF_DIR=${HADOOP_HOME}/etc/hadoop
export HADOOP_COMMON_LIB_NATIVE_DIR=${HADOOP_HOME}/lib/native
export HADOOP_OPTS="-Djava.library.path=${HADOOP_HOME}/lib"
export SPARK_HOME=/usr/local/spark/spark-1.6.0-bin-hadoop2.6
export PATH=.:${JAVA_HOME}/bin:${SCALA_HOME}/bin:${HADOOP_HOME}/bin:${HADOOP_HOME}/sbin:${SPARK_HOME}/bin:$PATH
Then run the commands below to copy .bashrc from master1 to the four workers.
root@master1:~# scp ~/.bashrc root@worker1:~/
root@master1:~# scp ~/.bashrc root@worker2:~/
root@master1:~# scp ~/.bashrc root@worker3:~/
root@master1:~# scp ~/.bashrc root@worker4:~/
Run source ~/.bashrc on each of the four workers to apply the configuration.
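The four copy commands can also be written as a loop. This is a minimal sketch, assuming the hostnames worker1 through worker4 used in this post and passwordless SSH for root; the function name push_bashrc and the DRY_RUN switch are illustrative additions, not part of the original setup.

```shell
#!/usr/bin/env bash
# Copy ~/.bashrc from master1 to every worker.
# Assumptions: worker1..worker4 resolve and root can ssh without a password.
# DRY_RUN=1 (the default here) only prints the commands; DRY_RUN=0 runs them.
WORKERS="worker1 worker2 worker3 worker4"
DRY_RUN="${DRY_RUN:-1}"

push_bashrc() {
  for host in $WORKERS; do
    if [ "$DRY_RUN" = "1" ]; then
      echo "scp $HOME/.bashrc root@${host}:~/"
    else
      scp "$HOME/.bashrc" "root@${host}:~/" || return 1
    fi
  done
}

push_bashrc
```

Running it with DRY_RUN=0 performs the copies; the default only lists what would be executed, which is a safer way to check the host list first.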
2. Configure the Spark environment
2.1 In the conf directory, copy spark-env.sh.template to spark-env.sh and edit it.
root@master1:/usr/local/spark/spark-1.6.0-bin-hadoop2.6/conf# cp spark-env.sh.template spark-env.sh
root@master1:/usr/local/spark/spark-1.6.0-bin-hadoop2.6/conf# vim spark-env.sh
export JAVA_HOME=/usr/local/jdk/jdk1.8.0_60
export SCALA_HOME=/usr/local/scala/scala-2.10.4
export HADOOP_HOME=/usr/local/hadoop/hadoop-2.6.0
export HADOOP_CONF_DIR=${HADOOP_HOME}/etc/hadoop
export SPARK_MASTER_IP=master1
export SPARK_WORKER_MEMORY=2g
export SPARK_EXECUTOR_MEMORY=2g
export SPARK_DRIVER_MEMORY=2g
export SPARK_WORKER_CORES=4
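Note: HADOOP_CONF_DIR tells Spark where the Hadoop configuration files are; it is what allows Spark to run in YARN mode, so it is essential.
Set SPARK_WORKER_MEMORY, SPARK_EXECUTOR_MEMORY, SPARK_DRIVER_MEMORY and SPARK_WORKER_CORES according to your own cluster's resources.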
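A misspelled name in spark-env.sh is just an unused shell variable, so the setting it was meant to carry never takes effect and nothing complains. The sketch below is one way to catch that: it checks that the file contains an export line for each expected name. check_spark_env is a hypothetical helper written for this post, and the list of names simply mirrors the settings above.

```shell
#!/usr/bin/env bash
# Report which expected settings a spark-env.sh-style file exports.
# check_spark_env is a hypothetical helper; it does a plain text match on
# "export NAME=" lines and does not evaluate the file.
check_spark_env() {
  env_file="$1"
  if [ ! -r "$env_file" ]; then
    echo "cannot read $env_file" >&2
    return 2
  fi
  missing=0
  for name in JAVA_HOME SCALA_HOME HADOOP_HOME HADOOP_CONF_DIR \
              SPARK_MASTER_IP SPARK_WORKER_MEMORY SPARK_EXECUTOR_MEMORY \
              SPARK_DRIVER_MEMORY SPARK_WORKER_CORES; do
    if grep -Eq "^[[:space:]]*export[[:space:]]+${name}=" "$env_file"; then
      echo "OK $name"
    else
      echo "MISSING $name"
      missing=1
    fi
  done
  return "$missing"
}
```

A call such as check_spark_env /usr/local/spark/spark-1.6.0-bin-hadoop2.6/conf/spark-env.sh prints one line per name and returns non-zero if any is absent.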
Configure slaves:
root@master1:/usr/local/spark/spark-1.6.0-bin-hadoop2.6/conf# cp slaves.template slaves
root@master1:/usr/local/spark/spark-1.6.0-bin-hadoop2.6/conf# vim slaves
# A Spark Worker will be started on each of the machines listed below.
worker1
worker2
worker3
worker4
Configure spark-defaults.conf:
root@master1:/usr/local/spark/spark-1.6.0-bin-hadoop2.6/conf# cp spark-defaults.conf.template spark-defaults.conf
# Add the following settings:
spark.executor.extraJavaOptions -XX:+PrintGCDetails -Dkey=value -Dnumbers="one two three"
spark.eventLog.enabled true
spark.eventLog.dir hdfs://master1:9000/historyserverforSpark
spark.yarn.historyServer.address master1:18080
spark.history.fs.logDirectory hdfs://master1:9000/historyserverforSpark
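Note: with spark.eventLog.enabled turned on and spark.eventLog.dir set, the event logs of every application run on the cluster are recorded, which makes operations and troubleshooting easier.
Use scp to sync the Spark installation configured on master1 to the workers.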
root@master1:/usr/local# scp -r spark/ root@worker1:/usr/local/
root@master1:/usr/local# scp -r spark/ root@worker2:/usr/local/
root@master1:/usr/local# scp -r spark/ root@worker3:/usr/local/
root@master1:/usr/local# scp -r spark/ root@worker4:/usr/local/
Then check /usr/local/ on each worker to confirm that spark was copied over.
Create a historyserverforSpark directory on HDFS:
root@master1:/usr/local# hdfs dfs -mkdir /historyserverforSpark
16/01/24 07:46:24 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
root@master1:/usr/local# hdfs dfs -ls /
16/01/24 07:46:31 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 1 items
drwxr-xr-x - root supergroup 0 2016-01-24 07:46 /historyserverforSpark
You can view the newly created directory in the browser.
3. Start Spark
root@master1:/usr/local/spark/spark-1.6.0-bin-hadoop2.6/sbin# ./start-all.sh
starting org.apache.spark.deploy.master.Master, logging to /usr/local/spark/spark-1.6.0-bin-hadoop2.6/logs/spark-root-org.apache.spark.deploy.master.Master-1-master1.out
worker4: starting org.apache.spark.deploy.worker.Worker, logging to /usr/local/spark/spark-1.6.0-bin-hadoop2.6/logs/spark-root-org.apache.spark.deploy.worker.Worker-1-worker4.out
worker4: failed to launch org.apache.spark.deploy.worker.Worker:
worker4: full log in /usr/local/spark/spark-1.6.0-bin-hadoop2.6/logs/spark-root-org.apache.spark.deploy.worker.Worker-1-worker4.out
worker1: starting org.apache.spark.deploy.worker.Worker, logging to /usr/local/spark/spark-1.6.0-bin-hadoop2.6/logs/spark-root-org.apache.spark.deploy.worker.Worker-1-worker1.out
worker3: starting org.apache.spark.deploy.worker.Worker, logging to /usr/local/spark/spark-1.6.0-bin-hadoop2.6/logs/spark-root-org.apache.spark.deploy.worker.Worker-1-worker3.out
worker1: failed to launch org.apache.spark.deploy.worker.Worker:
worker1: full log in /usr/local/spark/spark-1.6.0-bin-hadoop2.6/logs/spark-root-org.apache.spark.deploy.worker.Worker-1-worker1.out
worker3: failed to launch org.apache.spark.deploy.worker.Worker:
worker3: full log in /usr/local/spark/spark-1.6.0-bin-hadoop2.6/logs/spark-root-org.apache.spark.deploy.worker.Worker-1-worker3.out
worker2: starting org.apache.spark.deploy.worker.Worker, logging to /usr/local/spark/spark-1.6.0-bin-hadoop2.6/logs/spark-root-org.apache.spark.deploy.worker.Worker-1-worker2.out
worker2: failed to launch org.apache.spark.deploy.worker.Worker:
worker2: full log in /usr/local/spark/spark-1.6.0-bin-hadoop2.6/logs/spark-root-org.apache.spark.deploy.worker.Worker-1-worker2.out
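The output above shows that the worker nodes failed to start, yet the logs contain no errors. The cause lay in the virtual machines themselves, though exactly what went wrong is still unclear;
Stop the Spark cluster with ./sbin/stop-all.sh, delete all logs under /usr/local/spark/spark-1.6.0-bin-hadoop2.6/logs on every node, and start again; this time the cluster starts successfully.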
root@master1:/usr/local/spark/spark-1.6.0-bin-hadoop2.6/sbin# ./start-all.sh
starting org.apache.spark.deploy.master.Master, logging to /usr/local/spark/spark-1.6.0-bin-hadoop2.6/logs/spark-root-org.apache.spark.deploy.master.Master-1-master1.out
worker3: starting org.apache.spark.deploy.worker.Worker, logging to /usr/local/spark/spark-1.6.0-bin-hadoop2.6/logs/spark-root-org.apache.spark.deploy.worker.Worker-1-worker3.out
worker2: starting org.apache.spark.deploy.worker.Worker, logging to /usr/local/spark/spark-1.6.0-bin-hadoop2.6/logs/spark-root-org.apache.spark.deploy.worker.Worker-1-worker2.out
worker1: starting org.apache.spark.deploy.worker.Worker, logging to /usr/local/spark/spark-1.6.0-bin-hadoop2.6/logs/spark-root-org.apache.spark.deploy.worker.Worker-1-worker1.out
worker4: starting org.apache.spark.deploy.worker.Worker, logging to /usr/local/spark/spark-1.6.0-bin-hadoop2.6/logs/spark-root-org.apache.spark.deploy.worker.Worker-1-worker4.out
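The stop, clean and restart sequence used here can be scripted so it runs the same way on every node. This is only a sketch under stated assumptions: passwordless root SSH to master1 and worker1 through worker4, the same SPARK_HOME on every host, and helper names (run, restart_clean, DRY_RUN) invented for illustration. Note that it deletes everything under the Spark logs directory, as the manual step does.

```shell
#!/usr/bin/env bash
# Stop the cluster, clear old Spark logs on every node, then start again.
# Assumptions: passwordless root ssh to all hosts; same SPARK_HOME everywhere.
# DRY_RUN=1 (default) only prints the commands; DRY_RUN=0 executes them.
SPARK_HOME="${SPARK_HOME:-/usr/local/spark/spark-1.6.0-bin-hadoop2.6}"
NODES="master1 worker1 worker2 worker3 worker4"
DRY_RUN="${DRY_RUN:-1}"

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

restart_clean() {
  run "${SPARK_HOME}/sbin/stop-all.sh" || return 1
  for host in $NODES; do
    # The glob is quoted so it expands on the remote host, not locally.
    run ssh "root@${host}" "rm -f ${SPARK_HOME}/logs/*" || return 1
  done
  run "${SPARK_HOME}/sbin/start-all.sh"
}

restart_clean
```

With the default DRY_RUN=1 the script prints seven commands (stop, five log cleanups, start) without touching the cluster.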
Use jps to confirm that the Master and Worker processes are running:
On master1:
root@master1:/usr/local/spark/spark-1.6.0-bin-hadoop2.6/sbin# jps
4551 ResourceManager
7255 Jps
7143 Master
4379 SecondaryNameNode
4175 NameNode
On worker1:
root@worker1:/usr/local/spark/spark-1.6.0-bin-hadoop2.6/logs# jps
4528 Worker
2563 DataNode
2713 NodeManager
4606 Jps
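Reading jps output by eye works for five machines; for a scripted check, the second column can be matched exactly. A small sketch, where has_daemon is a hypothetical helper and the sample text is the master1 output shown above:

```shell
#!/usr/bin/env bash
# has_daemon NAME: read jps-style output on stdin and succeed if some line's
# second column equals NAME exactly (so "Worker" does not match "NodeManager").
has_daemon() {
  awk -v name="$1" '$2 == name { found = 1 } END { exit found ? 0 : 1 }'
}

# Demo on the master1 output from this post; on a live node use: jps | has_daemon Master
sample="4551 ResourceManager
7255 Jps
7143 Master
4379 SecondaryNameNode
4175 NameNode"

if printf '%s\n' "$sample" | has_daemon Master; then
  echo "Master is running"
fi
```

On a worker the equivalent check would be jps | has_daemon Worker.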
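Open http://192.168.112.130:8080/ in a browser to see the console; it shows 4 nodes.
At this point the Spark cluster is fully set up!
Start the history-server process, which records how the cluster has been running, so that earlier run information can still be restored after a restart.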
root@master1:/usr/local/spark/spark-1.6.0-bin-hadoop2.6/sbin# ./start-history-server.sh
starting org.apache.spark.deploy.history.HistoryServer, logging to /usr/local/spark/spark-1.6.0-bin-hadoop2.6/logs/spark-root-org.apache.spark.deploy.history.HistoryServer-1-master1.out
View the History Server at http://192.168.112.130:18080/
Run an example: computing pi
Location: /usr/local/spark/spark-1.6.0-bin-hadoop2.6/examples/src/main/scala/org/apache/spark/examples
The source code is as follows:
// scalastyle:off println
package org.apache.spark.examples

import scala.math.random

import org.apache.spark._

/** Computes an approximation to pi */
object SparkPi {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("Spark Pi")
    val spark = new SparkContext(conf)
    val slices = if (args.length > 0) args(0).toInt else 2
    val n = math.min(100000L * slices, Int.MaxValue).toInt // avoid overflow
    val count = spark.parallelize(1 until n, slices).map { i =>
      val x = random * 2 - 1
      val y = random * 2 - 1
      if (x*x + y*y < 1) 1 else 0
    }.reduce(_ + _)
    println("Pi is roughly " + 4.0 * count / n)
    spark.stop()
  }
}
// scalastyle:on println
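To see what the job computes without a cluster, the same Monte Carlo estimate can be run in a single process. The sketch below is not the Spark implementation: it is an awk loop wrapped in a shell function, with the name estimate_pi and the optional seed argument introduced here for illustration. It draws points uniformly in the square [-1, 1] x [-1, 1], counts those inside the unit circle, and multiplies the fraction by 4.

```shell
#!/usr/bin/env bash
# estimate_pi N [SEED]: single-process version of the SparkPi sampling loop.
# N is the sample bound; SEED (default 42) is a hypothetical convenience
# so repeated runs on the same awk give the same estimate.
estimate_pi() {
  awk -v n="$1" -v seed="${2:-42}" 'BEGIN {
    srand(seed)
    count = 0
    for (i = 1; i < n; i++) {        # mirrors "1 until n": n - 1 points
      x = rand() * 2 - 1
      y = rand() * 2 - 1
      if (x * x + y * y < 1) count++ # point fell inside the unit circle
    }
    printf "Pi is roughly %.5f\n", 4.0 * count / n
  }'
}

estimate_pi 200000
```

The exact digits depend on the awk implementation's random generator. Spark's version splits the same loop across slices and sums the per-slice counts with reduce.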
Set the parallelism (the slices argument) to 5000, so that the job runs long enough to watch in the browser;
root@master1:/usr/local/spark/spark-1.6.0-bin-hadoop2.6/bin# ./spark-submit --class org.apache.spark.examples.SparkPi --master spark://master1:7077 ../lib/spark-examples-1.6.0-hadoop2.6.0.jar 5000
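The argument 5000 is the slices value in the SparkPi source, and it also determines the total number of sampled points through n = min(100000 * slices, Int.MaxValue). A quick arithmetic sketch of that formula, assuming a shell with 64-bit integer arithmetic; sample_count is a name made up for this example:

```shell
#!/usr/bin/env bash
# sample_count SLICES: reproduce n = min(100000 * slices, Int.MaxValue)
# from the SparkPi source. Assumes the shell uses 64-bit integers.
sample_count() {
  n=$((100000 * $1))
  max=2147483647            # Int.MaxValue on the JVM
  if [ "$n" -gt "$max" ]; then
    n=$max
  fi
  echo "$n"
}

sample_count 5000    # prints 500000000
sample_count 30000   # prints 2147483647 (capped to avoid Int overflow)
```

So this run samples 500,000,000 points spread over 5000 slices, about 100,000 points per slice.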
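View the job in the browser:
Result: Pi is roughly 3.14156656
Judging from the printed log, why does the program start so quickly?
Answer: because Spark uses coarse-grained resource allocation;
Coarse-grained means resources are allocated once, at the moment the program starts and initializes; later computation simply uses those resources instead of requesting them again for each computation.
Coarse-grained allocation suits workloads with many jobs that need to reuse resources. One drawback: under high parallelism, if one job runs for a long time while the others finish quickly, the resources held for the short ones sit idle and are wasted.
Fine-grained means resources are allocated only when the program actually computes and are reclaimed as soon as the computation finishes.
Use the History Server to look at the run: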