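This article sets up a distributed Spark 1.6.0 cluster on top of an existing Hadoop 2.6.0 distributed environment.
For the Hadoop 2.6.0 distributed cluster setup, see: http://kevin12.iteye.com/blog/2273532
1. Unpack the Spark package with tar -zxvf spark-1.6.0-bin-hadoop2.6.tgz and move it under /usr/local/spark;
Configure the Spark environment variables in ~/.bashrc, save and exit, then run source ~/.bashrc to apply them: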
export JAVA_HOME=/usr/local/jdk/jdk1.8.0_60
export JRE_HOME=${JAVA_HOME}/jre
export CLASS_PATH=.:${JAVA_HOME}/lib:${JRE_HOME}/lib
export SCALA_HOME=/usr/local/scala/scala-2.10.4
export HADOOP_HOME=/usr/local/hadoop/hadoop-2.6.0
export HADOOP_CONF_DIR=${HADOOP_HOME}/etc/hadoop
export HADOOP_COMMON_LIB_NATIVE_DIR=${HADOOP_HOME}/lib/native
export HADOOP_OPTS="-Djava.library.path=${HADOOP_HOME}/lib"
export SPARK_HOME=/usr/local/spark/spark-1.6.0-bin-hadoop2.6
export PATH=.:${JAVA_HOME}/bin:${SCALA_HOME}/bin:${HADOOP_HOME}/bin:${HADOOP_HOME}/sbin:${SPARK_HOME}/bin:$PATH
Then run the commands below to copy .bashrc from master1 to the four workers.
root@master1:~# scp ~/.bashrc root@worker1:~/
root@master1:~# scp ~/.bashrc root@worker2:~/
root@master1:~# scp ~/.bashrc root@worker3:~/
root@master1:~# scp ~/.bashrc root@worker4:~/
Run source ~/.bashrc on each of the four workers to apply the configuration.
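The four scp commands above differ only in the hostname, so they can be generated by a loop. The following is a minimal sketch, not part of the original setup: the function name push_bashrc and the DRY_RUN switch are made up for illustration, and the hostnames are the ones used in this article. By default it only prints the commands.

```shell
#!/bin/sh
# Hostnames from this article; adjust to your own cluster.
WORKERS="worker1 worker2 worker3 worker4"

# Print (DRY_RUN=1, the default) or run the scp command for every worker.
# push_bashrc and DRY_RUN are hypothetical names used only in this sketch.
push_bashrc() {
  for host in $WORKERS; do
    if [ "${DRY_RUN:-1}" = "1" ]; then
      echo "scp ~/.bashrc root@${host}:~/"
    else
      scp ~/.bashrc "root@${host}:~/"
    fi
  done
}

push_bashrc
```

Running it with DRY_RUN=0 performs the copies. The source ~/.bashrc step still has to happen in a shell on each worker.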
2. Configure Spark
2.1 Copy spark-env.sh.template under conf to spark-env.sh and edit it.
root@master1:/usr/local/spark/spark-1.6.0-bin-hadoop2.6/conf# cp spark-env.sh.template spark-env.sh
root@master1:/usr/local/spark/spark-1.6.0-bin-hadoop2.6/conf# vim spark-env.sh
export JAVA_HOME=/usr/local/jdk/jdk1.8.0_60
export SCALA_HOME=/usr/local/scala/scala-2.10.4
export HADOOP_HOME=/usr/local/hadoop/hadoop-2.6.0
export HADOOP_CONF_DIR=${HADOOP_HOME}/etc/hadoop
export SPARK_MASTER_IP=master1
export SPARK_WORKER_MEMORY=2g
export SPARK_EXECUTOR_MEMORY=2g
export SPARK_DRIVER_MEMORY=2g
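export SPARK_WORKER_CORES=4
Note: HADOOP_CONF_DIR tells Spark where the Hadoop client configuration lives, which is what allows Spark to run in YARN mode; this setting is critical.
Set SPARK_WORKER_MEMORY, SPARK_EXECUTOR_MEMORY, SPARK_DRIVER_MEMORY, and SPARK_WORKER_CORES according to the resources of your own cluster.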
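A misspelled variable name in spark-env.sh is not reported as an error; the setting is simply ignored. The sketch below lists exported names that are not among those this article sets. The EXPECTED list and the function name unexpected_exports are assumptions for illustration, and the list is not the complete set of variables Spark accepts.

```shell
#!/bin/sh
# Names this article sets in spark-env.sh (NOT an exhaustive list of what Spark reads).
EXPECTED="JAVA_HOME SCALA_HOME HADOOP_HOME HADOOP_CONF_DIR SPARK_MASTER_IP SPARK_WORKER_MEMORY SPARK_EXECUTOR_MEMORY SPARK_DRIVER_MEMORY SPARK_WORKER_CORES"

# Print every name exported as "export NAME=value" in file $1 that is not in EXPECTED.
# unexpected_exports is a hypothetical helper, not a Spark tool.
unexpected_exports() {
  sed -n 's/^[[:space:]]*export[[:space:]]\{1,\}\([A-Za-z_][A-Za-z0-9_]*\)=.*/\1/p' "$1" |
  while read -r name; do
    case " $EXPECTED " in
      *" $name "*) ;;
      *) echo "$name" ;;
    esac
  done
}

# Path from this article; skipped silently if the file is not there.
ENV_FILE=/usr/local/spark/spark-1.6.0-bin-hadoop2.6/conf/spark-env.sh
if [ -f "$ENV_FILE" ]; then
  unexpected_exports "$ENV_FILE"
fi
```

An empty result means every exported name is one of the names used here; any name it prints deserves a second look for typos.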
Configure slaves:
root@master1:/usr/local/spark/spark-1.6.0-bin-hadoop2.6/conf# cp slaves.template slaves
root@master1:/usr/local/spark/spark-1.6.0-bin-hadoop2.6/conf# vim slaves
# A Spark Worker will be started on each of the machines listed below.
worker1
worker2
worker3
worker4
Configure spark-defaults.conf:
root@master1:/usr/local/spark/spark-1.6.0-bin-hadoop2.6/conf# cp spark-defaults.conf.template spark-defaults.conf
# Add the following settings:
spark.executor.extraJavaOptions -XX:+PrintGCDetails -Dkey=value -Dnumbers="one two three"
spark.eventLog.enabled true
spark.eventLog.dir hdfs://master1:9000/historyserverforSpark
spark.yarn.historyServer.address master1:18080
spark.history.fs.logDirectory hdfs://master1:9000/historyserverforSpark
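Note: with spark.eventLog.enabled turned on and spark.eventLog.dir configured, the event logs of every application run on the cluster are recorded, which makes operations and troubleshooting easier.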
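Applications write their event logs to spark.eventLog.dir and the History Server reads from spark.history.fs.logDirectory, so the two settings above should name the same location. The sketch below checks that; conf_value and same_event_log_dir are hypothetical helper names, and the parser only handles simple whitespace-separated "key value" lines.

```shell
#!/bin/sh
# Print the first whitespace-delimited value of property $1 in file $2.
# conf_value is a hypothetical helper for simple "key value" lines only.
conf_value() {
  awk -v key="$1" '$1 == key { print $2; exit }' "$2"
}

# Succeed only if both directories are set and identical.
same_event_log_dir() {
  written=$(conf_value spark.eventLog.dir "$1")
  read_from=$(conf_value spark.history.fs.logDirectory "$1")
  [ -n "$written" ] && [ "$written" = "$read_from" ]
}

# Path from this article; skipped silently if the file is not there.
DEFAULTS=/usr/local/spark/spark-1.6.0-bin-hadoop2.6/conf/spark-defaults.conf
if [ -f "$DEFAULTS" ]; then
  if same_event_log_dir "$DEFAULTS"; then
    echo "event log directories match"
  else
    echo "event log directories differ or are missing"
  fi
fi
```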
Use scp to sync the Spark installation configured on master1 to the workers.
root@master1:/usr/local# scp -r spark/ root@worker1:/usr/local/
root@master1:/usr/local# scp -r spark/ root@worker2:/usr/local/
root@master1:/usr/local# scp -r spark/ root@worker3:/usr/local/
root@master1:/usr/local# scp -r spark/ root@worker4:/usr/local/
Then check the /usr/local/ directory on each worker to confirm that spark was copied over.
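That check can be driven from the slaves file instead of logging in to each machine by hand. The following is a sketch under assumptions: list_workers and check_commands are made-up helper names, and the script only prints the ssh commands rather than running them.

```shell
#!/bin/sh
# Installation path from this article.
SPARK_DIR=/usr/local/spark/spark-1.6.0-bin-hadoop2.6

# Print the hostnames in a slaves file, skipping comments and blank lines.
list_workers() {
  sed -e 's/#.*//' -e 's/[[:space:]]//g' -e '/^$/d' "$1"
}

# Print one ssh check per worker (dry run; nothing is executed here).
check_commands() {
  list_workers "$1" | while read -r host; do
    echo "ssh root@${host} test -d ${SPARK_DIR}"
  done
}

if [ -f "${SPARK_DIR}/conf/slaves" ]; then
  check_commands "${SPARK_DIR}/conf/slaves"
fi
```

Each printed command exits with a non-zero status when the directory is missing on that worker, so the output can be reviewed first and then piped to sh.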
Create a historyserverforSpark directory on HDFS:
root@master1:/usr/local# hdfs dfs -mkdir /historyserverforSpark
16/01/24 07:46:24 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
root@master1:/usr/local# hdfs dfs -ls /
16/01/24 07:46:31 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 1 items
drwxr-xr-x - root supergroup 0 2016-01-24 07:46 /historyserverforSpark
The directory we created can also be viewed in a browser.
3. Start Spark
root@master1:/usr/local/spark/spark-1.6.0-bin-hadoop2.6/sbin# ./start-all.sh
starting org.apache.spark.deploy.master.Master, logging to /usr/local/spark/spark-1.6.0-bin-hadoop2.6/logs/spark-root-org.apache.spark.deploy.master.Master-1-master1.out
worker4: starting org.apache.spark.deploy.worker.Worker, logging to /usr/local/spark/spark-1.6.0-bin-hadoop2.6/logs/spark-root-org.apache.spark.deploy.worker.Worker-1-worker4.out
worker4: failed to launch org.apache.spark.deploy.worker.Worker:
worker4: full log in /usr/local/spark/spark-1.6.0-bin-hadoop2.6/logs/spark-root-org.apache.spark.deploy.worker.Worker-1-worker4.out
worker1: starting org.apache.spark.deploy.worker.Worker, logging to /usr/local/spark/spark-1.6.0-bin-hadoop2.6/logs/spark-root-org.apache.spark.deploy.worker.Worker-1-worker1.out
worker3: starting org.apache.spark.deploy.worker.Worker, logging to /usr/local/spark/spark-1.6.0-bin-hadoop2.6/logs/spark-root-org.apache.spark.deploy.worker.Worker-1-worker3.out
worker1: failed to launch org.apache.spark.deploy.worker.Worker:
worker1: full log in /usr/local/spark/spark-1.6.0-bin-hadoop2.6/logs/spark-root-org.apache.spark.deploy.worker.Worker-1-worker1.out
worker3: failed to launch org.apache.spark.deploy.worker.Worker:
worker3: full log in /usr/local/spark/spark-1.6.0-bin-hadoop2.6/logs/spark-root-org.apache.spark.deploy.worker.Worker-1-worker3.out
worker2: starting org.apache.spark.deploy.worker.Worker, logging to /usr/local/spark/spark-1.6.0-bin-hadoop2.6/logs/spark-root-org.apache.spark.deploy.worker.Worker-1-worker2.out
worker2: failed to launch org.apache.spark.deploy.worker.Worker:
worker2: full log in /usr/local/spark/spark-1.6.0-bin-hadoop2.6/logs/spark-root-org.apache.spark.deploy.worker.Worker-1-worker2.out
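The output above suggests that the Worker nodes failed to start, yet the logs show no errors. The cause appears to lie with the virtual machines themselves, though exactly what went wrong is still unclear;
Stop the Spark cluster with ./sbin/stop-all.sh, delete all log files under /usr/local/spark/spark-1.6.0-bin-hadoop2.6/logs on every node, and start again; this time the cluster starts successfully.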
root@master1:/usr/local/spark/spark-1.6.0-bin-hadoop2.6/sbin# ./start-all.sh
starting org.apache.spark.deploy.master.Master, logging to /usr/local/spark/spark-1.6.0-bin-hadoop2.6/logs/spark-root-org.apache.spark.deploy.master.Master-1-master1.out
worker3: starting org.apache.spark.deploy.worker.Worker, logging to /usr/local/spark/spark-1.6.0-bin-hadoop2.6/logs/spark-root-org.apache.spark.deploy.worker.Worker-1-worker3.out
worker2: starting org.apache.spark.deploy.worker.Worker, logging to /usr/local/spark/spark-1.6.0-bin-hadoop2.6/logs/spark-root-org.apache.spark.deploy.worker.Worker-1-worker2.out
worker1: starting org.apache.spark.deploy.worker.Worker, logging to /usr/local/spark/spark-1.6.0-bin-hadoop2.6/logs/spark-root-org.apache.spark.deploy.worker.Worker-1-worker1.out
worker4: starting org.apache.spark.deploy.worker.Worker, logging to /usr/local/spark/spark-1.6.0-bin-hadoop2.6/logs/spark-root-org.apache.spark.deploy.worker.Worker-1-worker4.out
Use the jps command to confirm that the Master and Worker processes are running.
On master1:
root@master1:/usr/local/spark/spark-1.6.0-bin-hadoop2.6/sbin# jps
4551 ResourceManager
7255 Jps
7143 Master
4379 SecondaryNameNode
4175 NameNode
On worker1:
root@worker1:/usr/local/spark/spark-1.6.0-bin-hadoop2.6/logs# jps
4528 Worker
2563 DataNode
2713 NodeManager
4606 Jps
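Reading jps output by eye works for five machines but can be scripted as well. The sketch below tests whether a given process name appears in jps output; has_process is a hypothetical helper, and the sample text is worker1's output copied from above so the example runs without a JVM.

```shell
#!/bin/sh
# Succeed if the jps output on stdin contains a process with the given name.
# has_process is a hypothetical helper, not part of the JDK or Spark.
has_process() {
  awk -v name="$1" '$2 == name { found = 1 } END { exit !found }'
}

# Sample taken from worker1's jps output in this article.
sample='4528 Worker
2563 DataNode
2713 NodeManager
4606 Jps'

if printf '%s\n' "$sample" | has_process Worker; then
  echo "Worker is running"
fi
```

On a real node the same test would read live output, for example jps | has_process Master on master1.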
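Open http://192.168.112.130:8080/ in a browser to view the console; it shows 4 nodes.
At this point the Spark cluster setup is complete!
Start the history-server process, which records how applications ran on the cluster so that earlier run information can still be recovered after a restart.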
root@master1:/usr/local/spark/spark-1.6.0-bin-hadoop2.6/sbin# ./start-history-server.sh
starting org.apache.spark.deploy.history.HistoryServer, logging to /usr/local/spark/spark-1.6.0-bin-hadoop2.6/logs/spark-root-org.apache.spark.deploy.history.HistoryServer-1-master1.out
Open http://192.168.112.130:18080/ to view the History Server.
Run an example: computing pi
Location: /usr/local/spark/spark-1.6.0-bin-hadoop2.6/examples/src/main/scala/org/apache/spark/examples
The source code is as follows:
// scalastyle:off println
package org.apache.spark.examples
import scala.math.random
import org.apache.spark._
/** Computes an approximation to pi */
object SparkPi {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("Spark Pi")
    val spark = new SparkContext(conf)
    val slices = if (args.length > 0) args(0).toInt else 2
    val n = math.min(100000L * slices, Int.MaxValue).toInt // avoid overflow
    val count = spark.parallelize(1 until n, slices).map { i =>
      val x = random * 2 - 1
      val y = random * 2 - 1
      if (x*x + y*y < 1) 1 else 0
    }.reduce(_ + _)
    println("Pi is roughly " + 4.0 * count / n)
    spark.stop()
  }
}
// scalastyle:on println
Set the parallelism (the number of slices) to 5000 so that the job runs long enough to be observed in the browser:
root@master1:/usr/local/spark/spark-1.6.0-bin-hadoop2.6/bin# ./spark-submit --class org.apache.spark.examples.SparkPi --master spark://master1:7077 ../lib/spark-examples-1.6.0-hadoop2.6.0.jar 5000
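In the command above, the trailing 5000 becomes args(0), that is, slices in the SparkPi source. The sketch below repeats the arithmetic of the line val n = math.min(100000L * slices, Int.MaxValue).toInt in shell, to show how large the job is; sample_count is a made-up name, and 2147483647 is the value of Int.MaxValue.

```shell
#!/bin/sh
# Mirror SparkPi's computation of n for a given number of slices.
# sample_count is a hypothetical helper; 2147483647 is Scala's Int.MaxValue.
sample_count() {
  n=$((100000 * $1))
  if [ "$n" -gt 2147483647 ]; then
    n=2147483647
  fi
  echo "$n"
}

slices=5000
n=$(sample_count "$slices")
echo "n = $n, spread over $slices tasks"
```

With 5000 slices, n is 500000000, so each of the 5000 tasks handles roughly 100000 random points (the range 1 until n contains n - 1 elements).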
View the job in the browser:
Result: Pi is roughly 3.14156656
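Judging from the printed log, why does the program start so quickly?
Answer: because Spark uses coarse-grained (Coarse Grained) resource allocation;
Coarse-grained means that resources are allocated once, at the moment the program starts and initializes; later computations simply use those resources, with no need to allocate them again for each computation.
Coarse-grained allocation suits workloads with many jobs that need to reuse resources. One drawback: when parallelism is high and one job runs for a long time while the others finish quickly, resources are wasted, because everything that was allocated stays held until the long one finishes.
Fine-grained means that resources are allocated only when a computation runs and are reclaimed as soon as it finishes.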
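The startup-speed argument can be made concrete with a toy calculation. This is not a model of Spark's scheduler: the cost units, the two function names, and the assumption that tasks run one after another are all invented purely for illustration.

```shell
#!/bin/sh
# Toy cost model, NOT Spark's scheduler: all numbers are made-up time units
# and tasks are assumed to run one after another.
# Arguments for both functions: tasks alloc_cost run_cost

# Coarse-grained: pay the allocation cost once, then run every task.
coarse_cost() { echo $(( $2 + $1 * $3 )); }

# Fine-grained: pay the allocation cost again for every task.
fine_cost() { echo $(( $1 * ($2 + $3) )); }

echo "coarse: $(coarse_cost 5000 10 1)"   # 10 + 5000 * 1
echo "fine:   $(fine_cost 5000 10 1)"     # 5000 * (10 + 1)
```

With these invented numbers the coarse-grained total is 5010 units against 55000 for fine-grained, which is the intuition behind the fast start. The model leaves out the other side of the trade-off: the idle resources that coarse-grained allocation keeps holding.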
Check how the job ran through the History Server:
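Wang Jialin: China's leading Spark expert, dean and chief expert of the Spark Asia-Pacific Research Institute
DT Big Data Dream Factory
Sina Weibo: http://weibo.com.ilovepains/
Mobile: 18610086859
QQ: 1740415547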