How a Spark Cluster Works under HA (DT大数据梦工厂)

Spark High Availability (HA) in Practice

How a Spark Cluster Works, Explained in Detail

Resources here mainly mean memory and CPU.

With a single-point Master, if the Master fails, the whole cluster can no longer serve requests.

Spark implements HA through ZooKeeper. An HA setup generally has one Master at the active level and the others on standby.

The active Master is the one currently doing the work.

A standby Master stands ready at all times; when the active Master goes down, a standby is switched up to the active level.

In the past this was usually two machines: one active, one standby.

Nowadays it is usually three machines: one active and two standbys, or even more than three.

What does ZooKeeper hold? The state of all Workers, Drivers, and Applications. When the active Master goes down, ZooKeeper elects one of the standbys as the leader, and the leader then recovers the Workers, Drivers, and Applications. Only after all of that completes does the leader finish RECOVERING and become ACTIVE, and only once it is ACTIVE can it serve the outside world again and accept normal job submissions.

Will a Master switchover affect applications that are already running on the cluster? Under coarse-grained resource allocation it will not: a program has already requested its resources from the Master before it starts running, and after that the job is only interaction between the Workers and the Executors, with no involvement from the Master, because the resources were already allocated before the switchover. That is the coarse-grained case. Under fine-grained allocation it will.

The drawback of fine-grained mode is that tasks start very slowly. In practice, real-world big-data processing almost always uses coarse-grained mode.
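
The coarse/fine distinction is a property of the resource manager, not of the application code. The standalone cluster built in this post is always coarse-grained; fine-grained scheduling only exists on Mesos, where it is controlled by the spark.mesos.coarse property. A hypothetical submission, just to show where the choice is made (the Mesos URL, class name, and jar name are placeholders, not part of this walkthrough):

# Standalone (this walkthrough): resources are reserved from the Master up front,
# so a Master switchover does not disturb already-running executors.
./spark-submit --master spark://Master:7077,Worker1:7077,Worker2:7077 \
  --class com.example.MyApp my-app.jar

# Mesos only (hypothetical URL): spark.mesos.coarse toggles the granularity.
#   true  -> executors keep their resources for the whole application (fast task launch)
#   false -> every task negotiates resources with Mesos (hence the slow task startup)
./spark-submit --master mesos://mesos-master:5050 \
  --conf spark.mesos.coarse=true \
  --class com.example.MyApp my-app.jar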

Let's get hands-on:

==========Download ZooKeeper============

http://zookeeper.apache.org/

After downloading, extract it under /usr/local/, then configure the environment variables:

export ZOOKEEPER_HOME=/usr/local/zookeeper-3.4.6

export PATH=$PATH:${ZOOKEEPER_HOME}/bin
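
A minimal sketch of applying these, assuming the variables go into the root user's ~/.bashrc (any equivalent profile file works):

echo 'export ZOOKEEPER_HOME=/usr/local/zookeeper-3.4.6' >> ~/.bashrc
echo 'export PATH=$PATH:${ZOOKEEPER_HOME}/bin' >> ~/.bashrc
source ~/.bashrc      # reload so zkServer.sh is on the PATH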

ZooKeeper is installed separately.

Because this is a distributed setup, ZooKeeper has to be deployed on multiple machines.

We will put it on Worker1 and Worker2 as well.

Go into the ZooKeeper directory and create the data and logs directories:

root@Master:/usr/local/zookeeper-3.4.6# mkdir data

root@Master:/usr/local/zookeeper-3.4.6# mkdir logs

root@Master:/usr/local/zookeeper-3.4.6/conf# vi zoo_sample.cfg

dataDir must be changed; otherwise the data will be wiped after a restart (the default in zoo_sample.cfg points under /tmp).

root@Master:/usr/local/zookeeper-3.4.6/conf# cp zoo_sample.cfg zoo.cfg

root@Master:/usr/local/zookeeper-3.4.6/conf# vi zoo.cfg

Edit it as follows (building a three-machine cluster):

dataDir=/usr/local/zookeeper-3.4.6/data

dataLogDir=/usr/local/zookeeper-3.4.6/logs

server.0=Master:2888:3888

server.1=Worker1:2888:3888

server.2=Worker2:2888:3888
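
For reference, the resulting zoo.cfg looks roughly like this; the tickTime, initLimit, syncLimit, and clientPort values are simply the defaults carried over from zoo_sample.cfg:

# /usr/local/zookeeper-3.4.6/conf/zoo.cfg
tickTime=2000                  # basic time unit in milliseconds
initLimit=10                   # ticks a follower may take to connect and sync with the leader
syncLimit=5                    # ticks a follower may lag behind the leader
clientPort=2181                # port clients (the Spark Masters) connect to
dataDir=/usr/local/zookeeper-3.4.6/data
dataLogDir=/usr/local/zookeeper-3.4.6/logs
server.0=Master:2888:3888      # 2888: quorum traffic, 3888: leader election
server.1=Worker1:2888:3888
server.2=Worker2:2888:3888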

root@Master:/usr/local/zookeeper-3.4.6/conf# cd ../data/

Number each machine:

root@Master:/usr/local/zookeeper-3.4.6/data# echo 0 > myid

(Note the space before ">". In the original session the command was typed as "echo 0>myid"; the shell parses "0>" as a file-descriptor redirection, so myid ended up empty and had to be fixed by hand with vi.)

root@Master:/usr/local/zookeeper-3.4.6/data# ls

myid

root@Master:/usr/local/zookeeper-3.4.6/data# cat myid

0

At this point one machine has been configured.

Once the ZooKeeper cluster is configured, when Spark restarts it can sync back all of the previous cluster's information. Otherwise a Spark restart would mean starting everything over from scratch.

root@Master:/usr/local# scp -r ./zookeeper-3.4.6 root@Worker1:/usr/local

root@Master:/usr/local# scp -r ./zookeeper-3.4.6 root@Worker2:/usr/local

Then go into Worker1 and Worker2 and change their myid to 1 and 2 respectively, for example over ssh as sketched below.
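
This can be done by logging in to each machine, or in one go from Master over ssh; a quick sketch (each id must match the server.N entry for that host in zoo.cfg):

ssh root@Worker1 "echo 1 > /usr/local/zookeeper-3.4.6/data/myid"
ssh root@Worker2 "echo 2 > /usr/local/zookeeper-3.4.6/data/myid"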

root@Master:/usr/local/zookeeper-3.4.6# cd bin

root@Master:/usr/local/zookeeper-3.4.6/bin# ls

README.txt    zkCli.cmd  zkEnv.cmd  zkServer.cmd

zkCleanup.sh  zkCli.sh   zkEnv.sh   zkServer.sh

root@Master:/usr/local/zookeeper-3.4.6/bin# zkServer.sh start

Then start ZooKeeper this way on all three machines; a quick status check follows below.
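
Once zkServer.sh start has been run on Master, Worker1, and Worker2, a quick way to confirm the ensemble is up is to ask each node for its status; one node should report itself as leader and the other two as followers. A sketch, assuming passwordless ssh between the nodes (otherwise just run zkServer.sh status on each machine):

for host in Master Worker1 Worker2; do
  ssh root@$host /usr/local/zookeeper-3.4.6/bin/zkServer.sh status
done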

The next step is to make Spark support HA under ZooKeeper.

Configure it in spark-env.sh:

root@Worker1:/usr/local/spark-1.6.0-bin-hadoop2.6/conf# vi spark-env.sh

The maintenance and recovery of the entire cluster's state goes through ZooKeeper; all of the state information is kept there:

export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER -Dspark.deploy.zookeeper.url=Master:2181,Worker1:2181,Worker2:2181 -Dspark.deploy.zookeeper.dir=/spark"

Since the Masters are now managed as a cluster via ZooKeeper, we also need to comment out:

#export SPARK_MASTER_IP=Master
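
Putting the two changes together, the HA-related part of spark-env.sh ends up looking roughly like this (the rest of the file, e.g. memory settings, stays as it was):

# spark-env.sh (HA-related part)
# The Master is now chosen via ZooKeeper, so a fixed Master IP must not be set:
# export SPARK_MASTER_IP=Master

# Recovery mode, the ZooKeeper ensemble, and the znode directory Spark uses for its state:
export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER -Dspark.deploy.zookeeper.url=Master:2181,Worker1:2181,Worker2:2181 -Dspark.deploy.zookeeper.dir=/spark"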

root@Worker1:/usr/local/spark-1.6.0-bin-hadoop2.6/conf# scp spark-env.sh root@Worker1:/usr/local/spark-1.6.0-bin-hadoop2.6/conf/spark-env.sh

spark-env.sh                                  100%  500     0.5KB/s   00:00   

root@Worker1:/usr/local/spark-1.6.0-bin-hadoop2.6/conf# scp spark-env.sh root@Worker2:/usr/local/spark-1.6.0-bin-hadoop2.6/conf/spark-env.sh

spark-env.sh                                  100%  500     0.5KB/s   00:00

Then start Spark with ./start-all.sh (run from $SPARK_HOME/sbin on Master). A quick jps check below shows what ended up running where.
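
A rough sanity check with jps (it lists the running JVM processes; exact output will vary), again assuming passwordless ssh between the nodes:

# On Master: expect Master and Worker (Master is also listed in conf/slaves),
# plus QuorumPeerMain for ZooKeeper
jps

# On Worker1 / Worker2: at this point only Worker and QuorumPeerMain --
# their Master processes have not been started yet (see the next step)
ssh root@Worker1 jps
ssh root@Worker2 jps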

Why does the Master machine have a Master process while Worker1 and Worker2 do not?

root@Master:/usr/local/spark-1.6.0-bin-hadoop2.6/sbin# cd ../conf/

root@Master:/usr/local/spark-1.6.0-bin-hadoop2.6/conf# cat slaves

Master

Worker1

Worker2

Since we only have one Master so far, we have to go to Worker1 and Worker2 and start a Master on each:

root@Worker1:/usr/local/spark-1.6.0-bin-hadoop2.6/bin# cd $SPARK_HOME/sbin

root@Worker1:/usr/local/spark-1.6.0-bin-hadoop2.6/sbin# ./start-master.sh

starting org.apache.spark.deploy.master.Master, logging to /usr/local/spark-1.6.0-bin-hadoop2.6/logs/spark-root-org.apache.spark.deploy.master.Master-1-Worker1.out

root@Worker2:/usr/local/zookeeper-3.4.6/bin# cd $SPARK_HOME/sbin

root@Worker2:/usr/local/spark-1.6.0-bin-hadoop2.6/sbin# ./start-master.sh

starting org.apache.spark.deploy.master.Master, logging to /usr/local/spark-1.6.0-bin-hadoop2.6/logs/spark-root-org.apache.spark.deploy.master.Master-1-Worker2.out
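
As a side note, Spark's recovery state now lives in ZooKeeper under the directory set by spark.deploy.zookeeper.dir (/spark here); you can peek at it with the ZooKeeper CLI. The exact child znode names can vary between Spark versions, so treat this as a rough check:

# From any node with ZooKeeper installed
zkCli.sh -server Master:2181,Worker1:2181,Worker2:2181
# then inside the zkCli shell:
ls /spark        # should list nodes for leader election and persisted master state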

Check the status at http://master:8080/

http://worker1:8080/

It has no Workers and is in STANDBY state, purely a backup.

http://worker2:8080/

Again no Workers, STANDBY state, purely a backup.

Use spark-shell to test HA:

root@Master:/usr/local/spark-1.6.0-bin-hadoop2.6/conf# cd $SPARK_HOME/bin

root@Master:/usr/local/spark-1.6.0-bin-hadoop2.6/bin# ./spark-shell --master spark://Master:7077,Worker1:7077,Worker2:7077

The startup log shows registration attempts with all three Masters:

16/02/04 21:12:29 INFO client.AppClient$ClientEndpoint: Connecting to master spark://Master:7077...

16/02/04 21:12:29 INFO client.AppClient$ClientEndpoint: Connecting to master spark://Worker1:7077...

16/02/04 21:12:29 INFO client.AppClient$ClientEndpoint: Connecting to master spark://Worker2:7077...

But it actually only interacts with the machine that is currently active.

Now, on the active Master, cd $SPARK_HOME/sbin and run

./stop-master.sh

You can then see activity in the spark-shell:

scala> 16/02/04 21:17:11 WARN client.AppClient$ClientEndpoint: Connection to Master:7077 failed; waiting for master to reconnect...

16/02/04 21:17:11 WARN cluster.SparkDeploySchedulerBackend: Disconnected from Spark cluster! Waiting for reconnection...

16/02/04 21:17:11 WARN client.AppClient$ClientEndpoint: Connection to Master:7077 failed; waiting for master to reconnect...

16/02/04 21:17:44 INFO client.AppClient$ClientEndpoint: Master has changed, new master is at spark://Worker1:7077

It took about half a minute for Worker1 to become active; the switch is not instantaneous, and how long it takes depends on the size of the cluster.

http://master:8080/ can no longer be reached.

http://worker1:8080/

http://worker2:8080/ still looks the same as before.

Run the Pi example:

./spark-submit  --class org.apache.spark.examples.SparkPi --master spark://Master:7077,Worker1:7077,Worker2:7077 ../lib/spark-examples-1.6.0-hadoop2.6.0.jar 100

Then start the Master process on the Master machine again, as sketched below.
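
The same start-master.sh used on the Workers above, now run on the Master machine:

root@Master:/usr/local/spark-1.6.0-bin-hadoop2.6/sbin# ./start-master.sh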

This time the Master machine comes up in standby mode.

http://master:8080/

Production machines generally have 100 or 200 GB of memory.


