Spark runs on Docker

Recently I've been reading a lot of Docker documentation and doing quite a few experiments with it. I started wondering whether Spark could run on Docker, so I googled around and found:

1. https://registry.hub.docker.com/u/amplab/spark-master/ and https://github.com/amplab/docker-scripts. These two appear to be the same project, but the former doesn't come with a guide.

2. The official Spark source tree ships a docker subproject.

This article uses the docker subproject that ships with Spark to build a Spark cluster.


Installing Docker is not covered here; please refer to the official documentation.


Let's take a look at the directory layout of the docker subproject:

docker
├── build
├── readme.md
└── spark-test
    ├── build
    ├── readme.md
    ├── base
    │   └── Dockerfile
    ├── master
    │   ├── Dockerfile
    │   └── default_cmd
    └── worker
        ├── Dockerfile
        └── default_cmd

From this layout you can see that three images will be built: base, master, and worker.


docker/build: this script calls the build script under spark-test. The only check it performs itself is that docker commands can be run without sudo:

docker images > /dev/null || { echo Please install docker in non-sudo mode. ; exit; }

./spark-test/build

If that check fails, here is how to fix it:

1. If a docker group does not exist yet, create one:

sudo groupadd docker

2. Add your user to the group, then log out and log back in for the change to take effect:

sudo gpasswd -a ${USER} docker

3. Restart docker:

sudo service docker restart

Done!
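After logging back in, you can confirm the change took effect by running a docker command as your normal user; it should work without a permission error:

# should succeed without sudo (an empty list is fine)
docker images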


docker/spark-test/build: this script invokes the Dockerfile under each subdirectory to build the three images:

docker build -t spark-test-base spark-test/base/
docker build -t spark-test-master spark-test/master/
docker build -t spark-test-worker spark-test/worker/
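
With the permission check out of the way, building everything is just a matter of running the top-level script from the docker directory of the Spark source tree and verifying that the three images exist (a sketch; replace /path/to/spark with wherever your source tree is checked out):

cd /path/to/spark/docker
./build
docker images | grep spark-test    # should list spark-test-base, -master and -worker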

docker/spark-test/readme.md

Spark Docker files usable for testing and development purposes.

These images are intended to be run like so:

	docker run -v $SPARK_HOME:/opt/spark spark-test-master
	docker run -v $SPARK_HOME:/opt/spark spark-test-worker spark://<master_ip>:7077

Using this configuration, the containers will have their Spark directories
mounted to your actual `SPARK_HOME`, allowing you to modify and recompile
your Spark source and have them immediately usable in the docker images
(without rebuilding them).

Some explanation is needed here.

On the host machine:

Install Scala 2.10.x.

Install Spark 1.1.0.

Run sudo vi /etc/profile, add the $SPARK_HOME, $SCALA_HOME, $SPARK_BIN and $SCALA_BIN environment variable settings, then run source /etc/profile.
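
A minimal sketch of what those /etc/profile additions might look like, assuming Scala was unpacked to /opt/scala-2.10.4 and the Spark 1.1.0 tree lives in /opt/spark-1.1.0 (adjust the paths to your actual install locations):

export SCALA_HOME=/opt/scala-2.10.4
export SPARK_HOME=/opt/spark-1.1.0
export SCALA_BIN=$SCALA_HOME/bin
export SPARK_BIN=$SPARK_HOME/bin
export PATH=$SCALA_BIN:$SPARK_BIN:$PATH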

Open a separate shell and start the master. The master prints its IP address to the screen; you will need this address when starting the worker.

Likewise, open another shell and start the worker. Both the master and the worker print some log output to the screen, and neither shell window is interactive. If you want an interactive session, you can modify the default_cmd shown below to start the process in the background, or log in to the master and worker over ssh.
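
Concretely, the two shells look roughly like this (a sketch; 172.17.0.2 is only an example of the address the master prints as CONTAINER_IP, use whatever yours shows):

# shell 1: start the master and note the CONTAINER_IP=... line it prints
docker run -v $SPARK_HOME:/opt/spark spark-test-master

# shell 2: start a worker, pointing it at the master's address
docker run -v $SPARK_HOME:/opt/spark spark-test-worker spark://172.17.0.2:7077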

Once both are up, you can start spark-shell on the host for some basic testing: MASTER=spark://<master_ip>:7077 $SPARK_HOME/bin/spark-shell
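
As a quick smoke test (a sketch, again assuming the master printed 172.17.0.2), you can feed a trivial job to spark-shell and check that it completes:

MASTER=spark://172.17.0.2:7077 $SPARK_HOME/bin/spark-shell <<'EOF'
sc.parallelize(1 to 1000).count()
exit
EOF

The finished job should also show up on the master web UI at http://<master_ip>:8080, the standalone master's default UI port.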


The containers reuse the Spark binaries built on the host (through the volume mount), which is why Spark must be installed on the host.


docker/spark-test/base/Dockerfile

FROM ubuntu:precise

RUN echo "deb http://archive.ubuntu.com/ubuntu precise main universe" > /etc/apt/sources.list
RUN echo "deb http://cz.archive.ubuntu.com/ubuntu precise main" >> /etc/apt/sources.list
RUN echo "deb http://security.ubuntu.com/ubuntu precise-security main universe" >> /etc/apt/sources.list



# export proxy
ENV http_proxy http://www-proxy.xxxx.se:8080
ENV https_proxy http://www-proxy.xxxxx.se:8080
# Upgrade package index
RUN apt-get update

# install a few other useful packages plus Open Jdk 7
RUN apt-get install -y less openjdk-7-jre-headless net-tools vim-tiny sudo openssh-server

ENV SCALA_VERSION 2.10.4
ENV CDH_VERSION cdh4
ENV SCALA_HOME /opt/scala-$SCALA_VERSION
ENV SPARK_HOME /opt/spark
ENV PATH $SPARK_HOME:$SCALA_HOME/bin:$PATH

# Install Scala
ADD http://www.scala-lang.org/files/archive/scala-$SCALA_VERSION.tgz /
RUN (cd / && gunzip < scala-$SCALA_VERSION.tgz)|(cd /opt && tar -xvf -)
RUN rm /scala-$SCALA_VERSION.tgz

If you need to go through a proxy, set it as shown above.

During the build I found that some packages could not be installed, so I added two extra apt sources; the last two source lines above are the ones I added.
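
Once the base image builds, a quick sanity check is to make sure Java and Scala are actually usable inside it (a sketch):

docker run --rm spark-test-base java -version
docker run --rm spark-test-base scala -version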


docker/spark-test/master/Dockerfile

FROM spark-test-base
ADD default_cmd /root/
CMD ["/root/default_cmd"]


docker/spark-test/master/default_cmd

IP=$(ip -o -4 addr list eth0 | perl -n -e 'if (m{inet\s([\d\.]+)\/\d+\s}xms) { print $1 }')
echo "CONTAINER_IP=$IP"
export SPARK_LOCAL_IP=$IP
export SPARK_PUBLIC_DNS=$IP

# Avoid the default Docker behavior of mapping our IP address to an unreachable host name
umount /etc/hosts

/opt/spark/bin/spark-class org.apache.spark.deploy.master.Master -i $IP

By now it should be clear that once the base image is built, master and worker basically just run different commands at startup; as images they differ very little from base.
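
If the CONTAINER_IP line has scrolled away, the master's address can also be looked up from the host; <container_id> below stands for whatever ID docker ps shows for the spark-test-master container:

docker ps
docker inspect --format '{{ .NetworkSettings.IPAddress }}' <container_id>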


docker/spark-test/worker/Dockerfile

FROM spark-test-base
ENV SPARK_WORKER_PORT 8888
ADD default_cmd /root/
ENTRYPOINT ["/root/default_cmd"]

docker/spark-test/worker/default_cmd

IP=$(ip -o -4 addr list eth0 | perl -n -e 'if (m{inet\s([\d\.]+)\/\d+\s}xms) { print $1 }')
echo "CONTAINER_IP=$IP"
export SPARK_LOCAL_IP=$IP
export SPARK_PUBLIC_DNS=$IP

# Avoid the default Docker behavior of mapping our IP address to an unreachable host name
umount /etc/hosts

/opt/spark/bin/spark-class org.apache.spark.deploy.worker.Worker $1
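
One detail worth noting: the master image uses CMD while the worker image uses ENTRYPOINT, so the extra argument passed to docker run after the image name (the spark:// URL) is handed to default_cmd as $1 and forwarded to the Worker class:

# the trailing spark:// URL becomes $1 inside the worker's default_cmd
docker run -v $SPARK_HOME:/opt/spark spark-test-worker spark://<master_ip>:7077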

