I have recently been reading quite a few Docker-related documents and running a number of experiments with it. Wondering whether Spark could be run on Docker, I did some googling and found:
1. https://registry.hub.docker.com/u/amplab/spark-master/ and https://github.com/amplab/docker-scripts appear to be the same project, though the former comes with no guide.
2. The Spark source tree itself provides a docker subproject.
This article uses the docker subproject shipped with Spark to build a Spark cluster.
Installing Docker itself is not covered here; please refer to the official documentation.
Let's take a look at the directory layout of the docker subproject:
docker
---------build
---------readme.md
---------spark-test
----------------------build
----------------------readme.md
----------------------base
-------------------------------dockerfile
----------------------master
-------------------------------dockerfile
-------------------------------default_cmd
----------------------worker
-------------------------------dockerfile
-------------------------------default_cmd
From the directory layout you can see that three images are to be built here: base, master, and worker.
docker/build: this script calls the build script under spark-test; the only thing it does itself is the first check, making sure docker commands can be executed without sudo.
docker images > /dev/null || { echo Please install docker in non-sudo mode. ; exit; }

./spark-test/build
1. If there is no docker group yet, create one:
sudo groupadd docker
2. Add your user to that group; log out and log back in for it to take effect:
sudo gpasswd -a ${USER} docker
3. Restart docker:
sudo service docker restart
Done!
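After these three steps, docker commands should run without sudo. A quick check (this is essentially the test that docker/build performs):

# should list the local images without a permission error
docker images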
docker/spark-test/build: the script here calls the Dockerfiles one level down and builds each image in turn:
docker build -t spark-test-base spark-test/base/
docker build -t spark-test-master spark-test/master/
docker build -t spark-test-worker spark-test/worker/
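Once the builds finish, the three images should be available locally; a quick sanity check (the grep is just for convenience):

# should list spark-test-base, spark-test-master and spark-test-worker
docker images | grep spark-test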
The readme.md explains how these images are meant to be used:

Spark Docker files usable for testing and development purposes.

These images are intended to be run like so:

docker run -v $SPARK_HOME:/opt/spark spark-test-master
docker run -v $SPARK_HOME:/opt/spark spark-test-worker spark://<master_ip>:7077

Using this configuration, the containers will have their Spark directories mounted to your actual `SPARK_HOME`, allowing you to modify and recompile your Spark source and have them immediately usable in the docker images (without rebuilding them).
On the host machine:
Install Scala 2.10.x
Install Spark 1.1.0
sudo vi /etc/profile and add the $SPARK_HOME, $SCALA_HOME, $SPARK_BIN and $SCALA_BIN environment variable settings (a sketch follows below), then source /etc/profile.
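A minimal sketch of the /etc/profile additions, assuming Scala was unpacked to /usr/local/scala-2.10.4 and Spark to /usr/local/spark-1.1.0 (both paths are assumptions; adjust them to your actual install locations):

# assumed install locations; replace with your own
export SCALA_HOME=/usr/local/scala-2.10.4
export SPARK_HOME=/usr/local/spark-1.1.0
export SCALA_BIN=$SCALA_HOME/bin
export SPARK_BIN=$SPARK_HOME/bin
export PATH=$PATH:$SCALA_BIN:$SPARK_BIN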
Open another shell and start the master. The master's IP address is printed on its screen; you will need this IP when starting the worker.
Likewise, open another shell and start the worker. Both the master and the worker print some log output to their screens, and neither shell window is interactive. If you want an interactive display, you can modify the default_cmd shown below so that the daemon starts in the background, or log into the master and worker via ssh.
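Concretely, following the readme above, the two containers can be started like this (the master's default_cmd prints a CONTAINER_IP=... line; that is the address to hand to the worker):

# shell 1: start the master (watch for the CONTAINER_IP=... line)
docker run -v $SPARK_HOME:/opt/spark spark-test-master

# shell 2: start the worker, pointing it at the master's IP
docker run -v $SPARK_HOME:/opt/spark spark-test-worker spark://<master_ip>:7077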
Once both are up, you can launch spark-shell on the host machine for a basic test:
MASTER=spark://<master_ip>:7077 $SPARK_HOME/bin/spark-shell
This starts a shell connected to the cluster.
Note that the containers reuse the Spark binaries built on the host (through the volume mount), which is why Spark needs to be installed on the host.
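A minimal smoke test, assuming the master printed CONTAINER_IP=172.17.0.2 (the actual address will differ on your machine); it just counts a small parallelized collection to confirm the cluster accepts jobs:

# run a trivial job non-interactively by feeding the shell a one-liner
MASTER=spark://172.17.0.2:7077 $SPARK_HOME/bin/spark-shell <<'EOF'
sc.parallelize(1 to 1000).count()
EOF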
docker/spark-test/base/dockerfile
FROM ubuntu:precise
RUN echo "deb http://archive.ubuntu.com/ubuntu precise main universe" > /etc/apt/sources.list
RUN echo "deb http://cz.archive.ubuntu.com/ubuntu precise main" >> /etc/apt/sources.list
RUN echo "deb http://security.ubuntu.com/ubuntu precise-security main universe" >> /etc/apt/sources.list

# export proxy
ENV http_proxy http://www-proxy.xxxx.se:8080
ENV https_proxy http://www-proxy.xxxxx.se:8080

# Upgrade package index
RUN apt-get update

# install a few other useful packages plus Open Jdk 7
RUN apt-get install -y less openjdk-7-jre-headless net-tools vim-tiny sudo openssh-server

ENV SCALA_VERSION 2.10.4
ENV CDH_VERSION cdh4
ENV SCALA_HOME /opt/scala-$SCALA_VERSION
ENV SPARK_HOME /opt/spark
ENV PATH $SPARK_HOME:$SCALA_HOME/bin:$PATH

# Install Scala
ADD http://www.scala-lang.org/files/archive/scala-$SCALA_VERSION.tgz /
RUN (cd / && gunzip < scala-$SCALA_VERSION.tgz)|(cd /opt && tar -xvf -)
RUN rm /scala-$SCALA_VERSION.tgz
During the build I found that some packages could not be installed properly, so I added two extra apt sources; the last two sources in the Dockerfile above are the ones I added.
docker/spark-test/master/dockerfile
FROM spark-test-base
ADD default_cmd /root/
CMD ["/root/default_cmd"]
docker/spark-test/master/default_cmd

IP=$(ip -o -4 addr list eth0 | perl -n -e 'if (m{inet\s([\d\.]+)\/\d+\s}xms) { print $1 }')
echo "CONTAINER_IP=$IP"
export SPARK_LOCAL_IP=$IP
export SPARK_PUBLIC_DNS=$IP

# Avoid the default Docker behavior of mapping our IP address to an unreachable host name
umount /etc/hosts

/opt/spark/bin/spark-class org.apache.spark.deploy.master.Master -i $IP
docker/spark-test/worker/dockerfile
FROM spark-test-base
ENV SPARK_WORKER_PORT 8888
ADD default_cmd /root/
ENTRYPOINT ["/root/default_cmd"]
docker/spark-test/worker/default_cmd

IP=$(ip -o -4 addr list eth0 | perl -n -e 'if (m{inet\s([\d\.]+)\/\d+\s}xms) { print $1 }')
echo "CONTAINER_IP=$IP"
export SPARK_LOCAL_IP=$IP
export SPARK_PUBLIC_DNS=$IP

# Avoid the default Docker behavior of mapping our IP address to an unreachable host name
umount /etc/hosts

/opt/spark/bin/spark-class org.apache.spark.deploy.worker.Worker $1
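As mentioned earlier, one way to get an interactive window is to start the daemon in the background instead. A rough sketch of what the tail of default_cmd could look like in that case (my own variation, not the stock script; the container would also need to be run with -it for the shell to be usable):

# variation on the last line of default_cmd: run the worker in the background,
# keep its output in a log file, and hand the foreground over to a shell
/opt/spark/bin/spark-class org.apache.spark.deploy.worker.Worker $1 > /var/log/spark-worker.log 2>&1 &
exec /bin/bash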