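OS: CentOS 6.2
Account used: root
Spark version: spark-0.8.0-incubating-bin-hadoop1.tgz
JDK version: jdk-7-linux-x64.rpm
Scala version: scala-2.9.3.tgz
Nodes used: (172.18.11.XX) XX=16,17,18,20,21,23,24,25,26,27,28,29,30,31,32,33,34, 17 nodes in total
Master node: hw024. Unless stated otherwise, all of the following operations are performed on hw024.
Notes: (1) I had already installed the whole environment on nodes 24-28, so the for loops below leave out those five nodes. If you are configuring a cluster from scratch, add the hostnames of all nodes to the for loops.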
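(2) The Spark directory must be at the same path on every machine, because the master logs in to each worker to run commands and assumes the worker's Spark path is the same as its own.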
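(a) Run the following on every node:
yum remove selinux* -y (to prevent ssh-keygen from failing)
ssh-keygen -t rsa (then press Enter at every prompt)
This generates a private key id_rsa and a public key id_rsa.pub under /root/.ssh/.
(b) Send the public key id_rsa.pub of every datanode to the namenode: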
cp id_rsa.pub hw016.id_rsa.pub
scp hw016.id_rsa.pub "namenode IP address":/root/.ssh
......
cp id_rsa.pub hw0XX.id_rsa.pub
scp hw0XX.id_rsa.pub "namenode IP address":/root/.ssh
(c) On the namenode, merge all public keys (including its own) and send the result to all nodes:
cp id_rsa.pub authorized_keys (this is the namenode's own public key)
cat hw016.id_rsa.pub >> authorized_keys
......
cat hw0XX.id_rsa.pub >> authorized_keys
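Then use scp to copy the merged authorized_keys file into the .ssh directory of every datanode:
scp authorized_keys "datanode IP address":/root/.ssh
After this, all nodes can log in to one another over SSH without a password. You can verify this with the command
"ssh node-IP-address".
Once configuration is done, run ssh from the namenode to itself and to every datanode once, because after the first connection ssh no longer asks for confirmation.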
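The merge in step (c) can be sketched as a small helper. This is a minimal sketch, assuming all collected hwNNN.id_rsa.pub files already sit in the same directory as the namenode's own id_rsa.pub; merge_pubkeys is a hypothetical name, not part of OpenSSH. Like the cp command above, it overwrites any existing authorized_keys.

```shell
#!/bin/bash
# merge_pubkeys (hypothetical helper): build authorized_keys from the
# namenode's own key plus every collected hwNNN.id_rsa.pub in one directory.
merge_pubkeys() {
    local ssh_dir="$1"
    # Start from the namenode's own public key, then append the collected keys.
    cat "$ssh_dir/id_rsa.pub" "$ssh_dir"/hw*.id_rsa.pub \
        > "$ssh_dir/authorized_keys.new" || return 1
    # Drop duplicate keys so that re-running the helper does not grow the file.
    sort -u "$ssh_dir/authorized_keys.new" > "$ssh_dir/authorized_keys"
    rm -f "$ssh_dir/authorized_keys.new"
    # Keep the file private; sshd may reject key files writable by other users.
    chmod 600 "$ssh_dir/authorized_keys"
}
```

On the namenode this would be called as merge_pubkeys /root/.ssh before the scp step.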
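Note: when copying to all nodes with scp, the commands can be put in a script; the for loops later in this document are written that way.
The contents of the hosts file are as follows: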
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
172.18.11.15 hw015
172.18.11.16 hw016
172.18.11.17 hw017
172.18.11.18 hw018
172.18.11.20 hw020
172.18.11.21 hw021
172.18.11.23 hw023
172.18.11.24 hw024
172.18.11.25 hw025
172.18.11.26 hw026
172.18.11.27 hw027
172.18.11.28 hw028
172.18.11.29 hw029
172.18.11.30 hw030
172.18.11.31 hw031
172.18.11.32 hw032
172.18.11.33 hw033
172.18.11.34 hw034
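Copy it with scp to the /etc/ directory on all nodes.
This step makes later operations easier: when copying other files afterwards, the scripts can use hostnames instead of IP addresses.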
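Because the hostnames above follow the node numbers, the entries can also be generated rather than typed. The following is only a sketch under that assumption (172.18.11.NN maps to hwNNN with three digits); hosts_line and hosts_block are hypothetical helper names.

```shell
#!/bin/bash
# Print one /etc/hosts line for a node number, e.g. 16 -> "172.18.11.16 hw016".
# Pass plain decimal numbers (16, not 016).
hosts_line() {
    printf '172.18.11.%d hw%03d\n' "$1" "$1"
}

# Print the entries for a list of node numbers, one per line.
hosts_block() {
    local n
    for n in "$@"; do
        hosts_line "$n"
    done
}
```

For example, hosts_block 16 17 18 prints the three lines for hw016 to hw018, which can then be appended to /etc/hosts.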
#!/bin/bash
for dir in hw016 hw017 hw018 hw020 hw021 hw023 hw029 hw030 hw031 hw032 hw033 hw034
do
echo cping to $dir
scp -r jdk-7-linux-x64.rpm root@$dir:/home/xxx
done
Go to /home/xxx on each node and run: rpm -ivh jdk-7-linux-x64.rpm
The JDK is installed automatically to /usr/java/jdk1.7.0.
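Since passwordless SSH is already set up, the per-node install does not have to be done by hand. The sketch below assumes the same node list as the for loops in this guide; run_on_nodes and the DRY_RUN switch are hypothetical names introduced here. By default it only prints the ssh commands it would run.

```shell
#!/bin/bash
# Node list taken from the for loops in this guide; extend it for a fresh cluster.
NODES="hw016 hw017 hw018 hw020 hw021 hw023 hw029 hw030 hw031 hw032 hw033 hw034"

# run_on_nodes (hypothetical helper): run one command on every node over ssh.
# With DRY_RUN=1 (the default here) it only prints what it would run.
run_on_nodes() {
    local cmd="$1" node
    for node in $NODES; do
        if [ "${DRY_RUN:-1}" = "1" ]; then
            echo "ssh root@$node $cmd"
        else
            # Report a failure but keep going with the remaining nodes.
            ssh "root@$node" "$cmd" || echo "failed on $node" >&2
        fi
    done
}
```

To actually install the JDK everywhere, one would run DRY_RUN=0 run_on_nodes "cd /home/xxx && rpm -ivh jdk-7-linux-x64.rpm". The same helper fits the later per-node steps, such as creating directories or stopping the firewall.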
Extract scala-2.9.3.tgz under /home/xxx, which creates the scala-2.9.3 directory.
Copy scala-2.9.3 to /usr/lib on every node:
for dir in hw016 hw017 hw018 hw020 hw021 hw023 hw029 hw030 hw031 hw032 hw033 hw034
do
echo cping to $dir
scp -r scala-2.9.3 root@$dir:/usr/lib
done
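(1) Extract spark-0.8.0-incubating-bin-hadoop1.tgz into /home/xxx on every node
Since the extracted Spark directory is copied to every node (see step 3), this is skipped here.
(2) Set environment variables
The environment variables are set all at once later.
(3) Set up the Spark configuration files
Edit the conf/slaves file under the Spark directory as follows: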
hw016
hw017
hw018
hw020
hw021
hw023
hw025
hw026
hw027
hw028
hw029
hw030
hw031
hw032
hw033
hw034
Edit spark-env.sh and add one line:
export SCALA_HOME=/usr/lib/scala-2.9.3
Save and exit.
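If the setup is scripted and may be run more than once, the line can be appended only when it is missing. This is a minimal sketch; ensure_line is a hypothetical helper name, and it creates the file if it does not exist yet.

```shell
#!/bin/bash
# ensure_line (hypothetical helper): append a line to a file only if the
# exact same line is not already present.
ensure_line() {
    local file="$1" line="$2"
    touch "$file"
    # -x: match the whole line, -F: treat the pattern literally, -q: no output
    grep -qxF -- "$line" "$file" || printf '%s\n' "$line" >> "$file"
}
```

From the Spark directory it would be called as ensure_line conf/spark-env.sh 'export SCALA_HOME=/usr/lib/scala-2.9.3'.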
(4) Copy the configured Spark directory to all slave nodes:
#!/bin/bash
#for dir in hw016 hw017 hw018 hw020 hw021 hw023 hw029 hw030 hw031 hw032 hw033 hw034
for dir in hw016 hw017 hw018 hw020 hw021 hw023 hw024 hw025 hw026 hw027 hw028 hw029 hw030 hw031 hw032 hw033 hw034
#for dir in hw025 hw026 hw027 hw028
do
echo cping to $dir
scp -r spark-0.8.0-incubating-bin-hadoop1 root@$dir:/home/xxx
done
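Save and exit. Spark configuration is complete.
Set the environment variables on hw024 and copy them to all nodes in one go:
Open /etc/profile and add the following: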
#scala
export SCALA_HOME=/usr/lib/scala-2.9.3
export PATH=$PATH:$SCALA_HOME/bin
#spark
export SPARK_HOME=/home/xxx/spark-0.8.0-incubating-bin-hadoop1
export PATH=$PATH:$SPARK_HOME/bin
#java
export JAVA_HOME=/usr/java/jdk1.7.0
export JRE_HOME=/usr/java/jdk1.7.0/jre
export CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar:
export PATH=$PATH:$JAVA_HOME/bin:$JAVA_HOME/jre/bin
Copy it to each node with a for loop and scp:
#!/bin/bash
for dir in hw016 hw017 hw018 hw020 hw021 hw023 hw029 hw030 hw031 hw032 hw033 hw034
do
echo cping to $dir
scp -r /etc/profile root@$dir:/etc/profile
done
Run source /etc/profile on each node so that the modified environment variables take effect.
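After sourcing the profile, it is worth confirming that the three home variables point at real directories on each node. The following is a sketch only; check_env is a hypothetical helper name, and it checks just the variables defined in /etc/profile above.

```shell
#!/bin/bash
# check_env (hypothetical helper): report variables from /etc/profile that
# are unset or do not point at an existing directory.
check_env() {
    local name value status=0
    for name in SCALA_HOME SPARK_HOME JAVA_HOME; do
        # Indirect lookup of the variable called $name (names are fixed above).
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "$name is not set"
            status=1
        elif [ ! -d "$value" ]; then
            echo "$name points to a missing directory: $value"
            status=1
        fi
    done
    return $status
}
```

It prints nothing and returns 0 when all three variables are set correctly.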
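In this setup Spark's cluster mode uses Hadoop's HDFS file system, so Hadoop must be installed first.
(1) On node 24, extract hadoop-1.2.1.tar.gz and place the extracted directory under /usr/local/hadoop/
(2) Configure Hadoop:
Edit conf/hadoop-env.sh as follows: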
export JAVA_HOME=/usr/java/jdk1.7.0
export HADOOP_NAMENODE_OPTS="-Dcom.sun.management.jmxremote $HADOOP_NAMENODE_OPTS"
export HADOOP_SECONDARYNAMENODE_OPTS="-Dcom.sun.management.jmxremote $HADOOP_SECONDARYNAMENODE_OPTS"
export HADOOP_DATANODE_OPTS="-Dcom.sun.management.jmxremote $HADOOP_DATANODE_OPTS"
export HADOOP_BALANCER_OPTS="-Dcom.sun.management.jmxremote $HADOOP_BALANCER_OPTS"
export HADOOP_JOBTRACKER_OPTS="-Dcom.sun.management.jmxremote $HADOOP_JOBTRACKER_OPTS"
Edit core-site.xml as follows:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>master.node</name>
<value>hw024</value>
<description>master</description>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/usr/local/hadoop/tmp</value>
<description>local dir</description>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://${master.node}:9000</value>
<description> </description>
</property>
</configuration>
Edit hdfs-site.xml as follows:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>dfs.replication</name>
<value>2</value>
</property>
<property>
<name>dfs.name.dir</name>
<value>${hadoop.tmp.dir}/hdfs/name</value>
<description>local dir</description>
</property>
<property>
<name>dfs.data.dir</name>
<value>${hadoop.tmp.dir}/hdfs/data</value>
<description> </description>
</property>
</configuration>
Edit mapred-site.xml as follows:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>${master.node}:9001</value>
<description> </description>
</property>
<property>
<name>mapred.local.dir</name>
<value>${hadoop.tmp.dir}/mapred/local</value>
<description> </description>
</property>
<property>
<name>mapred.system.dir</name>
<value>/tmp/mapred/system</value>
<description>hdfs dir</description>
</property>
</configuration>
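To double-check what a property is set to in one of these files (for example after copying them to another node), a small lookup can help. This sketch assumes the layout used above, with each name and value tag on its own line; get_prop is a hypothetical helper name. It prints the raw value, so references such as ${master.node} are shown unexpanded; Hadoop substitutes them when it loads the configuration.

```shell
#!/bin/bash
# get_prop (hypothetical helper): print the raw <value> of a named property
# from a Hadoop *-site.xml file that keeps <name> and <value> on separate lines.
get_prop() {
    awk -v key="$2" '
        # Strip the tags from a <name> line and remember whether it matches.
        /<name>/ { gsub(/.*<name>|<\/name>.*/, ""); found = ($0 == key); next }
        # The first <value> line after a matching name is the answer.
        found && /<value>/ { gsub(/.*<value>|<\/value>.*/, ""); print; exit }
    ' "$1"
}
```

For example, get_prop conf/hdfs-site.xml dfs.replication would print 2 for the file shown above.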
(3) Edit the conf/masters file and add the namenode's hostname:
hw024
Edit the slaves file and add the hostnames of all datanodes:
hw016
hw017
hw018
hw020
hw021
hw023
hw024
hw025
hw026
hw027
hw028
hw029
hw030
hw031
hw032
hw033
hw034
(4) Copy the configured Hadoop directory to all nodes:
#!/bin/bash
for dir in hw016 hw017 hw018 hw020 hw021 hw023 hw029 hw030 hw031 hw032 hw033 hw034
do
echo cping to $dir
#scp -r jdk-7-linux-x64.rpm root@$dir:/home/xxx
#scp -r spark-0.8.0-incubating-bin-hadoop1 root@$dir:/home/xxx
#scp -r profile root@$dir:/etc/profile
#scp -r scala-2.9.3.tgz root@$dir:/home/xxx
#scp -r slaves spark-env.sh root@$dir:/home/xxx/spark-0.8.0-incubating-bin-hadoop1/conf
#scp -r hosts root@$dir:/etc
scp -r /usr/local/hadoop/hadoop-1.2.1 root@$dir:/usr/local/hadoop/
#scp -r scala-2.9.3 root@$dir:/usr/lib
done
(5) On every node, create the temporary directories used by Hadoop:
mkdir -p /usr/local/hadoop/tmp/hdfs/name
mkdir -p /usr/local/hadoop/tmp/hdfs/data
mkdir -p /usr/local/hadoop/tmp/mapred/local
mkdir -p /tmp/mapred/system
(6) Stop the firewall on all nodes:
/etc/init.d/iptables stop
(1) Start the Spark cluster
In the Spark directory on hw024, run:
./bin/start-all.sh
(2) Check that the processes have started:
[root@hw024 xxx]# jps
30435 Worker
16223 Jps
9032 SecondaryNameNode
9152 JobTracker
10283 Master
3075 TaskTracker
8811 NameNode
If Spark started normally, jps shows output like the above.
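The jps check can also be automated. The sketch below reads jps output on standard input and lists any expected daemon that is absent; missing_daemons is a hypothetical helper name, and which daemons to expect on a given node is up to you.

```shell
#!/bin/bash
# missing_daemons (hypothetical helper): read `jps` output on stdin and print
# each expected daemon name (given as arguments) that does not appear in it.
missing_daemons() {
    local seen name
    # The second column of jps output is the process name.
    seen=$(awk '{ print $2 }')
    for name in "$@"; do
        printf '%s\n' "$seen" | grep -qx -- "$name" || echo "$name"
    done
}
```

On hw024 one might run jps | missing_daemons Master Worker NameNode JobTracker; empty output means all of them are running.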
Open the web UI on hw024 (http://172.18.11.24:8080); you should now see all worker nodes, along with their CPU counts and memory information.
(3) Run the examples that ship with Spark:
Run SparkPi
$ ./run-example org.apache.spark.examples.SparkPi spark://hw024:7077
Run SparkKMeans
$ ./run-example org.apache.spark.examples.SparkKMeans spark://hw024:7077 ./kmeans_data.txt 2 1
Run wordcount
$ cd /home/xxx/spark-0.8.0-incubating-bin-hadoop1
$ hadoop fs -put README.md ./
$ MASTER=spark://master:7077 ./spark-shell
scala> val file = sc.textFile("hdfs://master:9000/user/dev/README.md")
scala> val count = file.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
scala> count.collect()
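Note: two places in the wordcount example must be changed to match your Hadoop setup:
(1) MASTER=spark://master:7077 ./spark-shell: replace master with the hostname of your namenode
(2) scala> val file = sc.textFile("hdfs://master:9000/user/dev/README.md"): replace master here as well, and change the port 9000 and the path /user/dev to match your configuration (hadoop fs -put README.md ./ uploads to the HDFS home directory of the current user)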
Main references:
1. http://www.yanjiuyanjiu.com/blog/20140202/
2. http://www.yanjiuyanjiu.com/blog/20131017/