Spark official documentation: http://spark.apache.org/docs/latest/running-on-yarn.html
To deploy Spark on a YARN cluster, Hadoop's YARN must be started first; Spark applications are then submitted to the YARN cluster to run.
The following steps must be performed on every machine.
Edit the hosts file. There is one master node and two worker nodes:
10.35.57.12 namenode1
10.35.57.10 datanode1
10.35.14.49 datanode2
Configure passwordless SSH between the nodes by exchanging public keys, so that starting the Hadoop cluster does not repeatedly prompt for passwords.
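A minimal sketch of one way to set this up, assuming the root account is used on all three hosts (run on each node, keeping the ssh-copy-id lines for every peer):
ssh-keygen -t rsa            # press Enter to accept the defaults
ssh-copy-id root@namenode1
ssh-copy-id root@datanode1
ssh-copy-id root@datanode2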
Edit /root/.bash_profile so that the environment variables take effect permanently for that user, and add:
export HADOOP_HOME=/opt/hadoop-2.6.0
export HADOOP_PID_DIR=/data/hadoop/pids
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="$HADOOP_OPTS -Djava.library.path=$HADOOP_HOME/lib/native"
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export HDFS_CONF_DIR=$HADOOP_HOME/etc/hadoop
export YARN_CONF_DIR=$HADOOP_HOME/etc/hadoop
export JAVA_LIBRARY_PATH=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
export SPARK_HOME=/opt/spark-1.3.1-bin-hadoop2.6
export PATH=$PATH:$SPARK_HOME/bin
After saving, apply the changes with:
# source ~/.bash_profile
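To confirm the variables are picked up, a quick check (output will vary with your installation):
echo $HADOOP_HOME
hadoop version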
mkdir -p /data/hadoop/{pids,storage}
mkdir -p /data/hadoop/storage/{hdfs,tmp}
mkdir -p /data/hadoop/storage/hdfs/{name,data}
Change to the hadoop-2.6.0/etc/hadoop directory and edit the following configuration files.
core-site.xml
<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://namenode1:9000</value>
    </property>
    <property>
        <name>io.file.buffer.size</name>
        <value>131072</value>
    </property>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>file:/data/hadoop/storage/tmp</value>
    </property>
    <property>
        <name>hadoop.proxyuser.hadoop.hosts</name>
        <value>*</value>
    </property>
    <property>
        <name>hadoop.proxyuser.hadoop.groups</name>
        <value>*</value>
    </property>
    <property>
        <name>hadoop.native.lib</name>
        <value>true</value>
    </property>
</configuration>
The hostname namenode1 here must match the entry in the hosts file.
hdfs-site.xml
<configuration>
    <property>
        <name>dfs.namenode.secondary.http-address</name>
        <value>namenode1:50090</value>
    </property>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>file:/data/hadoop/storage/hdfs/name</value>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>file:/data/hadoop/storage/hdfs/data</value>
    </property>
    <property>
        <name>dfs.replication</name>
        <value>3</value>
    </property>
    <property>
        <name>dfs.webhdfs.enabled</name>
        <value>true</value>
    </property>
</configuration>
mapred-site.xml
<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
    <property>
        <name>mapreduce.jobhistory.address</name>
        <value>namenode1:10020</value>
    </property>
    <property>
        <name>mapreduce.jobhistory.webapp.address</name>
        <value>namenode1:19888</value>
    </property>
</configuration>
yarn-site.xml
<configuration>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <property>
        <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
        <value>org.apache.hadoop.mapred.ShuffleHandler</value>
    </property>
    <property>
        <name>yarn.resourcemanager.scheduler.address</name>
        <value>namenode1:8030</value>
    </property>
    <property>
        <name>yarn.resourcemanager.resource-tracker.address</name>
        <value>namenode1:8031</value>
    </property>
    <property>
        <name>yarn.resourcemanager.address</name>
        <value>namenode1:8032</value>
    </property>
    <property>
        <name>yarn.resourcemanager.admin.address</name>
        <value>namenode1:8033</value>
    </property>
    <property>
        <name>yarn.resourcemanager.webapp.address</name>
        <value>namenode1:80</value>
    </property>
</configuration>
The environment scripts hadoop-env.sh and yarn-env.sh both need the following content added:
export JAVA_HOME=/usr/lib/jvm/java-1.7.0-openjdk.x86_64
export CLASS_PATH=$JAVA_HOME/lib:$JAVA_HOME/jre/lib
export HADOOP_HOME=/opt/hadoop-2.6.0
export HADOOP_PID_DIR=/data/hadoop/pids
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="$HADOOP_OPTS -Djava.library.path=$HADOOP_HOME/lib/native"
export HADOOP_PREFIX=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export HDFS_CONF_DIR=$HADOOP_HOME/etc/hadoop
export YARN_CONF_DIR=$HADOOP_HOME/etc/hadoop
export JAVA_LIBRARY_PATH=$HADOOP_HOME/lib/native
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
Set the Java-related variables according to your own machine.
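If you are unsure where the JDK lives on a given machine, the following usually reveals the real path behind the java command (assuming java is already on the PATH):
readlink -f $(which java)    # strip the trailing /jre/bin/java or /bin/java to get JAVA_HOME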
Edit the slaves file (create it if it does not exist) and add:
datanode1
datanode2
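Since the same configuration is needed on every machine, one way to avoid repeating all of these edits is to finish them on namenode1 and then copy the config directory to the other nodes, for example:
scp -r /opt/hadoop-2.6.0/etc/hadoop root@datanode1:/opt/hadoop-2.6.0/etc/
scp -r /opt/hadoop-2.6.0/etc/hadoop root@datanode2:/opt/hadoop-2.6.0/etc/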
On the namenode1 machine, run:
hdfs namenode -format
./sbin/start-all.sh
First format HDFS, then start the cluster and the YARN resource manager (alternatively, run start-dfs.sh and start-yarn.sh separately).
Note that formatting HDFS generates a VERSION file in the corresponding storage directory; the clusterID must be identical on every machine, otherwise the DataNodes will fail to start.
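A quick way to compare the IDs, using the storage paths configured above (the current/ directories appear after formatting and the first start):
grep clusterID /data/hadoop/storage/hdfs/name/current/VERSION    # on namenode1
grep clusterID /data/hadoop/storage/hdfs/data/current/VERSION    # on datanode1 and datanode2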
Startup output:
[root@cass12 hadoop]# start-all.sh
This script is Deprecated. Instead use start-dfs.sh and start-yarn.sh
15/05/13 10:45:26 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Starting namenodes on [namenode1]
namenode1: starting namenode, logging to /opt/hadoop-2.6.0/logs/hadoop-root-namenode-cass12.out
datanode2: starting datanode, logging to /opt/hadoop-2.6.0/logs/hadoop-root-datanode-localhost.localdomain.out
datanode1: starting datanode, logging to /opt/hadoop-2.6.0/logs/hadoop-root-datanode-localhost.localdomain.out
Starting secondary namenodes [namenode1]
namenode1: starting secondarynamenode, logging to /opt/hadoop-2.6.0/logs/hadoop-root-secondarynamenode-cass12.out
15/05/13 10:45:45 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
starting yarn daemons
starting resourcemanager, logging to /opt/hadoop-2.6.0/logs/yarn-root-resourcemanager-cass12.out
datanode2: starting nodemanager, logging to /opt/hadoop-2.6.0/logs/yarn-root-nodemanager-localhost.localdomain.out
datanode1: starting nodemanager, logging to /opt/hadoop-2.6.0/logs/yarn-root-nodemanager-localhost.localdomain.out
Check on the namenode1 host:
[root@cass12 hadoop]# jps
23658 Jps
23245 SecondaryNameNode
23400 ResourceManager
23055 NameNode
Check on the datanode1 and datanode2 machines:
[root@localhost bin]# jps
22696 Jps
22552 NodeManager
22445 DataNode
[root@localhost bin]# jps
6919 DataNode
7023 NodeManager
7158 Jps
At this point the master and both worker nodes have started successfully.
Use a browser to check the status of the Hadoop cluster:
open 10.35.57.12:50070 for the HDFS overview,
and 10.35.57.12:80 for YARN application management (the port set by yarn.resourcemanager.webapp.address above).
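The cluster state can also be checked from the command line on namenode1, for example:
hdfs dfsadmin -report    # lists the live DataNodes and their capacity
yarn node -list          # lists the NodeManagers registered with the ResourceManager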
To stop the cluster, run stop-all.sh on the namenode1 machine.
The following steps must be performed on every machine.
Change to Spark's conf directory.
Rename spark-env.sh.template to spark-env.sh, configure the environment variables, and add:
export JAVA_HOME=/usr/lib/jvm/java-1.7.0-openjdk.x86_64
export HADOOP_HOME=/opt/hadoop-2.6.0
SPARK_LOCAL_DIR="/data/spark/tmp"
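Since SPARK_LOCAL_DIR points to /data/spark/tmp, that directory presumably has to exist on every node:
mkdir -p /data/spark/tmp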
Rename slaves.template to slaves, configure the worker nodes, and add:
datanode1
datanode2
Rename log4j.properties.template to log4j.properties.
You can either start the Spark cluster in standalone mode, or submit applications directly to the YARN cluster and let Spark do the computation:
$ ./bin/spark-submit --class org.apache.spark.examples.SparkPi \
    --master yarn-cluster \
    --num-executors 3 \
    --driver-memory 4g \
    --executor-memory 2g \
    --executor-cores 1 \
    --queue thequeue \
    lib/spark-examples*.jar \
    10
Alternatively, open the Spark shell in yarn-client mode:
$ ./bin/pyspark --master yarn-client
Note that when the pyspark interactive shell is opened on any worker node, code that reads a file automatically looks for it under the HDFS directory of the master node.
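A small illustration of this behaviour (the file and path names here are hypothetical): a file must first be uploaded into HDFS, and a bare path passed to sc.textFile() then resolves against fs.defaultFS (hdfs://namenode1:9000), not the local file system of the node running the shell.
$ hdfs dfs -mkdir -p /tmp/input
$ hdfs dfs -put /etc/hosts /tmp/input/hosts.txt
$ ./bin/pyspark --master yarn-client
>>> sc.textFile("/tmp/input/hosts.txt").count()   # reads hdfs://namenode1:9000/tmp/input/hosts.txt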
For further details, see the official documentation.