Deploying Spark on a YARN Cluster

Official Spark documentation: http://spark.apache.org/docs/latest/running-on-yarn.html

Environment

  • Linux (CentOS)
  • hadoop-2.6.0
  • spark-1.3.1-bin-hadoop2.6
  • JDK 1.7, Python 2.6

To run Spark in YARN mode, YARN must first be started in Hadoop; applications are then submitted from Spark to run on the YARN cluster.

Step 1: Set up the Hadoop cluster

The following steps must be performed on every machine.

1. Cluster node layout

Edit the hosts file (/etc/hosts) to list one master node and two worker nodes:

10.35.57.12  namenode1
10.35.57.10  datanode1
10.35.14.49  datanode2
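
Because every later step refers to the nodes by name, it can help to confirm that each machine's hosts file really contains all three entries. The following is a minimal sketch; the function name missing_hosts is hypothetical and not part of Hadoop or Spark.

```shell
# Print each given hostname that does NOT appear in a hosts-format file.
# Usage: missing_hosts FILE NAME...
missing_hosts() (
    file=$1
    shift
    for name in "$@"; do
        # Strip comments, then look for the name among fields 2..NF
        # (the hostname and any aliases that follow the IP address).
        if ! awk -v n="$name" '
                { sub(/#.*/, "") }
                { for (i = 2; i <= NF; i++) if ($i == n) found = 1 }
                END { exit found ? 0 : 1 }' "$file"; then
            echo "$name"
        fi
    done
)

# Example: missing_hosts /etc/hosts namenode1 datanode1 datanode2
```

If the example call prints nothing, all three names are present on that machine.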

2. Configure passwordless SSH login

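Set up passwordless SSH between all nodes by appending each node's public key to ~/.ssh/authorized_keys on the other nodes (the private keys do not need to be copied). This avoids repeated password prompts when the Hadoop cluster is started.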
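
One possible way to do this uses the standard OpenSSH tools ssh-keygen and ssh-copy-id. The sketch below assumes the root account and the three hostnames from the hosts file; the DRY_RUN, NODES and KEY variables and both function names are illustrative. By default it only prints the commands it would run.

```shell
# Dry run by default: commands are printed, not executed.
# Set DRY_RUN=0 to actually run them.
DRY_RUN=${DRY_RUN:-1}
NODES=${NODES:-"namenode1 datanode1 datanode2"}   # hostnames from the hosts file
KEY=${KEY:-$HOME/.ssh/id_rsa}                     # assumed key location

run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

setup_passwordless_ssh() {
    # Generate a key pair once, with an empty passphrase.
    if [ ! -f "$KEY" ]; then
        run ssh-keygen -t rsa -N "" -f "$KEY"
    fi
    # Append the public key to authorized_keys on every node
    # (each node asks for its password one last time).
    for node in $NODES; do
        run ssh-copy-id -i "$KEY.pub" "root@$node"
    done
}

setup_passwordless_ssh
```

Run it on each machine in turn so that every node can reach every other node without a password.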

3. Configure environment variables

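Edit /root/.bash_profile so the settings persist across logins for the root user (to apply them to all users, put them in /etc/profile instead):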

export HADOOP_HOME=/opt/hadoop-2.6.0
export HADOOP_PID_DIR=/data/hadoop/pids
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="$HADOOP_OPTS -Djava.library.path=$HADOOP_HOME/lib/native"

export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME

export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export HDFS_CONF_DIR=$HADOOP_HOME/etc/hadoop
export YARN_CONF_DIR=$HADOOP_HOME/etc/hadoop

export JAVA_LIBRARY_PATH=$HADOOP_HOME/lib/native

export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
export SPARK_HOME=/opt/spark-1.3.1-bin-hadoop2.6
export PATH=$PATH:$SPARK_HOME/bin

Save the file, then run source ~/.bash_profile to apply the changes.
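
After sourcing the profile, a quick check that the key variables are actually set can save debugging time later. This is a small sketch; the helper name unset_vars is hypothetical.

```shell
# Print the name of every listed variable that is unset or empty.
# Arguments must be plain variable names.
unset_vars() (
    for name in "$@"; do
        eval "value=\${$name:-}"
        [ -n "$value" ] || echo "$name"
    done
)

# Prints nothing when all four variables are set.
unset_vars HADOOP_HOME HADOOP_CONF_DIR YARN_CONF_DIR SPARK_HOME
```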

4. Create the required directories

mkdir -p /data/hadoop/{pids,storage}
mkdir -p /data/hadoop/storage/{hdfs,tmp}
mkdir -p /data/hadoop/storage/hdfs/{name,data}

5. Configuration files

Change to the hadoop-2.6.0/etc/hadoop directory.
core-site.xml

<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://namenode1:9000</value>
    </property>
    <property>
        <name>io.file.buffer.size</name>
        <value>131072</value>
    </property>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>file:/data/hadoop/storage/tmp</value>
    </property>
    <property>
        <name>hadoop.proxyuser.hadoop.hosts</name>
        <value>*</value>
    </property>
    <property>
        <name>hadoop.proxyuser.hadoop.groups</name>
        <value>*</value>
    </property>
    <property>
        <name>hadoop.native.lib</name>
        <value>true</value>
    </property>
</configuration>

The hostname namenode1 used here must match the master's entry in the hosts file.

hdfs-site.xml

<configuration>
    <property>
        <name>dfs.namenode.secondary.http-address</name>
        <value>namenode1:50090</value>
    </property>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>file:/data/hadoop/storage/hdfs/name</value>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>file:/data/hadoop/storage/hdfs/data</value>
    </property>
    <property>
        <name>dfs.replication</name>
        <value>3</value>
    </property>
    <property>
        <name>dfs.webhdfs.enabled</name>
        <value>true</value>
    </property>
</configuration>

mapred-site.xml

<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
    <property>
        <name>mapreduce.jobhistory.address</name>
        <value>namenode1:10020</value>
    </property>

    <property>
        <name>mapreduce.jobhistory.webapp.address</name>
        <value>namenode1:19888</value>
    </property>
</configuration>

yarn-site.xml

<configuration>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <property>
        <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
        <value>org.apache.hadoop.mapred.ShuffleHandler</value>
    </property>
    <property>
        <name>yarn.resourcemanager.scheduler.address</name>
        <value>namenode1:8030</value>
    </property>
    <property>
        <name>yarn.resourcemanager.resource-tracker.address</name>
        <value>namenode1:8031</value>
    </property>
    <property>
        <name>yarn.resourcemanager.address</name>
        <value>namenode1:8032</value>
    </property>
    <property>
        <name>yarn.resourcemanager.admin.address</name>
        <value>namenode1:8033</value>
    </property>
    <property>
        <name>yarn.resourcemanager.webapp.address</name>
        <value>namenode1:80</value>
    </property>
</configuration>

6. Configure hadoop-env.sh, mapred-env.sh, and yarn-env.sh

Add the following at the top of all three files:

export JAVA_HOME=/usr/lib/jvm/java-1.7.0-openjdk.x86_64 
export CLASS_PATH=$JAVA_HOME/lib:$JAVA_HOME/jre/lib

export HADOOP_HOME=/opt/hadoop-2.6.0
export HADOOP_PID_DIR=/data/hadoop/pids
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="$HADOOP_OPTS -Djava.library.path=$HADOOP_HOME/lib/native"

export HADOOP_PREFIX=$HADOOP_HOME

export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME

export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export HDFS_CONF_DIR=$HADOOP_HOME/etc/hadoop
export YARN_CONF_DIR=$HADOOP_HOME/etc/hadoop

export JAVA_LIBRARY_PATH=$HADOOP_HOME/lib/native

export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin

Adjust the Java-related variables to match your own machine.

7. Configure the data nodes

Edit the slaves file (create it if it does not exist) and add:

datanode1
datanode2

8. Start the Hadoop cluster

On the namenode1 machine:

hdfs namenode -format
./sbin/start-all.sh

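Format HDFS first, then start the cluster together with YARN resource management (alternatively, run start-dfs.sh and start-yarn.sh as two separate steps).
Note that formatting HDFS generates a VERSION file in the corresponding storage directory. The clusterID in that file must be identical on all machines; otherwise the nodes will fail to start.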
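
To compare clusterIDs, the VERSION files can be read directly. The sketch below assumes the usual Hadoop 2.x layout, in which the file sits in a current/ subdirectory of the name and data directories configured above; the function names are hypothetical.

```shell
# Print the clusterID recorded in a Hadoop VERSION file.
cluster_id() {
    sed -n 's/^clusterID=//p' "$1"
}

# Succeed only if every VERSION file given has the same, non-empty clusterID.
same_cluster_id() (
    first=$(cluster_id "$1")
    [ -n "$first" ] || exit 1
    for f in "$@"; do
        [ "$(cluster_id "$f")" = "$first" ] || exit 1
    done
)

# Example (paths are assumptions based on the directories created earlier):
#   on namenode1:  cluster_id /data/hadoop/storage/hdfs/name/current/VERSION
#   on a datanode: cluster_id /data/hadoop/storage/hdfs/data/current/VERSION
```

If the IDs differ, for example after reformatting the namenode, the datanode's VERSION file is the one that is out of date.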
Startup output:

[root@cass12 hadoop]# start-all.sh
This script is Deprecated. Instead use start-dfs.sh and start-yarn.sh
15/05/13 10:45:26 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Starting namenodes on [namenode1]
namenode1: starting namenode, logging to /opt/hadoop-2.6.0/logs/hadoop-root-namenode-cass12.out
datanode2: starting datanode, logging to /opt/hadoop-2.6.0/logs/hadoop-root-datanode-localhost.localdomain.out
datanode1: starting datanode, logging to /opt/hadoop-2.6.0/logs/hadoop-root-datanode-localhost.localdomain.out
Starting secondary namenodes [namenode1]
namenode1: starting secondarynamenode, logging to /opt/hadoop-2.6.0/logs/hadoop-root-secondarynamenode-cass12.out
15/05/13 10:45:45 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
starting yarn daemons
starting resourcemanager, logging to /opt/hadoop-2.6.0/logs/yarn-root-resourcemanager-cass12.out
datanode2: starting nodemanager, logging to /opt/hadoop-2.6.0/logs/yarn-root-nodemanager-localhost.localdomain.out
datanode1: starting nodemanager, logging to /opt/hadoop-2.6.0/logs/yarn-root-nodemanager-localhost.localdomain.out

Check the running processes on the namenode1 host:

[root@cass12 hadoop]# jps
23658 Jps
23245 SecondaryNameNode
23400 ResourceManager
23055 NameNode

Check the running processes on the datanode1 and datanode2 machines:

[root@localhost bin]# jps
22696 Jps
22552 NodeManager
22445 DataNode
[root@localhost bin]# jps
6919 DataNode
7023 NodeManager
7158 Jps
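
The same check can be scripted by filtering the jps output for the daemons each role is expected to run. This is a minimal sketch; the function name missing_daemons is hypothetical.

```shell
# Read `jps` output on stdin and print each expected daemon that is not listed.
missing_daemons() (
    listing=$(cat)
    for daemon in "$@"; do
        # jps prints "<pid> <name>"; compare the second column exactly so that
        # NameNode is not satisfied by SecondaryNameNode.
        echo "$listing" | awk -v d="$daemon" '$2 == d { found = 1 } END { exit !found }' \
            || echo "$daemon"
    done
)

# On namenode1:  jps | missing_daemons NameNode SecondaryNameNode ResourceManager
# On a datanode: jps | missing_daemons DataNode NodeManager
```

No output means every expected daemon is running on that machine.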

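At this point the master and both worker nodes have started successfully.
Use a browser to check the status of the Hadoop cluster.
Open 10.35.57.12:50070 to see the Hadoop overview.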
Open 10.35.57.12:80 to see the Hadoop application manager.

To stop the cluster, run stop-all.sh on the namenode1 machine.

Step 2: Deploy Spark applications on YARN

The following steps must be performed on every machine.

1. Configuration files

Change to Spark's conf directory.
Rename spark-env.sh.template to spark-env.sh, then add the following environment variables:

export JAVA_HOME=/usr/lib/jvm/java-1.7.0-openjdk.x86_64
export HADOOP_HOME=/opt/hadoop-2.6.0
export SPARK_LOCAL_DIRS="/data/spark/tmp"

Rename slaves.template to slaves, then add the worker nodes:

datanode1
datanode2

Rename log4j.properties.template to log4j.properties.

2. Run a program

You can start a Spark cluster in standalone mode, or submit an application directly to the YARN cluster and let Spark do the computation there.

$ ./bin/spark-submit --class org.apache.spark.examples.SparkPi \
    --master yarn-cluster \
    --num-executors 3 \
    --driver-memory 4g \
    --executor-memory 2g \
    --executor-cores 1 \
    --queue thequeue \
    lib/spark-examples*.jar \
    10

Alternatively, open the Spark shell in yarn-client mode:

$ ./bin/pyspark --master yarn-client

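Note that when the interactive pyspark shell is opened on any worker node, code that reads files will automatically look for them in HDFS on the master node, because fs.defaultFS points to hdfs://namenode1:9000.
See the official documentation for details.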
