Setting Up a Fully Distributed Spark Cluster


Step 1: Passwordless SSH Login

For details, see the separate note on SSH login within a LAN.
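A minimal sketch of the key exchange, run on master (assumptions: the same user account exists on every node and openssh-server is already installed):

ssh-keygen -t rsa -P ""      # generate a key pair, accept the default location
ssh-copy-id master           # master also runs worker daemons in this setup, so it needs the key too
ssh-copy-id node1
ssh-copy-id node2
ssh node1                    # verify: should log in without a password prompt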

Add the hosts entries:

vim /etc/hosts
# IP             corresponding hostname
202.4.136.218  master
202.4.136.186  node1
202.4.136.15   node2

Step 2: Download the Required Software

1. Java
2. Scala
3. Hadoop
4. Spark

Step 3: Configure Environment Variables

Make sure the software downloaded in Step 2 is installed at the paths shown below. PYSPARK_PYTHON is set so that the driver and the executors use the same Python version; a mismatch between them causes errors.

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export HADOOP_HOME=/usr/local/hadoop
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
export SPARK_HOME=/usr/local/spark
export PATH=$PATH:$SPARK_HOME/sbin:$SPARK_HOME/bin
export PATH=$PATH:$HADOOP_HOME/etc/hadoop
export PYSPARK_PYTHON=/home/sparknode/anaconda3/bin/python
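These lines typically go into ~/.bashrc (or /etc/profile) on every node; how they are loaded is not specified here, so the following sanity check is only a sketch assuming a Bash shell:

source ~/.bashrc          # reload the environment after editing
java -version             # each command should resolve to the paths configured above
hadoop version
spark-submit --version
echo $PYSPARK_PYTHON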

Step 4: Hadoop and Spark Configuration Files

Hadoop: core-site.xml, hdfs-site.xml, mapred-site.xml, yarn-site.xml, slaves, hadoop-env.sh

#core-site.xml
<configuration>
    <property>
        <name>fs.default.name</name>
        <value>hdfs://master:9000</value>
    </property>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/usr/local/hadoop/tmp</value>
    </property>
</configuration>

#hdfs-site.xml
<configuration>
    <property>
        <name>dfs.namenode.secondary.http-address</name>
        <value>master:50090</value>
    </property>
    <property>
        <name>dfs.replication</name>
        <value>2</value>
    </property>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>file:/usr/local/hadoop/tmp/dfs/name</value>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>file:/usr/local/hadoop/tmp/dfs/data</value>
    </property>
    <property>
        <name>dfs.permissions</name>
        <value>true</value>
    </property>
</configuration>

#mapred-site.xml
<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>

#yarn-site.xml
<configuration>
    <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>master</value>
    </property>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <property>
        <name>yarn.log-aggregation-enable</name>
        <value>true</value>
    </property>
    <property>
        <name>yarn.log-aggregation.retain-seconds</name>
        <value>604800</value>
    </property>
    <property>
        <name>yarn.nodemanager.pmem-check-enabled</name>
        <value>true</value>
    </property>
    <property>
        <name>yarn.nodemanager.vmem-check-enabled</name>
        <value>true</value>
    </property>
</configuration>

#vim slaves
# comment out localhost
#localhost
master
node1
node2
#hadoop-env.sh

# The key line here is JAVA_HOME: if it is not set in this file, starting Hadoop still fails with "JAVA_HOME is not set", even when the environment variable is exported elsewhere.

export HADOOP_IDENT_STRING=$USER
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export HADOOP_COMMON_LIB_NATIVE_DIR="/usr/local/hadoop/lib/native/"
export HADOOP_OPTS="$HADOOP_OPTS -Djava.library.path=/usr/local/hadoop/lib/"
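The Hadoop files above live in /usr/local/hadoop/etc/hadoop and must be identical on every node; one way to push them out from master (a sketch, assuming the same install path on all machines):

scp /usr/local/hadoop/etc/hadoop/{core-site.xml,hdfs-site.xml,mapred-site.xml,yarn-site.xml,slaves,hadoop-env.sh} node1:/usr/local/hadoop/etc/hadoop/
scp /usr/local/hadoop/etc/hadoop/{core-site.xml,hdfs-site.xml,mapred-site.xml,yarn-site.xml,slaves,hadoop-env.sh} node2:/usr/local/hadoop/etc/hadoop/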

Once the configuration files are in place, format the NameNode:

hadoop namenode -format

(If you are only adding a new node, format just that node. If you re-format other nodes as well, the clusterIDs must stay consistent, otherwise the DataNodes will not be found. A crude workaround is to delete /usr/local/hadoop/tmp on every node before re-formatting, provided nothing in tmp still matters; if any node keeps its tmp, its clusterID will no longer match. You can also edit the clusterID by hand in /usr/local/hadoop/tmp/dfs/name/current/VERSION. I have not tried running -format for a new node without touching tmp; worth trying.)
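If DataNodes go missing after a re-format, the clusterIDs can be compared directly (the paths follow the dfs.namenode.name.dir / dfs.datanode.data.dir settings above):

cat /usr/local/hadoop/tmp/dfs/name/current/VERSION    # on master (NameNode)
cat /usr/local/hadoop/tmp/dfs/data/current/VERSION    # on each DataNode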

Start Hadoop:

start-all.sh
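A quick check that the daemons came up (master also runs DataNode/NodeManager because it is listed in slaves):

jps                      # master: NameNode, SecondaryNameNode, ResourceManager, DataNode, NodeManager
ssh node1 jps            # node1/node2: DataNode, NodeManager
hdfs dfsadmin -report    # should list all live DataNodes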

-----------------------------------------------------------------------------------------------------

Spark: spark-env.sh, spark-defaults.conf

#spark-env.sh

export SCALA_HOME=/usr/share/scala
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export HADOOP_HOME=/usr/local/hadoop
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
SPARK_MASTER_IP=master
SPARK_LOCAL_DIRS=/usr/local/spark
#SPARK_DRIVER_MEMORY=6g 
#SPARK_DRIVER_CORES=8
export LD_LIBRARY_PATH=/usr/local/hadoop/lib/native/:$LD_LIBRARY_PATH
#spark-defaults.conf

spark.yarn.jars  hdfs:///usr/local/spark/spark_jars/*

spark-defaults.conf was only added after Spark-on-YARN jobs started failing; see pitfall 4 at the end. The path is an HDFS path, so upload /usr/local/spark/jars to HDFS first.
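A sketch of the upload; the target directory matches the spark.yarn.jars value above:

hdfs dfs -mkdir -p /usr/local/spark/spark_jars
hdfs dfs -put /usr/local/spark/jars/* /usr/local/spark/spark_jars/
hdfs dfs -ls /usr/local/spark/spark_jars | head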

 

Start Spark

Because start-all.sh clashes with Hadoop's script of the same name, it is best to rename /usr/local/spark/sbin/start-all.sh to spark-all.sh.
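A sketch of the rename (renaming stop-all.sh as well is my own addition, since it has the same naming clash):

mv /usr/local/spark/sbin/start-all.sh /usr/local/spark/sbin/spark-all.sh
mv /usr/local/spark/sbin/stop-all.sh  /usr/local/spark/sbin/spark-stop-all.sh

With $SPARK_HOME/sbin on PATH, the renamed script can then be run directly: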

spark-all.sh

---------------------------------------------------------------------------------------------------

Default UI Ports

Hadoop  NameNode   50070   master:50070
Hadoop  YARN        8088   master:8088
Spark   cluster     8080   master:8080
Spark   job         4040   master:4040

-----------------------------------------------------------------------------------------------------------

Pitfalls encountered (updated as new ones come up):

1. After start-up, jps looks normal but the Hadoop UI does not show the worker nodes; a restart sometimes fixes it.

2. Re-running hadoop namenode -format after adding a node made the clusterIDs inconsistent and broke the nodes; see the note above.

3. The core count and memory size shown in the YARN UI do not match the real cluster.

By default YARN assumes 8 cores and 8 GB per machine; if that is not the case, yarn-site.xml needs to be changed.

Add the following (use your machines' actual sizes; memory is in MB, and the file must be changed on every machine):



<property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>4096</value>
</property>
<property>
    <name>yarn.nodemanager.resource.cpu-vcores</name>
    <value>4</value>
</property>
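After yarn-site.xml has been updated on every node, YARN has to be restarted before the new limits show up in the UI (a usage note, not from the original post):

stop-yarn.sh
start-yarn.sh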

4. spark-submit --master yarn --deploy-mode cluster **.py fails when running Spark on the YARN cluster

# Error output
Exception in thread "main" org.apache.spark.SparkException: Application application_1543628881761_0001 finished with failed status
	at org.apache.spark.deploy.yarn.Client.run(Client.scala:1165)
	at org.apache.spark.deploy.yarn.YarnClusterApplication.start(Client.scala:1520)
	at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:894)
	at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:198)
	at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:228)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:137)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

The cause is that the jar files cannot be found.

Upload the jars from the Spark installation directory to HDFS and add that HDFS path to spark-defaults.conf (the same setting as in Step 4):

#spark-defaults.conf

spark.yarn.jars  hdfs:///usr/local/spark/spark_jars/*

5. The YARN UI reports unhealthy nodes

Cause: the disk Hadoop is installed on is probably full. YARN seems to check disk usage every couple of minutes, and once a disk is more than 90% used the logs can no longer be written and the node is flagged as unhealthy.
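Before moving the logs, disk pressure on the suspect node can be confirmed with a quick check (my own addition):

df -h /usr/local/hadoop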

Fix: move the log directories to a disk with enough free space.

# edit hadoop-env.sh and add the log path
export HADOOP_LOG_DIR=/<new-log-path>

# also edit yarn-env.sh and add the log path
export YARN_LOG_DIR=/<new-log-path>

Note: the permissions on the new directory also need to be opened up, otherwise master cannot access it:

sudo chmod -R 777 /<new-log-path>
