7. Running PySpark on Hadoop YARN

1. Copy LICENSE.txt into the local input directory

    cp /usr/local/hadoop/LICENSE.txt ~/wordcount/input

2. Start all the virtual machines

    Refer to the Hadoop cluster setup guide.

3. Start the Hadoop cluster

    start-all.sh

4. Upload the file to HDFS

 (1) Create the directory on HDFS

           hadoop fs -mkdir -p /user/hduser/wordcount/input

 (2) Switch to the ~/wordcount/input data directory

           cd ~/wordcount/input

 (3) Upload the text file to HDFS

           hadoop fs -put LICENSE.txt /user/hduser/wordcount/input

5. Launch PySpark on Hadoop YARN

    HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop pyspark --master yarn --deploy-mode client
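
    Once the shell is up, you can sanity-check the YARN deployment with a quick word count over the LICENSE.txt uploaded in step 4. A minimal sketch, run inside the pyspark shell, using its built-in SparkContext `sc` and the HDFS path created above:

        # Read the file uploaded to HDFS in step 4
        text = sc.textFile("/user/hduser/wordcount/input/LICENSE.txt")
        # Classic word count: split lines into words, count each word
        counts = (text.flatMap(lambda line: line.split())
                      .map(lambda word: (word, 1))
                      .reduceByKey(lambda a, b: a + b))
        print(counts.take(10))  # first 10 (word, count) pairs

    If this returns results, executors are running on YARN. The subsections below cover common startup errors.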

    (1) Warning: Neither spark.yarn.jars nor spark.yarn.archive is set

         1. Create a directory on HDFS

             hdfs dfs -mkdir /spark_jars

         2. Upload the Spark jars to HDFS

             hdfs dfs -put /opt/spark/jars/* /spark_jars

         3. Edit spark-defaults.conf in Spark's conf directory and add:

           spark.yarn.jars=hdfs://master:9000/spark_jars/*
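
           The same property can also be set programmatically when submitting a standalone script instead of using the shell. A minimal sketch, assuming the hdfs://master:9000 address configured above (the app name is arbitrary):

             from pyspark.sql import SparkSession

             # Point YARN at the jars already uploaded to /spark_jars so the
             # Spark runtime is not re-uploaded on every submission.
             spark = (SparkSession.builder
                      .appName("wordcount")
                      .master("yarn")
                      .config("spark.yarn.jars", "hdfs://master:9000/spark_jars/*")
                      .getOrCreate())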

    (2) ERROR YarnClientSchedulerBackend: The YARN application has already ended! It might have been killed or the Application Master may have failed to start. Check the YARN application logs for more details.

         1. Check the current classpath configuration

              hadoop classpath             

              /usr/local/hadoop/etc/hadoop:/usr/local/hadoop/share/hadoop/common/lib/*:/usr/local/hadoop/share/hadoop/common/*:/usr/local/hadoop/share/hadoop/hdfs:/usr/local/hadoop/share/hadoop/hdfs/lib/*:/usr/local/hadoop/share/hadoop/hdfs/*:/usr/local/hadoop/share/hadoop/yarn/lib/*:/usr/local/hadoop/share/hadoop/yarn/*:/usr/local/hadoop/share/hadoop/mapreduce/lib/*:/usr/local/hadoop/share/hadoop/mapreduce/*:/home/hduser/jdk1.8.0_211/lib/tools.jar:/usr/local/hadoop/contrib/capacity-scheduler/*.jar

         2. Add that classpath as yarn.application.classpath in yarn-site.xml


             <property>
               <name>yarn.application.classpath</name>
               <value>/usr/local/hadoop/etc/hadoop:/usr/local/hadoop/share/hadoop/common/lib/*:/usr/local/hadoop/share/hadoop/common/*:/usr/local/hadoop/share/hadoop/hdfs:/usr/local/hadoop/share/hadoop/hdfs/lib/*:/usr/local/hadoop/share/hadoop/hdfs/*:/usr/local/hadoop/share/hadoop/yarn/lib/*:/usr/local/hadoop/share/hadoop/yarn/*:/usr/local/hadoop/share/hadoop/mapreduce/lib/*:/usr/local/hadoop/share/hadoop/mapreduce/*:/home/hduser/jdk1.8.0_211/lib/tools.jar:/usr/local/hadoop/contrib/capacity-scheduler/*.jar</value>
             </property>

    (3) Error: Could not find or load main class org.apache.spark.deploy.yarn.ExecutorLauncher

         Check spark-defaults.conf for mistakes; this error typically means spark.yarn.jars does not point to the jars uploaded in (1).
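
         To confirm which value actually took effect, start the shell in local mode (pyspark --master local), which still reads spark-defaults.conf, and inspect the live configuration. A small check, using the shell's built-in `spark` session:

             conf = spark.sparkContext.getConf()
             # A misspelled key or wrong path in spark-defaults.conf shows up here.
             print(conf.get("spark.yarn.jars", "not set"))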

