A working demo of submitting Spark to YARN

    The job is submitted from the command line in Xshell via spark-submit. If the cluster is configured with Kerberos, authentication has to be performed inside the packaged jar; the credential files (keytab, etc.) must be uploaded to a node and distributed to every node, and the nodes need passwordless SSH access to each other.
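
The Kerberos login itself is not shown in the demo below (it only imports kerberos.KerberosService). As a rough sketch of what such a helper might do with Hadoop's UserGroupInformation API (the principal and keytab path here are placeholders, not the author's actual values):

package kerberos;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.security.UserGroupInformation;

// Sketch of a keytab-based login helper. The principal and keytab path are
// placeholders; the keytab must already be distributed to every node.
public class KerberosService {

    public static void login() throws Exception {
        Configuration conf = new Configuration();
        conf.set("hadoop.security.authentication", "kerberos");
        UserGroupInformation.setConfiguration(conf);
        UserGroupInformation.loginUserFromKeytab(
                "spark_user@EXAMPLE.COM",                     // placeholder principal
                "/etc/security/keytabs/spark_user.keytab");   // placeholder keytab path
        System.out.println("Kerberos login ok: " + UserGroupInformation.getCurrentUser());
    }
}

Calling KerberosService.login() before creating the SparkContext is one place to put this; adapt it to however your cluster distributes the credentials.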

    Because the program is submitted through spark-submit, the SparkConf in the code sets the master as follows:

.setMaster("yarn-cluster")

If the submission reports ClassNotFound, the current user may not have permission to access the packaged jar on the cluster, or the --master yarn-cluster option may be missing. Note that --master yarn-cluster has to be consistent with .setMaster("yarn-cluster"); if they disagree you get Connection exceptions between nodes, while Xshell keeps showing the application stuck in ACCEPTED.
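
One way to rule out the permission cause (a sketch using the Hadoop FileSystem API; the jar path is the one from the submit command below, adjust as needed) is to print the jar's owner and permission bits and compare them with the submitting user:

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Prints owner, group and permission of the packaged jar on HDFS.
public class CheckJarPermission {
    public static void main(String[] args) throws Exception {
        Path jar = new Path("hdfs://1.2.3.4:8020/bulkload/Jars/sub/SparkDemo.jar");
        FileSystem fs = FileSystem.get(URI.create("hdfs://1.2.3.4:8020"), new Configuration());
        FileStatus status = fs.getFileStatus(jar);
        System.out.println("owner=" + status.getOwner()
                + " group=" + status.getGroup()
                + " permission=" + status.getPermission());
        fs.close();
    }
}

On a Kerberized cluster this check itself also needs the Kerberos login sketched above.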

The spark-submit command is:

 spark-submit \
 --class sparkDemo.ParseAttachment \
 --master yarn-cluster \
 --num-executors 5 \
 --driver-memory 5g \
 --driver-cores 4 \
 --executor-memory 10g \
 --executor-cores 5 \
 hdfs://1.2.3.4:8020/bulkload/Jars/sub/SparkDemo.jar \
 "param1" "param2" "param3"

Here, the params are passed straight into the Spark program's main method as args[].
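
For illustration (the demo class below does not actually read its arguments), picking them up inside main looks like this:

// Hypothetical: how the three submitted parameters arrive in main().
public class ArgsDemo {
    public static void main(String[] args) {
        String param1 = args.length > 0 ? args[0] : "";
        String param2 = args.length > 1 ? args[1] : "";
        String param3 = args.length > 2 ? args[2] : "";
        System.out.println(param1 + " " + param2 + " " + param3);
    }
}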

Below is the class to be packaged:

package sparkDemo;

import java.util.Arrays;

import kerberos.KerberosService;

import org.apache.hadoop.conf.Configuration;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.api.java.function.VoidFunction;

import scala.Tuple2;

import common.HadoopUtil;

public class ParseAttachment {
	
    public static void main(String[] args) {
    	
		
        SparkConf conf_ = new SparkConf()
                .setMaster("yarn-cluster")
                .setAppName("parseAttachment");

        JavaSparkContext sc = new JavaSparkContext(conf_);
        JavaRDD<String> text = sc.textFile(HadoopUtil.hdfs_url + "/bulkload/Spark_in");

        System.out.println("ok");

        // Split every line into words.
        JavaRDD<String> words = text.flatMap(new FlatMapFunction<String, String>() {
            private static final long serialVersionUID = 1L;
            @Override
            public Iterable<String> call(String line) throws Exception {
                return Arrays.asList(line.split(" ")); // turn the line into a list of words
            }
        });

        // Map every word to a (word, 1) pair.
        JavaPairRDD<String, Integer> pairs = words.mapToPair(new PairFunction<String, String, Integer>() {
            private static final long serialVersionUID = 1L;
            @Override
            public Tuple2<String, Integer> call(String word) throws Exception {
                return new Tuple2<String, Integer>(word, 1);
            }
        });

        // Sum the counts per word.
        JavaPairRDD<String, Integer> results = pairs.reduceByKey(new Function2<Integer, Integer, Integer>() {
            private static final long serialVersionUID = 1L;
            @Override
            public Integer call(Integer value1, Integer value2) throws Exception {
                return value1 + value2;
            }
        });

        // Swap to (count, word) so the pairs can be sorted by count.
        JavaPairRDD<Integer, String> temp = results.mapToPair(new PairFunction<Tuple2<String, Integer>, Integer, String>() {
            private static final long serialVersionUID = 1L;
            @Override
            public Tuple2<Integer, String> call(Tuple2<String, Integer> tuple)
                    throws Exception {
                return new Tuple2<Integer, String>(tuple._2, tuple._1);
            }
        });

        // Sort by count in descending order, then swap back to (word, count).
        JavaPairRDD<String, Integer> sorted = temp.sortByKey(false).mapToPair(new PairFunction<Tuple2<Integer, String>, String, Integer>() {
            private static final long serialVersionUID = 1L;
            @Override
            public Tuple2<String, Integer> call(Tuple2<Integer, String> tuple)
                    throws Exception {
                return new Tuple2<String, Integer>(tuple._2, tuple._1);
            }
        });

        // Print the sorted results (output goes to the executor logs).
        sorted.foreach(new VoidFunction<Tuple2<String, Integer>>() {
            private static final long serialVersionUID = 1L;
            @Override
            public void call(Tuple2<String, Integer> tuple) throws Exception {
                System.out.println("word:" + tuple._1 + " count:" + tuple._2);
            }
        });

        sc.close();
    }
}

As for the Spark dependency: just add spark-assembly-1.5.2-hadoop2.6.0.jar to the build path.

 

Notes:

1. If you hit the error "The directory item limit is exceeded: limit=1048576": Spark batch jobs typically write a very large number of output files, and HDFS caps how many items a single directory may contain (dfs.namenode.fs-limits.max-directory-items, default 1048576). See https://blog.csdn.net/sparkexpert/article/details/51852944 for how to raise the limit in hdfs-site.xml.

2. Because the number of files processed is usually huge, be strict about coding hygiene: close any file streams you open as soon as you are done with them, and if memory is still tight, adjust the allocation via spark.yarn.executor.memoryOverhead (see the sketch after this list).

3. If you see the error YarnScheduler: Lost executor, raise the ack wait timeout, e.g. to 10 minutes: --conf spark.core.connection.ack.wait.timeout=600 (this can also be set on the SparkConf, as sketched below).
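
For notes 2 and 3, the same settings can also be baked into the SparkConf instead of being passed with --conf on the command line. A sketch (the values are illustrative, not recommendations):

import org.apache.spark.SparkConf;

// Builds a SparkConf with the tuning knobs mentioned in the notes above.
public class TunedConf {
    public static SparkConf build() {
        return new SparkConf()
                .setMaster("yarn-cluster")
                .setAppName("parseAttachment")
                // extra off-heap headroom per executor, in MB (note 2)
                .set("spark.yarn.executor.memoryOverhead", "2048")
                // wait up to 600 s for connection acks before treating an executor as lost (note 3)
                .set("spark.core.connection.ack.wait.timeout", "600");
    }
}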
