Submitting Spark Jobs

1. Code



pom.xml

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>org.learn.example</groupId>
    <artifactId>spark-example</artifactId>
    <version>1.0-SNAPSHOT</version>

    <dependencies>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.11</artifactId>
            <version>2.2.0</version>
            <scope>provided</scope>
        </dependency>

        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_2.11</artifactId>
            <version>2.2.0</version>
            <scope>provided</scope>
        </dependency>

        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming_2.11</artifactId>
            <version>2.2.0</version>
            <scope>provided</scope>
        </dependency>

        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
            <version>2.2.0</version>
            <scope>provided</scope>
        </dependency>

        <dependency>
            <groupId>org.apache.commons</groupId>
            <artifactId>commons-lang3</artifactId>
            <version>3.1</version>
            <scope>provided</scope>
        </dependency>

        <dependency>
            <groupId>org.apache.kafka</groupId>
            <artifactId>kafka-clients</artifactId>
            <version>1.1.0</version>
        </dependency>
    </dependencies>

    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>3.7.0</version>
                <configuration>
                    <source>1.7</source>
                    <target>1.7</target>
                    <encoding>UTF-8</encoding>
                </configuration>
            </plugin>
        </plugins>
    </build>
</project>


import java.io.Serializable;
import java.lang.reflect.Method;

import org.apache.spark.sql.SparkSession;

public class SparkJob implements Serializable {

    public static void main(String[] args) {
        // The first program argument is the job name, e.g. "WordCount".
        String appName = args[0];

        SparkSession session = SparkSession.builder().appName(appName).getOrCreate();

        try {
            // Resolve the job class by name and invoke its run(SparkSession) method reflectively.
            String fullClassName = "org.learn.example.jobs." + appName;
            Class<?> clazz = Class.forName(fullClassName);
            Method method = clazz.getDeclaredMethod("run", SparkSession.class);
            method.invoke(clazz.newInstance(), session);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
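SparkJob is a single generic entry point: spark-submit passes the job name as args[0], the name is expanded to a fully qualified class under org.learn.example.jobs, and the job's run(SparkSession) method is invoked via reflection. Adding a new job therefore only requires a new implementation in that package; the submit scripts shown below stay unchanged.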



import java.io.Serializable;
import java.text.SimpleDateFormat;

import org.apache.spark.sql.SparkSession;

public interface ISparkJob extends Serializable {
    // Shared timestamp format used to build per-run output paths.
    // Interface fields are implicitly public static final.
    SimpleDateFormat DATE_FORMAT = new SimpleDateFormat("yyyyMMddHHmmss");

    void run(SparkSession session);
}



import java.util.Arrays;
import java.util.Date;
import java.util.Iterator;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.SparkSession;

import scala.Tuple2;

public class WordCount implements ISparkJob {

    @Override
    public void run(SparkSession session) {
        // Build a small in-memory RDD of lines to count.
        JavaRDD<String> javaRDD = session.createDataset(Arrays.asList("aaa bbb", "bbb ccc", "aaa"), Encoders.STRING()).javaRDD();

        javaRDD.flatMap(new FlatMapFunction<String, String>() {
            @Override
            public Iterator<String> call(String line) throws Exception {
                // Split each line into words.
                return Arrays.asList(line.split(" ")).iterator();
            }
        }).mapToPair(new PairFunction<String, String, Integer>() {
            @Override
            public Tuple2<String, Integer> call(String word) throws Exception {
                return new Tuple2<>(word, 1);
            }
        }).reduceByKey(new Function2<Integer, Integer, Integer>() {
            @Override
            public Integer call(Integer v1, Integer v2) throws Exception {
                return v1 + v2;
            }
        }).saveAsTextFile("/home/mr/example/result/" + DATE_FORMAT.format(new Date()));
    }

}
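The anonymous inner classes match the 1.7 compiler target in the pom above. If the target were raised to 1.8, the same pipeline could be written with lambdas; a minimal sketch, not part of the original project:

        javaRDD.flatMap((FlatMapFunction<String, String>) line -> Arrays.asList(line.split(" ")).iterator())
                .mapToPair((PairFunction<String, String, Integer>) word -> new Tuple2<>(word, 1))
                .reduceByKey((Function2<Integer, Integer, Integer>) (v1, v2) -> v1 + v2)
                .saveAsTextFile("/home/mr/example/result/" + DATE_FORMAT.format(new Date()));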

2. Client mode

submit-client.sh

#!/bin/bash

/home/bin/spark-submit --master spark://bd01:7077,bd02:7077 --executor-memory 1G --total-executor-cores 2 --class org.learn.example.SparkJob ../spark-example-1.0-SNAPSHOT.jar ${1}
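For example, to submit the WordCount job defined above (a usage sketch; ${1} in the script becomes args[0] of SparkJob):

./submit-client.sh WordCount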

Explanation

  • If no deploy mode is specified, client mode is the default.
  • bd01 and bd02 must be configured in /etc/hosts:
11.11.11.11 bd01
22.22.22.22 bd02
127.0.0.1 localhost
  • A local jar must be used.
    Testing showed that client mode cannot use a jar on HDFS (version 2.0.2); submission fails with:
Warning: Skip remote jar hdfs:/home/mr/example/spark-example-1.0-SNAPSHOT.jar.
java.lang.ClassNotFoundException: org.learn.example.SparkJob
        at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
        at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
        at java.lang.Class.forName0(Native Method)
        at java.lang.Class.forName(Class.java:270)
        at org.apache.spark.util.Utils$.classForName(Utils.scala:228)
        at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:693)
        at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:185)
        at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:210)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:124)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
  • When packaging the jar, exclude Spark and Hadoop, i.e. mark them as provided in Maven:

When creating assembly jars, list Spark and Hadoop as provided dependencies; these need not be bundled since they are provided by the cluster manager at runtime.
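With Spark and Hadoop marked as provided, only the remaining compile-scope dependencies (here kafka-clients) need to be bundled into the jar. One way to do this is the maven-shade-plugin, which by default leaves provided-scope dependencies out of the shaded jar; a minimal sketch, not part of the original pom, and the plugin version is an assumption:

            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-shade-plugin</artifactId>
                <version>3.1.0</version>
                <executions>
                    <execution>
                        <!-- Build the shaded (assembly) jar during mvn package. -->
                        <phase>package</phase>
                        <goals>
                            <goal>shade</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>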


3. Cluster mode

submit-cluster.sh

#!/bin/bash

/home/bin/spark-submit --master spark://bd01:7077,bd02:7077 --deploy-mode cluster --executor-memory 1G --total-executor-cores 2 --class org.learn.example.SparkJob hdfs:/home/mr/example/spark-example-1.0-SNAPSHOT.jar ${1}
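For example (a sketch; the HDFS path matches the script above), upload the jar to HDFS once and then submit:

hdfs dfs -put spark-example-1.0-SNAPSHOT.jar /home/mr/example/
./submit-cluster.sh WordCount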

Explanation:

  • In cluster mode, the jar must be placed somewhere the whole cluster can reach; if a local path is used, a copy must exist on every node:

The URL must be globally visible inside of your cluster, for instance, an hdfs:// path or a file:// path that is present on all nodes.

Note: if a local jar is used but some nodes do not have it, the job may fail with:

ERROR org.apache.spark.deploy.ClientEndpoint: Exception from cluster was: java.nio.file.NoSuchFileException: /home/mr/example/spark-example-1.0-SNAPSHOT.jar
java.nio.file.NoSuchFileException: /home/mr/example/spark-example-1.0-SNAPSHOT.jar
        at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
        at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
        at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
        at sun.nio.fs.UnixCopyFile.copy(UnixCopyFile.java:520)
        at sun.nio.fs.UnixFileSystemProvider.copy(UnixFileSystemProvider.java:253)
        at java.nio.file.Files.copy(Files.java:1225)
        at org.apache.spark.util.Utils$.org$apache$spark$util$Utils$$copyRecursive(Utils.scala:607)
        at org.apache.spark.util.Utils$.copyFile(Utils.scala:578)
        at org.apache.spark.util.Utils$.doFetchFile(Utils.scala:663)
        at org.apache.spark.util.Utils$.fetchFile(Utils.scala:462)
        at org.apache.spark.deploy.worker.DriverRunner.org$apache$spark$deploy$worker$DriverRunner$$downloadUserJar(DriverRunner.scala:154)
        at org.apache.spark.deploy.worker.DriverRunner$$anon$1.run(DriverRunner.scala:83)

Symptom
Suppose there are three nodes, bd01, bd02, and bd03, each running a worker process.

bd01 and bd02 have the jar; bd03 does not.

When submitting in cluster mode, the job runs correctly if the driver process is started on bd01 or bd02; if the driver is started on bd03, the error above is raised.

Cause
In cluster mode the driver process is launched by a worker process.

To start the driver, the worker needs the code in the jar in order to bring up the SparkContext.

If the jar path configured in the spark-submit script is on HDFS, the worker downloads the jar from HDFS into the per-driver subdirectory of its local work directory (the worker log directory configured by SPARK_WORKER_DIR), then launches the driver from the jar in that driver directory.

If the jar path configured in the spark-submit script is a local directory, the worker looks for the jar in that directory, copies it into the per-driver subdirectory of its local work directory (the worker log directory configured by SPARK_WORKER_DIR), then launches the driver from the jar in that driver directory.
If there is no jar at the path configured in the spark-submit script, the NoSuchFileException above is raised.
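For reference, SPARK_WORKER_DIR is set in conf/spark-env.sh. Judging from the driver paths in the launch log below, on this cluster it points at /data/work (an inferred illustration, not part of the original configuration):

# conf/spark-env.sh
export SPARK_WORKER_DIR=/data/work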

The driver launch command in the log reflects this:

"org.apache.spark.deploy.worker.DriverWrapper" "spark://Worker@bd01:35201" "/data/work/driver-20190413101025-2532/spark-example-1.0-SNAPSHOT.jar" "org.learn.example.SparkJob" "WordCount"

Inspecting the /data/work/driver-20190413101025-2532 directory on bd01:
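The jar fetched by the worker sits there next to the driver's log files; an illustrative listing (assumed contents, the exact files may differ):

$ ls /data/work/driver-20190413101025-2532
spark-example-1.0-SNAPSHOT.jar  stderr  stdout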
