Does code in a Spark Streaming program run on the driver or on the executors?

A word count example:


import kafka.serializer.StringDecoder;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.VoidFunction;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka.KafkaUtils;
import scala.Tuple2;

import java.util.*;

public class WordCountMain {

    public static void main(String[] args) throws InterruptedException {

        System.out.println("=========== app start ============" + System.currentTimeMillis());

        // Configure SparkConf
        SparkConf sparkConf = new SparkConf()
                .setAppName("SparkStreamingKafkaSql")
                // .setMaster("local[4]")
                // Enable the WAL (write-ahead log) to make the data source reliable
                // .set("spark.streaming.receiver.writeAheadLog.enable", "true")
                // Serializer setting
                .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer");

        JavaSparkContext javaSparkContext = new JavaSparkContext(sparkConf);
        // Log level setting
        // javaSparkContext.setLogLevel("WARN");

        JavaStreamingContext jssc =
                new JavaStreamingContext(javaSparkContext,
                        Durations.seconds(10));
        System.out.println("=========== app config ============" + System.currentTimeMillis());

        // Kafka configuration
        Map<String, String> kafkaParams = new HashMap<>();
        kafkaParams.put("bootstrap.servers", "192.168.254.41:6667,192.168.254.42:6667,192.168.254.43:6667");
        kafkaParams.put("zookeeper.connect", "192.168.254.41:2181,192.168.254.42:2181,192.168.254.43:2181");
        kafkaParams.put("group.id", "spark-sql-test-group");
        kafkaParams.put("auto.offset.reset", "largest");
        // Build the topic set
        Set<String> topics = new HashSet<>();
        topics.add("spark-sql-test-topic");
        System.out.println("=========== kafka config ============" + System.currentTimeMillis());


        // Read data from Kafka.
        // The key is the Kafka message key; the value holds the actual message data.
        JavaPairDStream<String, String> inputDStream = KafkaUtils.createDirectStream(
                jssc,
                String.class,
                String.class,
                StringDecoder.class,
                StringDecoder.class,
                kafkaParams,
                topics
        );
        System.out.println("=========== kafka receive data========" + System.currentTimeMillis());

        // Extract the message payload from each pair
        JavaDStream<String> dataDStream = inputDStream.map(x -> x._2);
        System.out.println("=========== map transform data========" + System.currentTimeMillis());

        JavaPairDStream<String, Integer> wordsDStream =
                dataDStream.flatMapToPair(x -> {
                    List<Tuple2<String, Integer>> wordsList = new ArrayList<>();
                    String[] words = x.split(" ");
                    for (String word : words) {
                        wordsList.add(new Tuple2<>(word, 1));
                    }
                    return wordsList.iterator();
                });
        System.out.println("=========== flatMapToPair transform data========" + System.currentTimeMillis());

        JavaPairDStream<String, Integer> resultDStream =
                wordsDStream.reduceByKey(Integer::sum);
        System.out.println("=========== reduceByKey transform data========" + System.currentTimeMillis());


        resultDStream.foreachRDD(rdd -> {
            System.out.println("=========== forearchRdd ========" + System.currentTimeMillis());
            JavaPairRDD<String, Integer> rdd1 =
                    rdd.mapToPair(x -> {
                        System.out.println("---mapToPair----" + x);
                        return new Tuple2<>("A:" + x._1(), x._2());
                    });
            System.out.println("=========== rdd mapToPair ========" + System.currentTimeMillis());
            rdd1.foreach(new VoidFunction<Tuple2<String, Integer>>() {
                @Override
                public void call(Tuple2<String, Integer> tuple2) throws Exception {
                    System.out.println("---forearch----" + tuple2);
                }
            });
            System.out.println("=========== rdd foreach ========" + System.currentTimeMillis());
        });

        System.out.println("=========== app end1 ========" + System.currentTimeMillis());
        jssc.start();
        jssc.awaitTermination();
        System.out.println("=========== app end2 ========" + System.currentTimeMillis());
    }
}

The job is submitted to YARN and run in client mode.

Driver log (only our own println output is kept):

=========== app start ============1589443070147
=========== app config ============1589443082106
=========== kafka config ============1589443082106
=========== kafka receive data========1589443082359
=========== map transform data========1589443082393
=========== flatMapToPair transform data========1589443082450
=========== reduceByKey transform data========1589443082513
=========== app end1 ========1589443082521

=========== forearchRdd ========1589443090150
=========== rdd mapToPair ========1589443090162
=========== rdd foreach ========1589443093749

=========== forearchRdd ========1589443100016
=========== rdd mapToPair ========1589443100018
=========== rdd foreach ========1589443100402

Executor log:

[Screenshot: executor log output]

Code in a Spark Streaming program can sit in three kinds of places:

1. inside the main function
2. inside a DStream operator
3. inside an RDD operator

 

|                   | Inside main()                     | Inside a DStream operator | Inside an RDD operator |
|-------------------|-----------------------------------|---------------------------|------------------------|
| How often it runs | Once, when the application starts | Once per batch            | Once per batch         |
| Where it runs     | Driver                            | Driver                    | Executor               |
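The same split can be annotated directly in code. The sketch below is not part of the original program; lines stands for any JavaDStream<String>, and the comments only mark, under that assumption, where each piece of a typical foreachRDD block runs.

import org.apache.spark.streaming.api.java.JavaDStream;

public class PlacementSketch {

    static void wire(JavaDStream<String> lines) {
        // (1) main()/setup-level code such as this line: runs on the driver, once at startup.
        System.out.println("wiring the job: driver, once");

        lines.foreachRDD(rdd -> {
            // (2) Body of a DStream operator: runs on the driver, once per batch interval.
            long batchTime = System.currentTimeMillis();

            rdd.foreach(line ->
                    // (3) Body of an RDD operator: serialized to the executors and
                    //     run there once per batch, on the hosts holding the partitions.
                    System.out.println(batchTime + " -> " + line));
        });
    }
}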


    A DStream is a wrapper around RDDs, so an operation on a DStream can be viewed as applying the same logic to every RDD inside it. As a result, a DStream's foreachRDD call does not by itself produce a job: no job is submitted for a batch unless its body contains an RDD action.
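In the example above, for instance, the rdd.mapToPair call inside foreachRDD is only a lazy transformation; it is the rdd1.foreach action that actually submits a job each batch. A minimal variation of that block (reusing resultDStream from the example) to make the point:

        resultDStream.foreachRDD(rdd -> {
            // A transformation alone is lazy: this defines work but submits no job.
            JavaPairRDD<String, Integer> prefixed =
                    rdd.mapToPair(t -> new Tuple2<>("A:" + t._1(), t._2()));

            // Only an RDD action (foreach, count, collect, ...) triggers a job for this batch.
            long n = prefixed.count();
            System.out.println("records in this batch: " + n);
        });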

Practical tip

    If the processing inside an RDD operator needs extra configuration data and that data never changes, create the configuration in the main function and broadcast it once. If the configuration changes over time, create it inside the DStream operator (for example inside foreachRDD) and broadcast it there, once per batch.
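A minimal sketch of the two patterns. loadStaticConfig() and loadLatestConfig() are hypothetical helpers standing in for however the configuration is really obtained:

import java.util.HashMap;
import java.util.Map;

import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.broadcast.Broadcast;
import org.apache.spark.streaming.api.java.JavaDStream;

public class BroadcastPatterns {

    // Pattern 1: configuration that never changes.
    // Create and broadcast it once in main()-level code (driver, once); executors read the value.
    static void staticConfig(JavaSparkContext sc, JavaDStream<String> lines) {
        final Broadcast<Map<String, String>> config = sc.broadcast(loadStaticConfig());
        lines.foreachRDD(rdd ->
                rdd.foreach(line ->
                        System.out.println(config.value().get("prefix") + line)));
    }

    // Pattern 2: configuration that changes over time.
    // Re-create and re-broadcast it inside foreachRDD (driver, once per batch),
    // so every batch ships the latest value to the executors.
    static void dynamicConfig(JavaDStream<String> lines) {
        lines.foreachRDD(rdd -> {
            JavaSparkContext sc = JavaSparkContext.fromSparkContext(rdd.context());
            Broadcast<Map<String, String>> config = sc.broadcast(loadLatestConfig());
            rdd.foreach(line ->
                    System.out.println(config.value().get("prefix") + line));
        });
    }

    // Hypothetical helpers: stand-ins for reading the real configuration source.
    private static Map<String, String> loadStaticConfig() {
        Map<String, String> m = new HashMap<>();
        m.put("prefix", "A:");
        return m;
    }

    private static Map<String, String> loadLatestConfig() {
        // In practice this would re-read a file, a database table, etc.
        return loadStaticConfig();
    }
}

Note that pattern 2 creates a new broadcast variable every batch; in a long-running job the previous one can be released with Broadcast.unpersist().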

 
