Spark Streaming: NullPointerException with Broadcast Variables & Updating Broadcast Variables

I recently used broadcast variables in Spark. The overall logic is to read a blacklist configuration from Redis and broadcast it to the worker nodes for anomaly monitoring, but the job kept throwing a NullPointerException. Searching online turned up all sorts of explanations: that YARN clusters don't support broadcast variables, that Spark Streaming doesn't support updating broadcast variables, that it was a Spark closure problem, and so on. In the end I read the official Spark Streaming documentation, learned the correct way to use broadcast variables, and wrote the process down here.

First, the version of the code that throws the NullPointerException (simplified):

import java.util.*;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.VoidFunction;
import org.apache.spark.broadcast.Broadcast;
import org.apache.spark.streaming.Duration;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka010.ConsumerStrategies;
import org.apache.spark.streaming.kafka010.KafkaUtils;
import org.apache.spark.streaming.kafka010.LocationStrategies;

public class TestBroadcast {

    private static volatile Broadcast<List<String>> broadcast = null; // class-level broadcast variable

    public static void main(String[] args) throws Exception {

        /* Initialize the SparkContext */
        String brokers = ParmUtil.getBrokers();
        String topics = "test";
        SparkConf conf = new SparkConf()
                .setAppName("test-broadcast")
                .set("spark.shuffle.blockTransferService", "nio");
        JavaSparkContext sc = new JavaSparkContext(conf);
        JavaStreamingContext jssc = new JavaStreamingContext(sc, new Duration(60000));

        /* Initialize the broadcast variable */
        String[] arr = new String[3];
        // query Redis and fill arr with the configuration...
        broadcast = sc.broadcast(Arrays.asList(arr));

        /* Read data from Kafka */
        Set<String> topicsSet = new HashSet<>(Arrays.asList(topics.split(",")));
        Map<String, Object> kafkaParams = new HashMap<>();
        kafkaParams.put("bootstrap.servers", brokers); // the kafka-0-10 API requires bootstrap.servers, not metadata.broker.list
        kafkaParams.put("group.id", "test");
        kafkaParams.put("key.deserializer", StringDeserializer.class);
        kafkaParams.put("value.deserializer", StringDeserializer.class);
        JavaInputDStream<ConsumerRecord<String, String>> messages = KafkaUtils.createDirectStream(
                jssc,
                LocationStrategies.PreferConsistent(),
                ConsumerStrategies.Subscribe(topicsSet, kafkaParams));
        JavaDStream<String> lines = messages.map(ConsumerRecord::value);

        /* Use the broadcast variable */
        lines.foreachRDD(new VoidFunction<JavaRDD<String>>() {
            @Override
            public void call(JavaRDD<String> stringJavaRDD) throws Exception {
                System.out.println("broadcast value: " + broadcast.value()); // runs on the driver: prints fine
                JavaRDD<String> alertRdd = stringJavaRDD.filter(new Function<String, Boolean>() {
                    @Override
                    public Boolean call(String s) throws Exception { // runs on the executors
                        String[] arg = s.split("\\u0001");
                        String ip = arg[2];
                        if (!broadcast.value().contains(ip)) // NullPointerException thrown here
                            return false;
                        return true;
                    }
                });
                alertRdd.take(10);
            }
        });

        jssc.start();
        jssc.awaitTermination();
    }
}

As shown above, the broadcast variable is declared at class level, initialized and broadcast in the main method, and then used inside foreachRDD. When run on a YARN cluster, however, the job always throws a NullPointerException at the line if (!broadcast.value().contains(ip)). Strangely, the line System.out.println("broadcast value: " + broadcast.value()) prints the broadcast value just fine. This shows that the broadcast reference is only valid on the driver: the println runs on the driver, while the filter function is executed on the executors, where the class-level field was never assigned.

I then found this explanation on StackOverflow:

This is because your broadcast variable is in class level. And since when the class is initialized in the worker node it will not see the value you assigned in the main method. It will only see a null since the broadcast variable is not initialized to anything. The Solution i found was to pass the broadcast variable to the method when calling the method. This is also the case for Accumulators

Roughly: the broadcast variable is declared at class level and assigned in the main method, which runs on the driver. When the class is loaded on a worker node, the worker never sees that assignment; it only sees the class-level initial value, null, which is exactly the exception our code hit. The fix is to pass (or fetch) the broadcast variable inside the method where it is used. The same applies to accumulators.
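The underlying mechanism can be shown without Spark at all. A minimal plain-Java sketch (names like StaticFieldDemo are made up for illustration): Java serialization captures an object's instance state but never static fields, so a task deserialized in another JVM reads whatever that JVM's static field holds, which, if main() never ran there, is the initial null.

```java
import java.io.*;
import java.util.*;

public class StaticFieldDemo {
    // Mirrors the class-level broadcast reference: static, initially null
    static List<String> config = null;

    // A serializable task, like the anonymous Function shipped to executors
    static class Task implements Serializable {
        String describe() {
            // Reads the static field of the JVM it runs in, not the one it was created in
            return config == null ? "config is null" : "config = " + config;
        }
    }

    static byte[] serialize(Object o) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(o);
        }
        return bos.toByteArray();
    }

    static Object deserialize(byte[] bytes) throws IOException, ClassNotFoundException {
        try (ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return ois.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        config = Arrays.asList("10.0.0.1", "10.0.0.2"); // the "driver" assigns the static field
        byte[] shipped = serialize(new Task());          // the closure is serialized for shipping

        config = null; // simulate a fresh executor JVM where main() never ran
        Task onExecutor = (Task) deserialize(shipped);
        System.out.println(onExecutor.describe());       // the static value did not travel
    }
}
```

Running this prints "config is null": the serialized task carried no trace of the value assigned on the "driver" side, which is precisely why the executors dereference null.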

To understand once and for all how broadcast variables should be used and updated, I went through the official Spark Streaming documentation and rewrote my code following its pattern. This eliminates the NullPointerException and also updates the broadcast variable on every batch. The code:

// imports identical to the previous listing
public class TestBroadcast {

    public static void main(String[] args) throws Exception {
        /* Initialize the SparkContext */
        String brokers = ParmUtil.getBrokers();
        String topics = "test";
        SparkConf conf = new SparkConf()
                .setAppName("test-broadcast")
                .set("spark.shuffle.blockTransferService", "nio");
        JavaSparkContext sc = new JavaSparkContext(conf);
        JavaStreamingContext jssc = new JavaStreamingContext(sc, new Duration(60000));

        /* Read data from Kafka */
        Set<String> topicsSet = new HashSet<>(Arrays.asList(topics.split(",")));
        Map<String, Object> kafkaParams = new HashMap<>();
        kafkaParams.put("bootstrap.servers", brokers); // the kafka-0-10 API requires bootstrap.servers, not metadata.broker.list
        kafkaParams.put("group.id", "test");
        kafkaParams.put("key.deserializer", StringDeserializer.class);
        kafkaParams.put("value.deserializer", StringDeserializer.class);
        JavaInputDStream<ConsumerRecord<String, String>> messages = KafkaUtils.createDirectStream(
                jssc,
                LocationStrategies.PreferConsistent(),
                ConsumerStrategies.Subscribe(topicsSet, kafkaParams));
        JavaDStream<String> lines = messages.map(ConsumerRecord::value);

        /* Fetch the broadcast variable through its initializer, then use it */
        lines.foreachRDD(new VoidFunction<JavaRDD<String>>() {
            @Override
            public void call(JavaRDD<String> stringJavaRDD) throws Exception {
                // obtain (and refresh) the broadcast variable on the driver, once per batch
                Broadcast<List<String>> broadcast = IpList.getInstance(new JavaSparkContext(stringJavaRDD.context()));
                JavaRDD<String> alertRdd = stringJavaRDD.filter(new Function<String, Boolean>() {
                    @Override
                    public Boolean call(String s) throws Exception {
                        String[] arg = s.split("\\u0001");
                        String ip = arg[2];
                        if (!broadcast.value().contains(ip))
                            return false;
                        return true;
                    }
                });
                alertRdd.take(10);
            }
        });

        jssc.start();
        jssc.awaitTermination();
    }
}

class IpList {
    private static volatile Broadcast<List<String>> instance = null;

    /**
     * Initializes the broadcast variable, or rebroadcasts it with fresh values.
     * Called from the driver at the start of every batch.
     **/
    public static Broadcast<List<String>> getInstance(JavaSparkContext jsc) {
        String[] arr = new String[3];
        // query Redis and fill arr with the configuration...
        synchronized (IpList.class) {
            if (instance != null) {
                instance.unpersist(); // release the stale copies cached on the executors
            }
            instance = jsc.broadcast(Arrays.asList(arr));
            return instance;
        }
    }
}

As shown above, the broadcast variable is now initialized inside a method: the driver obtains it at the start of each batch and ships it to the executors, so it can be used without a NullPointerException. And since each call first releases the old value with unpersist() and then re-reads the latest configuration from Redis and broadcasts it again, the broadcast variable is effectively updated dynamically. (Note that unpersist() is asynchronous by default; unpersist(true) blocks until the executors have actually dropped the old copies.)
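The IpList pattern can be isolated from Spark for clarity. A plain-Java sketch of the same idea, where a Supplier stands in for the Redis lookup plus broadcast (RefreshableHolder and fetchFromRedis-style names are placeholders, not part of any Spark API):

```java
import java.util.*;
import java.util.function.Supplier;

// Generic version of IpList: a synchronized holder that refreshes its value on every access
public class RefreshableHolder<T> {
    private volatile T instance = null;
    private final Supplier<T> loader; // stands in for "query Redis, then broadcast"

    public RefreshableHolder(Supplier<T> loader) {
        this.loader = loader;
    }

    public synchronized T getInstance() {
        // In the Spark version, the old broadcast would be unpersist()-ed here
        instance = loader.get(); // always re-fetch, so callers see fresh config
        return instance;
    }

    public static void main(String[] args) {
        List<List<String>> fakeRedis = new ArrayList<>();
        fakeRedis.add(Arrays.asList("1.1.1.1"));
        RefreshableHolder<List<String>> holder =
                new RefreshableHolder<>(() -> fakeRedis.get(fakeRedis.size() - 1));

        System.out.println(holder.getInstance()); // [1.1.1.1]
        fakeRedis.add(Arrays.asList("2.2.2.2"));  // configuration changes in "Redis"
        System.out.println(holder.getInstance()); // [2.2.2.2] -- the refresh is picked up
    }
}
```

The synchronized method plays the same role as the synchronized (IpList.class) block: only one caller at a time swaps the reference, so a reader never observes a half-replaced value.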
