Updating a Broadcast Variable in Spark Streaming

Recently, while building a Spark Streaming job, I needed an external list of filter conditions inside a filter function, and that list is updated from time to time. At first I simply fetched the list in the main function, but it turned out that a streaming program does not re-run main on every batch trigger: code outside the RDD dependency chain of the Spark DAG (as I understand it, the part that executes on the driver) runs only once, when the streaming application starts, so the list could never be refreshed in real time. I then came across broadcast variables and tried updating the broadcast inside foreachRDD. Here is the code:

private static volatile Broadcast<Set<String>> broadcast = null;

SparkConf conf = new SparkConf()
                .setMaster("local[*]")
                .setAppName("venus-monitor")
                .set("spark.shuffle.blockTransferService", "nio");
JavaSparkContext sc = new JavaSparkContext(conf);
sc.setLogLevel("WARN");
JavaStreamingContext jssc = new JavaStreamingContext(sc, new Duration(60000));

broadcast = sc.broadcast(getDomainSet());

JavaDStream<String> computelog = null; // creation of the actual DStream is omitted in the post
computelog.foreachRDD(new VoidFunction<JavaRDD<String>>() {
    @Override
    public void call(JavaRDD<String> stringJavaRDD) throws Exception {
        // read the current broadcast value
        broadcast.value();
        // release the old broadcast value
        broadcast.unpersist();
        // re-broadcast to refresh the list
        broadcast = sc.broadcast(getDomainSet());
    }
});
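
For reference, getDomainSet() is never shown in the post. Below is a minimal sketch of what it might look like, assuming the filter conditions are stored one per line in a local file; the path /data/conf/filters.txt and the file format are placeholders, not from the original post:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashSet;
import java.util.Set;

// Hypothetical implementation of getDomainSet(); the real source of the
// filter list (file, database, config service) is not shown in the post.
private static Set<String> getDomainSet() {
    Set<String> domains = new HashSet<>();
    try {
        // placeholder path, one filter entry per line
        for (String line : Files.readAllLines(Paths.get("/data/conf/filters.txt"))) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) {
                domains.add(trimmed);
            }
        }
    } catch (IOException e) {
        // on read failure, return whatever was collected; logging omitted
    }
    return domains;
}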

Running it fails with: DStream checkpointing has been enabled but the DStreams with their functions are not serializable

16:27:00,236 ERROR [main] internal.Logging$class (Logging.scala:91) - Error starting the context, marking it as stopped
java.io.NotSerializableException: DStream checkpointing has been enabled but the DStreams with their functions are not serializable
org.apache.spark.api.java.JavaSparkContext
Serialization stack:
	- object not serializable (class: org.apache.spark.api.java.JavaSparkContext, value: org.apache.spark.api.java.JavaSparkContext@3aa41da1)
	- field (class: com.pingan.cdn.log.VenusMonitor$3, name: val$sc, type: class org.apache.spark.api.java.JavaSparkContext)
	- object (class com.pingan.cdn.log.VenusMonitor$3, com.pingan.cdn.log.VenusMonitor$3@26586b74)
	- field (class: org.apache.spark.streaming.api.java.JavaDStreamLike$$anonfun$foreachRDD$1, name: foreachFunc$1, type: interface org.apache.spark.api.java.function.VoidFunction)
	- object (class org.apache.spark.streaming.api.java.JavaDStreamLike$$anonfun$foreachRDD$1, <function1>)
	- field (class: org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1$$anonfun$apply$mcV$sp$3, name: cleanedF$1, type: interface scala.Function1)
	- object (class org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1$$anonfun$apply$mcV$sp$3, <function1>)
	- writeObject data (class: org.apache.spark.streaming.dstream.DStream)
	- object (class org.apache.spark.streaming.dstream.ForEachDStream, org.apache.spark.streaming.dstream.ForEachDStream@77a074b4)
	- element of array (index: 0)
	- array (class [Ljava.lang.Object;, size 16)
	- field (class: scala.collection.mutable.ArrayBuffer, name: array, type: class [Ljava.lang.Object;)
	- object (class scala.collection.mutable.ArrayBuffer, ArrayBuffer(org.apache.spark.streaming.dstream.ForEachDStream@77a074b4))
	- writeObject data (class: org.apache.spark.streaming.dstream.DStreamCheckpointData)
	- object (class org.apache.spark.streaming.dstream.DStreamCheckpointData, [
0 checkpoint files 

After some digging, the cause turned out to be that the JavaSparkContext sc captured by the closure is not serializable. Replacing sc.broadcast(getDomainSet()) with a re-broadcast through the SparkContext obtained from the RDD itself fixes the problem:

public void call(JavaRDD<String> stringJavaRDD) throws Exception {
    // read the current broadcast value
    broadcast.value();
    // release the old broadcast value
    broadcast.unpersist();
    // re-broadcast through the SparkContext taken from the RDD to refresh the list
    broadcast = stringJavaRDD.context().broadcast(getDomainSet(), ClassManifestFactory.classType(Set.class));
}
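
To close the loop, here is a hedged sketch of how the per-batch filtering could consume the refreshed broadcast; the matching logic and the count/print are illustrative assumptions, not from the original post. Note that the Broadcast handle is first copied into a local variable: a static field is not serialized with the closure (and would be null on executor JVMs), whereas a captured Broadcast handle is shipped and lets executors read the value from their local broadcast cache.

computelog.foreachRDD(new VoidFunction<JavaRDD<String>>() {
    @Override
    public void call(JavaRDD<String> stringJavaRDD) throws Exception {
        // copy the static handle into a local so the lambda captures the
        // serializable Broadcast reference, not the static field
        final Broadcast<Set<String>> current = broadcast;
        // illustrative match: keep lines that mention any filtered domain
        long matched = stringJavaRDD
                .filter(line -> current.value().stream().anyMatch(line::contains))
                .count();
        System.out.println("lines matching the filter list: " + matched);
        // then refresh for the next batch, exactly as in the fix above
        current.unpersist();
        broadcast = stringJavaRDD.context().broadcast(getDomainSet(), ClassManifestFactory.classType(Set.class));
    }
});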
