Spark Streaming: updateStateByKey vs. mapWithState

1. updateStateByKey

updateStateByKey maintains the state of every key across the whole stream, and it returns the previous state of every key in every batch, even when that batch contains no new input. If a batch is produced every 5 seconds, then every 5 seconds the state of every key is updated and emitted again.

The drawback is that when the amount of state is large, and the state has to be checkpointed, the checkpoint takes up a considerable amount of storage.

 

To use updateStateByKey you must set a checkpoint directory and enable checkpointing. The per-key state is kept in memory, so if the application crashes, the state built up so far is lost on restart; checkpointing is what allows that state to be preserved and recovered.

 

import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.Optional;   // on Spark 1.x this would be com.google.common.base.Optional
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import scala.Tuple2;

public class SparkUpdateKeyByState {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("SparkUpdateKeyByState").setMaster("local[*]");
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(2));

        // updateStateByKey requires a checkpoint directory so the key state can be recovered
        String checkpointDir = "hdfs://hdfs-cluster/ecommerce/checkpoint/state";
        jssc.checkpoint(checkpointDir);

        JavaReceiverInputDStream<String> messages = jssc.socketTextStream("hadoop-all-01", 9999);
        JavaDStream<String> words = messages.flatMap(new FlatMapFunction<String, String>() {
            public Iterator<String> call(String s) throws Exception {
                return Arrays.asList(s.split(" ")).iterator();
            }
        });

        JavaPairDStream<String, Integer> pairs = words.mapToPair(new PairFunction<String, String, Integer>() {
            public Tuple2<String, Integer> call(String word) throws Exception {
                return new Tuple2<String, Integer>(word, 1);
            }
        });

        // Count words across the whole stream, not just within a single batch
        JavaPairDStream<String, Integer> wordcounts = pairs.updateStateByKey(
            new Function2<List<Integer>, Optional<Integer>, Optional<Integer>>() {
                // valueList: the new values for this key in the current batch; there may be
                // several, e.g. for (hadoop,1)(hadoop,1) the list passed in would be (1,1)
                // oldState: the previous state of this key
                public Optional<Integer> call(List<Integer> valueList, Optional<Integer> oldState) throws Exception {
                    Integer newState = 0;
                    // If oldState is present, this key has been counted before;
                    // otherwise this is the first time the key appears
                    if (oldState.isPresent()) {
                        newState = oldState.get();
                    }

                    // Fold this batch's new values into the state
                    for (Integer value : valueList) {
                        newState += value;
                    }
                    return Optional.of(newState);
                }
            });

        // An output operation is required before start(), otherwise the job fails
        // with "no output operations registered"
        wordcounts.print();

        jssc.start();
        try {
            jssc.awaitTermination();
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
        jssc.close();
    }
}
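To try this out, something needs to be listening on port 9999 of hadoop-all-01 (or localhost, if you change the host); a common way to feed the socket source during testing is a netcat session such as nc -lk 9999, typing words separated by spaces. Each 2-second batch then prints the accumulated count of every word seen so far, whether or not that word appeared in the current batch.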

 

2. mapWithState

mapWithState also tracks the state of every key globally, but if a key receives no new data in a batch, its state is not returned for that batch; the output is incremental.

 

The benefit is that we only see the keys that actually changed; keys with no new input are not emitted again. As a result, even when the amount of state is large, the checkpoint does not grow as large as it does with updateStateByKey.

import kafka.serializer.StringDecoder
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, State, StateSpec, StreamingContext}
import org.apache.spark.streaming.dstream.DStream
import org.apache.spark.streaming.kafka.KafkaUtils

object MapWithStateKafkaDirect extends App {
    val conf = new SparkConf().setAppName("MapWithStateKafkaDirect").setMaster("local[*]")
    val sc = SparkContext.getOrCreate(conf)
    val checkpointDir = "hdfs://hdfs-cluster/ecommerce/checkpoint/2"
    val brokers = "hadoop-all-01:9092,hadoop-all-02:9092,hadoop-all-03:9092"

    def mappingFunction(key: String, value: Option[Int], state: State[Long]): (String, Long) = {
        // Get the previous state of this key (0 if it has never been seen)
        val oldState = state.getOption().getOrElse(0L)
        // Compute the new state
        val newState = oldState + value.getOrElse(0)
        // Update the stored state
        state.update(newState)
        // Return the result for this key
        (key, newState)
    }

    val spec = StateSpec.function[String, Int, Long, (String, Long)](mappingFunction _)

    def creatingFunc(): StreamingContext = {
        val ssc = new StreamingContext(sc, Seconds(5))
        val kafkaParams = Map(
            "metadata.broker.list" -> brokers,
            "group.id" -> "MapWithStateKafkaDirect"
        )
        val topics = Set("count")
        val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topics)
        val results: DStream[(String, Long)] = messages
            .filter(_._2.nonEmpty)
            .mapPartitions(iter => {
                iter.flatMap(_._2.split(" ").map((_, 1)))
            })
            .mapWithState(spec)
        results.print()
        ssc.checkpoint(checkpointDir)
        ssc
    }

    def close(ssc: StreamingContext, millis: Long): Unit = {
        new Thread(new Runnable {
            override def run(): Unit = {
                // Stop the StreamingContext once some condition is met
                // (here simply after a fixed delay)
                Thread.sleep(millis)
                println("Condition met, about to stop the StreamingContext")
                ssc.stop(stopSparkContext = true, stopGracefully = true)
                println("StreamingContext stopped")
            }
        }).start()
    }

    val ssc = StreamingContext.getOrCreate(checkpointDir, creatingFunc)

    ssc.start()
    // Start the shutdown thread before blocking on awaitTermination,
    // otherwise close() would never be reached
    close(ssc, 20000)
    ssc.awaitTermination()
}
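Because mapWithState only emits the keys that received data in the current batch, the output printed above is incremental. If the complete accumulated state is still needed (for example, to write a full snapshot to an external store), the stateSnapshots() method of the resulting MapWithStateDStream returns a (key, state) pair for every tracked key on each batch. Below is a minimal sketch of how the DStream construction inside creatingFunc could be extended; the names incremental and snapshots are only illustrative:

val incremental = messages
    .filter(_._2.nonEmpty)
    .mapPartitions(iter => iter.flatMap(_._2.split(" ").map((_, 1))))
    .mapWithState(spec)

// stateSnapshots() returns the state of every tracked key each batch,
// not only the keys updated in the current batch
val snapshots: DStream[(String, Long)] = incremental.stateSnapshots()
snapshots.print()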

 

 
