Using Spark DStream's mapWithState and updateStateByKey

mapWithState usage demo

Straight to the core code. First, reading the Kafka messages:

String topics = "topic-test";
Set<String> topicsSet = new HashSet<>(Arrays.asList(topics.split(",")));
JavaInputDStream<ConsumerRecord<String, String>> messages = KafkaUtils.createDirectStream(
        javaStreamingContext,
        LocationStrategies.PreferConsistent(),
        ConsumerStrategies.<String, String>Subscribe(topicsSet, kafkaParams));
// 1-minute window, sliding every 40 seconds
JavaDStream<String> lines = messages.map(ConsumerRecord::value).window(Durations.minutes(1), Durations.seconds(40));
lines.print();
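The snippet above assumes a javaStreamingContext and kafkaParams defined elsewhere. A minimal sketch of that setup, with imports omitted as in the rest of the post (the app name, broker address, group id, and checkpoint path are placeholders of my own):

SparkConf conf = new SparkConf().setAppName("state-demo").setMaster("local[2]");
// 20s batch interval: the 60s window and 40s slide above are both multiples of it.
JavaStreamingContext javaStreamingContext = new JavaStreamingContext(conf, Durations.seconds(20));
// Stateful operators (mapWithState / updateStateByKey) require a checkpoint directory.
javaStreamingContext.checkpoint("/tmp/spark-state-checkpoint");

Map<String, Object> kafkaParams = new HashMap<>();
kafkaParams.put("bootstrap.servers", "localhost:9092");
kafkaParams.put("key.deserializer", StringDeserializer.class);
kafkaParams.put("value.deserializer", StringDeserializer.class);
kafkaParams.put("group.id", "state-demo-group");
kafkaParams.put("auto.offset.reset", "latest");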
        
   

Formatting the data

JavaPairDStream<Tuple2<Integer, Integer>, Integer> result = lines.mapToPair(s -> {
    JSONObject json = JSON.parseObject(s);
    Integer beginCityId = json.getInteger("beginCityId");
    Integer endCityId = json.getInteger("endCityId");
    Tuple2<Integer, Integer> tmpTuple2 = new Tuple2<>(beginCityId, endCityId);
    return new Tuple2<>(tmpTuple2, 1);
}).reduceByKey((s1, s2) -> s1 + s2);
result.print();
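For reference, the parsing code expects JSON messages carrying the two city-id fields, e.g. (values made up):

{"beginCityId": 1, "endCityId": 2}

Each such record becomes the pair ((1, 2), 1), and reduceByKey sums the counts per (beginCityId, endCityId) key within each window.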

The state mapping function

The Function3 receives the key, this batch's new value (as an Optional), and the key's running State:

Function3<Tuple2<Integer, Integer>, Optional<Integer>, State<Integer>, Tuple2<Tuple2<Integer, Integer>, Integer>> mappingFunc =
        (tmpTuple2, one, state) -> {
            // Running total = this batch's count + any previously stored count.
            int sum = one.orElse(0) + (state.exists() ? state.get() : 0);
            Tuple2<Tuple2<Integer, Integer>, Integer> output = new Tuple2<>(tmpTuple2, sum);
            // A state that is timing out must not be updated (see the note below).
            if (!state.isTimingOut()) {
                state.update(sum);
            }
            return output;
        };
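As a concrete illustration (numbers made up): if one batch delivers ((1, 2), 3) and the next delivers ((1, 2), 2), the function emits ((1, 2), 3) and then ((1, 2), 5), leaving the stored state for key (1, 2) at 5.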

Calling mapWithState

JavaMapWithStateDStream<Tuple2<Integer, Integer>, Integer, Integer, Tuple2<Tuple2<Integer, Integer>, Integer>> stateResult =
        result.mapWithState(StateSpec.function(mappingFunc).timeout(Durations.seconds(60)));
stateResult.print();

A note:
I use timeout() here. When a key is timing out, Spark invokes the mapping function one final time with an empty value and state.isTimingOut() == true; calling state.update() in that invocation is not allowed and throws an exception, which is why the Function3 above guards the update with !state.isTimingOut().


updateStateByKey usage demo

Reading the Kafka messages works the same as above. The data preparation differs: each record now carries a timeSystem timestamp, which is used to decide when a key's accumulated result should be removed from the state.

JavaPairDStream<Tuple2<Integer, Integer>, Tuple2<Long, Integer>> result = lines.mapToPair(s -> {
    JSONObject json = JSON.parseObject(s);
    Integer beginCityId = json.getInteger("beginCityId");
    Integer endCityId = json.getInteger("endCityId");
    Long timeStamp = json.getLong("timeSystem");
    Tuple2<Integer, Integer> key = new Tuple2<>(beginCityId, endCityId);
    Tuple2<Long, Integer> value = new Tuple2<>(timeStamp, 1);
    return new Tuple2<>(key, value);
});
result.print();
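The expected message shape here (again, values made up; timeSystem is epoch milliseconds):

{"beginCityId": 1, "endCityId": 2, "timeSystem": 1718000000000}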

The update function

Returning Optional.absent() removes the key from the state; here a key is dropped once its newest timestamp is more than two minutes old:

Function2<List<Tuple2<Long, Integer>>, Optional<Tuple2<Long, Integer>>, Optional<Tuple2<Long, Integer>>> function2 = (values, state) -> {
    Integer updateValue = 0;
    Long newTime = 0L;
    if (state.isPresent()) {
        updateValue = state.get()._2();
        newTime = state.get()._1();
    }
    for (Tuple2<Long, Integer> value : values) {
        updateValue += value._2();
        newTime = value._1();
    }
    // Drop the key if its newest timestamp is older than 2 minutes.
    if (System.currentTimeMillis() - 1000 * 2 * 60 > newTime) {
        return Optional.absent();
    }
    return Optional.of(new Tuple2<>(newTime, updateValue));
};

Calling updateStateByKey

JavaPairDStream<Tuple2<Integer, Integer>, Tuple2<Long, Integer>> result2 = result.updateStateByKey(function2);
result2.print(100);
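Finally, nothing runs until the streaming context is started. The usual driver tail (and remember the checkpoint directory set up earlier, which both stateful operators require):

javaStreamingContext.start();
javaStreamingContext.awaitTermination(); // throws InterruptedException; declare or handle it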

Summary

My rough take: compared with updateStateByKey, mapWithState lets you control the output. It emits only the records whose state was touched in the current batch, whereas updateStateByKey emits the entire accumulated state on every batch (see the stateSnapshots() note above for getting the full state out of mapWithState too). The official recommendation is mapWithState, which also consumes less memory.
