Straight to the core code.
Read the Kafka messages:
String topics = "topic-test";
Set<String> topicsSet = new HashSet<>(Arrays.asList(topics.split(",")));
JavaInputDStream<ConsumerRecord<String, String>> messages = KafkaUtils.createDirectStream(
    javaStreamingContext,
    LocationStrategies.PreferConsistent(),
    ConsumerStrategies.Subscribe(topicsSet, kafkaParams));
// Keep only the record values, over a 1-minute window that slides every 40 seconds
JavaDStream<String> lines = messages.map(ConsumerRecord::value).window(Durations.minutes(1), Durations.seconds(40));
lines.print();
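A side note on the window call above: with a 1-minute window and a 40-second slide, consecutive windows overlap by 20 seconds, so a record that falls in the overlap is seen by two windows and may be counted twice by any running total downstream. Spark also requires both durations to be multiples of the batch interval. The plain-Java sketch below (no Spark dependency) illustrates this arithmetic. The 20-second batch interval, the class name WindowSketch, and the simplification that window n covers the half-open range ending at n * slide are all assumptions for illustration, not something the original code shows.

```java
import java.util.ArrayList;
import java.util.List;

final class WindowSketch {
    /** Returns the [start, end) range in seconds covered by the n-th window (n >= 1). */
    static long[] windowRange(int n, long windowSec, long slideSec) {
        long end = n * slideSec;
        long start = Math.max(0, end - windowSec);
        return new long[] {start, end};
    }

    /** Lists the 1-based indices of the windows that contain a record arriving at tSec. */
    static List<Integer> windowsContaining(long tSec, long windowSec, long slideSec, int maxWindows) {
        List<Integer> hits = new ArrayList<>();
        for (int n = 1; n <= maxWindows; n++) {
            long[] r = windowRange(n, windowSec, slideSec);
            if (tSec >= r[0] && tSec < r[1]) {
                hits.add(n);
            }
        }
        return hits;
    }

    /** Window length and slide must both be multiples of the batch interval. */
    static boolean validFor(long batchSec, long windowSec, long slideSec) {
        return windowSec % batchSec == 0 && slideSec % batchSec == 0;
    }

    public static void main(String[] args) {
        System.out.println(windowsContaining(30, 60, 40, 5)); // [1, 2]  (in the overlap)
        System.out.println(windowsContaining(50, 60, 40, 5)); // [2]
        System.out.println(validFor(20, 60, 40));             // true  (assumed 20 s batches)
        System.out.println(validFor(30, 60, 40));             // false (40 is not a multiple of 30)
    }
}
```

If each record should contribute to the running total exactly once, one option is to make the slide equal to the window length, or to drop the window and feed each batch directly into the stateful step.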
Format the data:
// Parse each JSON message, key it by (beginCityId, endCityId), and count records per key
JavaPairDStream<Tuple2<Integer, Integer>, Integer> result = lines.mapToPair(s -> {
    JSONObject json = JSON.parseObject(s);
    Integer beginCityId = json.getInteger("beginCityId");
    Integer endCityId = json.getInteger("endCityId");
    Tuple2<Integer, Integer> tmpTuple2 = new Tuple2<>(beginCityId, endCityId);
    return new Tuple2<Tuple2<Integer, Integer>, Integer>(tmpTuple2, 1);
}).reduceByKey((s1, s2) -> s1 + s2);
result.print();
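To make the shape of that result concrete, here is a rough plain-Java equivalent of what mapToPair plus reduceByKey produces for a single batch. It is only a sketch: the class name RouteCountSketch is made up, the input is already-parsed city id pairs rather than JSON strings, and a List<Integer> stands in for the Tuple2 key.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class RouteCountSketch {
    /** Counts records per (beginCityId, endCityId) pair for one batch of trips. */
    static Map<List<Integer>, Integer> countRoutes(int[][] trips) {
        Map<List<Integer>, Integer> counts = new LinkedHashMap<>();
        for (int[] trip : trips) {
            // The pair of city ids is the key; each record contributes 1.
            List<Integer> key = Arrays.asList(trip[0], trip[1]);
            counts.merge(key, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        int[][] trips = {{1, 2}, {1, 2}, {3, 4}};
        System.out.println(countRoutes(trips)); // {[1, 2]=2, [3, 4]=1}
    }
}
```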
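The mapping function that processes the data:
Function3<Tuple2<Integer, Integer>, Optional<Integer>, State<Integer>, Tuple2<Tuple2<Integer, Integer>, Integer>> mapppingFunc =
    (tmpTuple2, one, state) -> {
        // New count for this key plus the total stored so far
        int sum = one.orElse(0) + (state.exists() ? state.get() : 0);
        Tuple2<Tuple2<Integer, Integer>, Integer> output = new Tuple2<>(tmpTuple2, sum);
        // A state that is timing out must not be updated
        if (!state.isTimingOut()) {
            state.update(sum);
        }
        return output;
    };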
Call mapWithState:
JavaMapWithStateDStream<Tuple2<Integer, Integer>, Integer, Integer, Tuple2<Tuple2<Integer, Integer>, Integer>> stateResult =
    result.mapWithState(StateSpec.function(mapppingFunc).timeout(Durations.seconds(60)));
stateResult.print();
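Note:
I use timeout() here. When a key has received no data for the timeout period, the Function3 is called one last time for it with state.isTimingOut() returning true. If the function calls state.update() without checking state.isTimingOut() first, it throws an exception.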
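The effect of that guard can be reproduced without Spark. In the sketch below, FakeState is a hypothetical stand-in for Spark's State<Integer> that only mimics the "no update while timing out" rule, and the exception type it throws is a placeholder rather than whatever Spark actually raises. The mapping logic is the same as in the function above, with a flag to switch the guard off.

```java
import java.util.Optional;

final class TimeoutGuardSketch {
    /** Hypothetical stand-in for Spark's State<Integer>; not the real class. */
    static final class FakeState {
        private Integer value;
        private final boolean timingOut;

        FakeState(Integer value, boolean timingOut) {
            this.value = value;
            this.timingOut = timingOut;
        }

        boolean exists() { return value != null; }
        int get() { return value; }
        boolean isTimingOut() { return timingOut; }

        void update(int newValue) {
            // Placeholder exception type; it only models "update is rejected".
            if (timingOut) {
                throw new IllegalStateException("cannot update a state that is timing out");
            }
            value = newValue;
        }
    }

    /** Adds the new count to the stored sum; updates the state only if allowed (or if guard is off). */
    static int mappingFunc(Optional<Integer> one, FakeState state, boolean guard) {
        int sum = one.orElse(0) + (state.exists() ? state.get() : 0);
        if (!guard || !state.isTimingOut()) {
            state.update(sum);
        }
        return sum;
    }

    public static void main(String[] args) {
        FakeState live = new FakeState(5, false);
        System.out.println(mappingFunc(Optional.of(3), live, true));        // 8
        FakeState expiring = new FakeState(5, true);
        System.out.println(mappingFunc(Optional.empty(), expiring, true));  // 5, state left untouched
        try {
            mappingFunc(Optional.empty(), expiring, false);                 // guard removed
        } catch (IllegalStateException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```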
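For the updateStateByKey version, reading the Kafka messages is the same as above. The data processing step now also carries the timeSystem field, which is used to decide when a key's aggregated result should be deleted.
JavaPairDStream<Tuple2<Integer, Integer>, Tuple2<Long, Integer>> result = lines.mapToPair(s -> {
    JSONObject json = JSON.parseObject(s);
    Integer beginCityId = json.getInteger("beginCityId");
    Integer endCityId = json.getInteger("endCityId");
    Long timeStamp = json.getLong("timeSystem");
    Tuple2<Integer, Integer> key = new Tuple2<>(beginCityId, endCityId);
    // Value is (timestamp, count of 1)
    Tuple2<Long, Integer> value = new Tuple2<>(timeStamp, 1);
    return new Tuple2<Tuple2<Integer, Integer>, Tuple2<Long, Integer>>(key, value);
});
result.print();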
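The update function:
Function2<List<Tuple2<Long, Integer>>, Optional<Tuple2<Long, Integer>>, Optional<Tuple2<Long, Integer>>> function2 = (values, state) -> {
    Integer updateValue = 0;
    Long newTime = 0L;
    // Start from the previously stored (timestamp, count), if any
    if (state.isPresent()) {
        updateValue = state.get()._2();
        newTime = state.get()._1();
    }
    // Add this batch's counts and keep the timestamp of the last record seen
    for (Tuple2<Long, Integer> value : values) {
        updateValue += value._2();
        newTime = value._1();
    }
    // No record in the last 2 minutes: return absent so the key is removed from the state
    if (System.currentTimeMillis() - 1000 * 2 * 60 > newTime) {
        return Optional.absent();
    }
    return Optional.of(new Tuple2<Long, Integer>(newTime, updateValue));
};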
Call updateStateByKey:
JavaPairDStream<Tuple2<Integer, Integer>, Tuple2<Long, Integer>> result2 = result.updateStateByKey(function2);
result2.print(100);
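The expiry rule in the update function is easier to check when the clock is passed in instead of read from System.currentTimeMillis(). The following plain-Java sketch does that. It is not the Spark code: java.util.Optional replaces Spark's Optional, a long[] of {timestampMs, count} replaces the Tuple2, and the names ExpirySketch, TTL_MS and nowMs are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

final class ExpirySketch {
    /** Keys whose last record is older than this are dropped (2 minutes, as in the text). */
    static final long TTL_MS = 2 * 60 * 1000L;

    /**
     * values and state hold {timestampMs, count} pairs.
     * Returns the new state, or empty to remove the key.
     */
    static Optional<long[]> update(List<long[]> values, Optional<long[]> state, long nowMs) {
        long lastSeen = state.map(s -> s[0]).orElse(0L);
        long count = state.map(s -> s[1]).orElse(0L);
        for (long[] v : values) {
            count += v[1];
            lastSeen = v[0];
        }
        if (nowMs - TTL_MS > lastSeen) {
            return Optional.empty();
        }
        return Optional.of(new long[] {lastSeen, count});
    }

    public static void main(String[] args) {
        List<long[]> batch = new ArrayList<>();
        batch.add(new long[] {990_000L, 1L});
        batch.add(new long[] {995_000L, 1L});

        // First batch: two records, state is created.
        Optional<long[]> s1 = update(batch, Optional.empty(), 1_000_000L);
        System.out.println(s1.get()[1]); // 2

        // Later batch with no records, more than 2 minutes after the last one: key is dropped.
        Optional<long[]> s2 = update(new ArrayList<>(), s1, 995_000L + TTL_MS + 1);
        System.out.println(s2.isPresent()); // false
    }
}
```

Because updateStateByKey calls the update function for every key in the state on each batch, including keys with no new values, this check runs even when a key has gone quiet, which is what lets it expire.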
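My rough take: compared with updateStateByKey, mapWithState gives you control over what is emitted and can output only the results that were updated in the current batch, whereas updateStateByKey outputs every aggregated result on every batch. The official recommendation is to use mapWithState, which is also said to consume less memory.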
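The output difference can be mimicked in a few lines of plain Java. This is only a toy model under simplifying assumptions: keys are strings, timeouts are ignored, and the class and method names (OutputSketch, mapWithStateBatch, updateStateByKeyBatch, batchOf) are invented for the illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class OutputSketch {
    private final Map<String, Integer> state = new LinkedHashMap<>();

    /** Builds a batch where each listed key contributes a count of 1. */
    static Map<String, Integer> batchOf(String... keys) {
        Map<String, Integer> batch = new LinkedHashMap<>();
        for (String k : keys) {
            batch.merge(k, 1, Integer::sum);
        }
        return batch;
    }

    /** mapWithState-style: emits one record per key that appears in this batch. */
    Map<String, Integer> mapWithStateBatch(Map<String, Integer> batch) {
        Map<String, Integer> emitted = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> e : batch.entrySet()) {
            int sum = state.merge(e.getKey(), e.getValue(), Integer::sum);
            emitted.put(e.getKey(), sum);
        }
        return emitted;
    }

    /** updateStateByKey-style: emits the whole state after applying this batch. */
    Map<String, Integer> updateStateByKeyBatch(Map<String, Integer> batch) {
        batch.forEach((k, v) -> state.merge(k, v, Integer::sum));
        return new LinkedHashMap<>(state);
    }

    public static void main(String[] args) {
        OutputSketch m = new OutputSketch();
        m.mapWithStateBatch(batchOf("a", "b"));
        System.out.println(m.mapWithStateBatch(batchOf("a")));     // {a=2}

        OutputSketch u = new OutputSketch();
        u.updateStateByKeyBatch(batchOf("a", "b"));
        System.out.println(u.updateStateByKeyBatch(batchOf("a"))); // {a=2, b=1}
    }
}
```

When the full state is needed with mapWithState, the returned JavaMapWithStateDStream also offers stateSnapshots().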