Author: 王东阳
Preface
In Flink, state is used to hold intermediate computation results or cached data. Depending on whether intermediate results need to be kept, computation is classified as stateless or stateful. In stream processing, events arrive continuously; if each computation is independent of earlier or later events, the computation is stateless. If a computation depends on previous or subsequent events, it is stateful.
Based on whether the data set is partitioned by key, Flink distinguishes two kinds of state: Keyed State and Operator State (Non-keyed State). This article focuses on Keyed State; Operator State will be covered in a follow-up article.
Keyed State is state bound to a key and can only be used in functions and operators applied to a KeyedStream. Keyed State is a special case of Operator State; the difference is that Keyed State is partitioned by key in advance, and each keyed state corresponds to exactly one combination of operator and key. Keyed State is managed through Key Groups, which are used to redistribute keyed state automatically when the operator's parallelism changes. At runtime, each parallel instance of a keyed operator processes the keys of one or more Key Groups.
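Key Groups are tied to the job's maximum parallelism: the number of key groups equals the configured maximum parallelism, and each keyed operator instance is assigned a contiguous range of key groups. As a minimal sketch (the values 128 and 4 below are illustrative choices, not requirements), pinning the maximum parallelism looks like this:
// Sketch: fix the maximum parallelism, and therefore the number of key groups,
// so keyed state can be redistributed deterministically when the job is rescaled.
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setMaxParallelism(128); // 128 key groups (illustrative value)
env.setParallelism(4);      // each of the 4 keyed operator instances handles 32 key groups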
Flink defines several kinds of Keyed State for different data structures:
ValueState<T>
ListState<T>
MapState<UK, UV>
ReducingState<T>
AggregatingState<IN, OUT>
State Development in Practice
This section demonstrates, with real project code, how to use the different state data structures in different computation scenarios.
The pom file for the project used in this article
ValueState
ValueState<T> keeps a single value for each key; the value is updated with update(T) and read back with T value().
Because ValueState<T> stores only one value per key, it is well suited to per-key statistics such as counting how many transactions each user has made.
Next, we use a concrete code example to show how to use ValueState to count the number of transactions for each user.
Implementing ValueStateCountUser
Define a ValueState<Integer> field named countState to hold each user's transaction count, and extend RichMapFunction<Tuple2<String, String>, Tuple2<String, Integer>>.
In the open(Configuration parameters) method, countState is initialized. Flink provides the RuntimeContext for accessing state data; the RuntimeContext exposes accessors for the common managed keyed state types, which are obtained by creating the corresponding StateDescriptor and calling the matching RuntimeContext method. For example, a ValueState is obtained with ValueState<T> getState(ValueStateDescriptor<T>).
In the Tuple2<String, Integer> map(Tuple2<String, String> s) method, read the current count from countState; if it is null, initialize it to 1, otherwise increment it, then update countState and emit a tuple of the user name and the new count.
import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;

public class ValueStateCountUser extends RichMapFunction<Tuple2<String, String>, Tuple2<String, Integer>> {

    // per-key state: how many records have been seen for the current user
    private ValueState<Integer> countState;

    public ValueStateCountUser() {}

    @Override
    public void open(Configuration parameters) throws Exception {
        // obtain the keyed state handle from the runtime context via a descriptor
        countState = getRuntimeContext().getState(new ValueStateDescriptor<>("count", Integer.class));
    }

    @Override
    public Tuple2<String, Integer> map(Tuple2<String, String> s) throws Exception {
        Integer count = countState.value();
        if (count == null) {
            // first record for this key
            countState.update(1);
            return Tuple2.of(s.f0, 1);
        }
        countState.update(count + 1);
        return Tuple2.of(s.f0, count + 1);
    }

    @Override
    public void close() throws Exception {
        System.out.println(String.format("finally %d", countState.value()));
    }
}
Code: ValueStateCountUser.java
Verification code
The verification code is as follows:
private static void valueStateUserCount() throws Exception {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    env.setParallelism(1);
    DataStream<Tuple2<String, String>> inputStream = env.fromElements(
            new Tuple2<>("zhangsan", "aa"),
            new Tuple2<>("lisi", "aa"),
            new Tuple2<>("zhangsan", "aa"),
            new Tuple2<>("lisi", "aa"),
            new Tuple2<>("wangwu", "aa")
    );
    inputStream.keyBy(new KeySelector<Tuple2<String, String>, String>() {
                @Override
                public String getKey(Tuple2<String, String> integerLongTuple2) throws Exception {
                    return integerLongTuple2.f0;
                }
            })
            .map(new ValueStateCountUser())
            .print();
    env.execute("StateCompute");
}
Code: valueStateUserCount
Running the program produces the following output:
(zhangsan,1)
(lisi,1)
(zhangsan,2)
(lisi,2)
(wangwu,1)
finally 1
For each element in the stream, the count is keyed by the user name and the user's running visit count is emitted in real time. The final line, finally 1, is printed by close(); at that point the state backend's current key is still the last key processed (wangwu), so countState.value() returns its count of 1.
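If the counter needs to be reset at some point, the entry for the current key can be removed with clear(), which every keyed state type supports. The following is only a sketch under assumed requirements; the threshold of 3 and the helper name resetIfNeeded are made up for illustration and are not part of the original example.
// Hypothetical helper inside ValueStateCountUser: drop the counter for the
// currently active key once it reaches an illustrative threshold of 3.
private void resetIfNeeded() throws Exception {
    Integer count = countState.value();
    if (count != null && count >= 3) {
        countState.clear(); // removes the state entry for the current key only
    }
}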
ListState
ListState<T> keeps a list of elements for each key; elements are appended with add(T) or addAll(List<T>), and the stored elements are read back as an Iterable<T> via get().
Since ListState holds the list of elements associated with a key, a typical use case is, for example, a ListState that stores the IP addresses a user frequently visits from.
Next, we use a concrete code example to show how to use ListState to collect the list of IP addresses each user has accessed from. To demonstrate a broader range of State usage, this example processes the stream elements with a KeyedProcessFunction.
Implementing ListStateUserIP
Declare a ListState<String> field named ipState to hold the IP addresses the user has visited from.
In open(Configuration parameters), ipState is initialized with getRuntimeContext().getListState, which takes a ListStateDescriptor as its argument.
In processElement, the IP address in the current record is appended to ipState; then all IP addresses visited so far are read from ipState and emitted to the Collector<Tuple2<String, List<String>>> together with the user name.
import com.google.common.collect.Lists;
import java.util.List;
import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

public class ListStateUserIpKeyedProcessFunction
        extends KeyedProcessFunction<String, Tuple2<String, String>, Tuple2<String, List<String>>> {

    // per-key state: the IP addresses this user has visited from
    private ListState<String> ipState;

    @Override
    public void open(Configuration parameters) throws Exception {
        ipState = getRuntimeContext().getListState(new ListStateDescriptor<>("ipstate", String.class));
    }

    @Override
    public void processElement(
            Tuple2<String, String> value,
            KeyedProcessFunction<String, Tuple2<String, String>, Tuple2<String, List<String>>>.Context context,
            Collector<Tuple2<String, List<String>>> collector) throws Exception {
        // append the current IP and emit everything collected so far for this user
        ipState.add(value.f1);
        List<String> ips = Lists.newArrayList(ipState.get());
        collector.collect(Tuple2.of(value.f0, ips));
    }
}
Code: ListStateUserIpKeyedProcessFunction.java
Verification code
The verification program is as follows:
private static void listStateUserIP() throws Exception {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    env.setParallelism(1);
    DataStream<Tuple2<String, String>> inputStream = env.fromElements(
            new Tuple2<>("zhangsan", "192.168.0.1"),
            new Tuple2<>("lisi", "192.168.0.1"),
            new Tuple2<>("zhangsan", "192.168.0.2"),
            new Tuple2<>("lisi", "192.168.0.3"),
            new Tuple2<>("wangwu", "192.168.0.1")
    );
    inputStream.keyBy(new KeySelector<Tuple2<String, String>, String>() {
                @Override
                public String getKey(Tuple2<String, String> integerLongTuple2) throws Exception {
                    return integerLongTuple2.f0;
                }
            })
            .process(new ListStateUserIpKeyedProcessFunction())
            .print();
    env.execute("listStateUserIP");
}
Code: listStateUserIP
The output of the run is as follows:
(zhangsan,[192.168.0.1])
(lisi,[192.168.0.1])
(zhangsan,[192.168.0.1, 192.168.0.2])
(lisi,[192.168.0.1, 192.168.0.3])
(wangwu,[192.168.0.1])
This matches the expected result.
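Besides add(T), ListState also provides addAll(List<T>) and update(List<T>), where update replaces the entire stored list. As a hedged sketch (the cap of 10 addresses is an illustrative choice, not part of the original example), the list kept in ipState could be bounded inside processElement like this:
// Sketch: keep only the 10 most recent IP addresses for the current key by
// rewriting the whole list with update(); the cap of 10 is illustrative.
List<String> ips = Lists.newArrayList(ipState.get());
if (ips.size() > 10) {
    ipState.update(Lists.newArrayList(ips.subList(ips.size() - 10, ips.size())));
}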
MapState
MapState<UK, UV> stores state as a map of key/value pairs. Its main methods are:
get() retrieves the value for a given key
put() and putAll() update one or several entries
remove() deletes the entry for a given key
contains() checks whether a given key exists
isEmpty() checks whether the map is empty
Similar to the HashMap interface, MapState also exposes entries(), keys(), and values() to obtain the set of entries, keys, or values.
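For example, a MapState<String, Integer> such as the minMaxState used below can be inspected inside a keyed function roughly as follows. This is only a sketch; it assumes java.util.Map is imported and that the state has already been initialized in open().
// Sketch: iterate all entries of the MapState for the current key.
if (!minMaxState.isEmpty()) {
    for (Map.Entry<String, Integer> entry : minMaxState.entries()) {
        System.out.println(entry.getKey() + " = " + entry.getValue());
    }
}
// remove() deletes a single entry, here the "min" entry.
minMaxState.remove("min");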
Next, we use a concrete example to show how to use MapState<String, Integer> to track each user's lowest and highest body temperature.
Implementing the KeyedProcessFunction
Define a MapState<String, Integer> field named minMaxState, in which the keys min and max record the user's lowest and highest temperature seen so far.
In the open method, minMaxState is initialized with getRuntimeContext().getMapState, which takes a MapStateDescriptor as its argument; the descriptor's second constructor argument is the type of the map's keys and the third is the type of its values.
In processElement, the incoming element, i.e. the user's current temperature reading, is compared against the lowest and highest temperatures held in minMaxState, and minMaxState is updated accordingly.
import org.apache.flink.api.common.state.MapState;
import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

public class MapStateMinMaxTempKeyedProcessFunction
        extends KeyedProcessFunction<String, Tuple2<String, Integer>, Tuple3<String, Integer, Integer>> {

    // per-key state: "min" and "max" entries hold the user's lowest and highest temperature
    private MapState<String, Integer> minMaxState;

    private final String minKey = "min";
    private final String maxKey = "max";

    @Override
    public void open(Configuration parameters) throws Exception {
        minMaxState = getRuntimeContext().getMapState(
                new MapStateDescriptor<>("minMaxState",
                        String.class,
                        Integer.class));
    }

    @Override
    public void processElement(
            Tuple2<String, Integer> in,
            KeyedProcessFunction<String, Tuple2<String, Integer>, Tuple3<String, Integer, Integer>>.Context context,
            Collector<Tuple3<String, Integer, Integer>> collector) throws Exception {
        // initialize both entries with the first reading for this key
        if (!minMaxState.contains(minKey)) {
            minMaxState.put(minKey, in.f1);
        }
        if (!minMaxState.contains(maxKey)) {
            minMaxState.put(maxKey, in.f1);
        }
        if (in.f1 > minMaxState.get(maxKey)) {
            minMaxState.put(maxKey, in.f1);
        }
        if (in.f1 < minMaxState.get(minKey)) {
            minMaxState.put(minKey, in.f1);
        }
        collector.collect(Tuple3.of(in.f0, minMaxState.get(minKey), minMaxState.get(maxKey)));
    }
}
Code: MapStateMinMaxTempKeyedProcessFunction.java
Verification code
The verification code is as follows:
private static void mapStateMinMaxUserTemperature() throws Exception {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    env.setParallelism(1);
    DataStream<Tuple2<String, Integer>> inputStream = env.fromElements(
            new Tuple2<>("zhangsan", 36),
            new Tuple2<>("lisi", 37),
            new Tuple2<>("zhangsan", 35),
            new Tuple2<>("lisi", 38),
            new Tuple2<>("wangwu", 37)
    );
    inputStream.keyBy(new KeySelector<Tuple2<String, Integer>, String>() {
                @Override
                public String getKey(Tuple2<String, Integer> integerLongTuple2) throws Exception {
                    return integerLongTuple2.f0;
                }
            })
            .process(new MapStateMinMaxTempKeyedProcessFunction())
            .print();
    env.execute("mapStateMinMaxUserTemperature");
}
Code: mapStateMinMaxUserTemperature
The program output is:
(zhangsan,36,36)
(lisi,37,37)
(zhangsan,35,36)
(lisi,37,38)
(wangwu,37,37)
For each incoming user record, the current lowest and highest temperatures are computed and emitted correctly, meeting the expected requirement.
ReducingState
ReducingState<T> keeps a single value that represents the aggregation of all elements added so far; elements are added with add(T), and each add applies the ReduceFunction supplied when the state is created.
The aggregated value is retrieved with the T get() method.
Next, we use a concrete example to show how to use ReducingState<Integer> to track each user's highest body temperature so far.
Defining the ReduceFunction
Implement the reduce method so that it returns the larger of the two values:
new ReduceFunction<Integer>() {
    @Override
    public Integer reduce(Integer integer, Integer t1) throws Exception {
        return integer > t1 ? integer : t1;
    }
}
Implementing the KeyedProcessFunction
Declare a ReducingState<Integer> field named tempState to hold the user's highest temperature so far.
In open(Configuration parameters), tempState is initialized with getRuntimeContext().getReducingState, which takes a ReducingStateDescriptor; when constructing the ReducingStateDescriptor, the second argument is the ReduceFunction.
In processElement, the user's current temperature is added to tempState (which internally applies the ReduceFunction), then the user's highest temperature so far is read from tempState and emitted to the Collector<Tuple2<String, Integer>> together with the user name.
import org.apache.flink.api.common.functions.ReduceFunction;
import org.apache.flink.api.common.state.ReducingState;
import org.apache.flink.api.common.state.ReducingStateDescriptor;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

public class ReducingStateKeyedProcessFunction
        extends KeyedProcessFunction<String, Tuple2<String, Integer>, Tuple2<String, Integer>> {

    private ReducingState<Integer> tempState;

    @Override
    public void open(Configuration parameters) throws Exception {
        tempState = getRuntimeContext().getReducingState(new ReducingStateDescriptor<>("reduce",
                new ReduceFunction<Integer>() {
                    @Override
                    public Integer reduce(Integer integer, Integer t1) throws Exception {
                        return integer > t1 ? integer : t1;
                    }
                }, Integer.class));
    }

    @Override
    public void processElement(
            Tuple2<String, Integer> stringIntegerTuple2,
            KeyedProcessFunction<String, Tuple2<String, Integer>, Tuple2<String, Integer>>.Context context,
            Collector<Tuple2<String, Integer>> collector) throws Exception {
        tempState.add(stringIntegerTuple2.f1);
        collector.collect(Tuple2.of(stringIntegerTuple2.f0, tempState.get()));
    }
}
Code: ReducingStateKeyedProcessFunction.java
Verification code
private static void reducingStateUserTemperature() throws Exception {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    env.setParallelism(1);
    DataStream<Tuple2<String, Integer>> inputStream = env.fromElements(
            new Tuple2<>("zhangsan", 36),
            new Tuple2<>("lisi", 37),
            new Tuple2<>("zhangsan", 35),
            new Tuple2<>("lisi", 38),
            new Tuple2<>("wangwu", 37)
    );
    inputStream.keyBy(new KeySelector<Tuple2<String, Integer>, String>() {
                @Override
                public String getKey(Tuple2<String, Integer> integerLongTuple2) throws Exception {
                    return integerLongTuple2.f0;
                }
            })
            .process(new ReducingStateKeyedProcessFunction())
            .print();
    env.execute("reducingStateUserTemperature");
}
Code: reducingStateUserTemperature
The program output is as follows:
(zhangsan,36)
(lisi,37)
(zhangsan,36)
(lisi,38)
(wangwu,37)
This matches the expected result.
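Since ReduceFunction is a functional interface, the anonymous class above could, as an alternative sketch (behavior unchanged), be written with a method reference:
// Sketch: equivalent descriptor using Integer::max instead of the anonymous
// ReduceFunction; the state still keeps the maximum value seen so far.
tempState = getRuntimeContext().getReducingState(
        new ReducingStateDescriptor<>("reduce", Integer::max, Integer.class));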
AggregatingState
AggregatingState<IN, OUT> is similar to ReducingState, except that the aggregation logic is defined by an AggregateFunction, so the type of the added elements and the type of the aggregated result may differ; elements are added with add(IN) and the result is read with OUT get().
Next, we use a concrete example to show how to use AggregatingState to compute each user's average body temperature.
Defining the AggregateFunction
AggregateFunction<IN, ACC, OUT> has three type parameters:
IN: the type of the stream elements;
ACC: the type of the AggregateFunction's internal accumulator;
OUT: the type of the final aggregated result.
AggregateFunction requires the following methods to be implemented:
createAccumulator creates the accumulator
add adds a stream element to the accumulator
getResult computes the aggregated result from the accumulator
merge merges two accumulators (not needed in this example)
The code is as follows:
import org.apache.flink.api.common.functions.AggregateFunction;
import org.apache.flink.api.java.tuple.Tuple2;

public class AvgTempAggregateFunction
        implements AggregateFunction<Tuple2<String, Integer>, Tuple2<Integer, Integer>, Double> {

    @Override
    public Tuple2<Integer, Integer> createAccumulator() {
        // accumulator: (sum of temperatures, number of readings)
        return Tuple2.of(0, 0);
    }

    @Override
    public Tuple2<Integer, Integer> add(
            Tuple2<String, Integer> in, Tuple2<Integer, Integer> acc) {
        // in: stream element, acc: internal accumulator
        acc.f0 = acc.f0 + in.f1;
        acc.f1 = acc.f1 + 1;
        return acc;
    }

    @Override
    public Double getResult(Tuple2<Integer, Integer> acc) {
        return acc.f0.doubleValue() / acc.f1;
    }

    @Override
    public Tuple2<Integer, Integer> merge(
            Tuple2<Integer, Integer> a, Tuple2<Integer, Integer> b) {
        return new Tuple2<>(a.f0 + b.f0, a.f1 + b.f1);
    }
}
Code: AvgTempAggregateFunction.java
Implementing the KeyedProcessFunction
Declare an AggregatingState<Tuple2<String, Integer>, Double> field named avgState to hold the user's average temperature.
In open(Configuration parameters), avgState is initialized with getRuntimeContext().getAggregatingState, which takes an AggregatingStateDescriptor; when constructing the AggregatingStateDescriptor, the second argument is the AggregateFunction and the third is the TypeInformation of the AggregateFunction's accumulator.
In processElement, the user's current temperature record is added to avgState (which internally calls add), then the aggregated result for the user is read from avgState (which internally calls getResult) and emitted to the Collector<Tuple2<String, Double>> together with the user name.
import org.apache.flink.api.common.state.AggregatingState;
import org.apache.flink.api.common.state.AggregatingStateDescriptor;
import org.apache.flink.api.common.typeinfo.TypeHint;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

public class AggregatingStateKeyedProcessFunction
        extends KeyedProcessFunction<String, Tuple2<String, Integer>, Tuple2<String, Double>> {

    private AggregatingState<Tuple2<String, Integer>, Double> avgState;

    @Override
    public void open(Configuration parameters) throws Exception {
        avgState = getRuntimeContext().getAggregatingState(new AggregatingStateDescriptor<Tuple2<String, Integer>,
                Tuple2<Integer, Integer>, Double>("aggregating", new AvgTempAggregateFunction(),
                TypeInformation.of(new TypeHint<Tuple2<Integer, Integer>>() {})));
    }

    @Override
    public void processElement(
            Tuple2<String, Integer> in,
            KeyedProcessFunction<String, Tuple2<String, Integer>, Tuple2<String, Double>>.Context context,
            Collector<Tuple2<String, Double>> collector) throws Exception {
        avgState.add(in);
        collector.collect(Tuple2.of(in.f0, avgState.get().doubleValue()));
    }
}
Code: AggregatingStateKeyedProcessFunction.java
Verification code
private static void aggregatingStateAvgUserTemperature() throws Exception {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    env.setParallelism(1);
    DataStream<Tuple2<String, Integer>> inputStream = env.fromElements(
            new Tuple2<>("zhangsan", 36),
            new Tuple2<>("lisi", 37),
            new Tuple2<>("zhangsan", 35),
            new Tuple2<>("lisi", 38),
            new Tuple2<>("wangwu", 37)
    );
    inputStream.keyBy(new KeySelector<Tuple2<String, Integer>, String>() {
                @Override
                public String getKey(Tuple2<String, Integer> integerLongTuple2) throws Exception {
                    return integerLongTuple2.f0;
                }
            })
            .process(new AggregatingStateKeyedProcessFunction())
            .print();
    env.execute("aggregatingStateAvgUserTemperature");
}
Code: aggregatingStateAvgUserTemperature. The program output is as follows:
(zhangsan,36.0)
(lisi,37.0)
(zhangsan,35.5)
(lisi,37.5)
(wangwu,37.0)
For each stream element, the user name and the user's current average temperature are emitted, fulfilling the intended functionality.
Summary
Through typical usage scenarios, this article has demonstrated how the different State data structures are used in various computation settings, helping Flink developers become familiar with the State APIs and with how State stores and queries intermediate results to accomplish the computation.
References
《Flink原理、实战与性能优化》, section 5.1, "Stateful Computation"
《Flink内核原理与实现》, chapter 7, "State Principles"
Flink Tutorial (17): Keyed State management, an AggregatingState example for computing an average, https://blog.csdn.net/winterk...