Every data processing tool revolves around data transformation. In my experience, the most common transformation operations come down to just four kinds, the four this article walks through: transforming, grouping, aggregating, and windowing.
Which transformation operations does Flink provide? The ones shown below — map, flatMap, filter, and process. In Flink these are called operators.
Here is the code:
import org.apache.flink.api.common.functions.FilterFunction;
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.common.functions.RichFilterFunction;
import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.ProcessFunction;
import org.apache.flink.util.Collector;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
/**
* @className: TranslateDataShow
* @Description: demonstrates the map, flatMap, filter and process operators
* @Author: wangyifei
* @Date: 2023/2/20 15:37
*/
public class TranslateDataShow {
public static void main(String[] args) throws Exception {
// showMapUsage();
// showFlatMapUsage();
showProcessUsage();
}
private static void showProcessUsage() throws Exception {
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
env.setParallelism(1);
Collection<String> list = new ArrayList<>();
list.add("ok,1,2,3,4,5");
SingleOutputStreamOperator<String> src = env.fromCollection(list);
src.filter(new RichFilterFunction<String>() {
@Override
public boolean filter(String value) throws Exception {
return value.split(",")[0].equals("ok");
}
}).process(new ProcessFunction<String, String>() {
@Override
public void processElement(String s, ProcessFunction<String, String>.Context context, Collector<String> collector) throws Exception {
// simulate flatMap: split one incoming line into individual elements
Arrays.stream(s.split("\\s*,\\s*")).forEach(x -> {
    collector.collect(x);
});
// simulate filter: only forward lines whose first field is "ok"
// if (s.split("\\s*,\\s*")[0].equals("ok")) {
//     collector.collect(s);
// }
}
}).print();
env.execute();
}
private static void showFlatMapUsage() throws Exception {
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
env.setParallelism(1);
Collection<String> list = new ArrayList<>();
list.add("ok,1,2,3,4,5");
SingleOutputStreamOperator<String> src = env.fromCollection(list);
src.filter(new RichFilterFunction<String>() {
@Override
public boolean filter(String value) throws Exception {
return value.split(",")[0].equals("ok");
}
}).flatMap(new FlatMapFunction<String,String>(){
@Override
public void flatMap(String value, Collector<String> out) throws Exception {
for(String e:value.split(",")){
out.collect(e);
}
}
}).print("------");
env.execute("test-flatMap");
}
private static void showMapUsage() throws Exception {
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
env.setParallelism(1);
Collection<String> list = new ArrayList<>();
list.add("ok,1");
list.add("ok,2");
list.add("ok,3");
list.add("no,4");
list.add("ok,5");
SingleOutputStreamOperator<String> src = env.fromCollection(list);
src.filter(new FilterFunction<String>() {
@Override
public boolean filter(String value) throws Exception {
return value.split(",")[0].equals("ok");
}
}).map(new MapFunction<String, String>() {
@Override
public String map(String value) throws Exception {
return value.split(",")[1];
}
}).print("------");
env.execute("test-dateStream");
}
}
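As a sanity check: with parallelism 1, showProcessUsage should print the six comma-separated fields of "ok,1,2,3,4,5" one per line — the filter passes the record through, and process splits it into individual elements.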
Flink has exactly one grouping function: keyBy(KeySelector). Here is a word count example.
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(10);
env.setStreamTimeCharacteristic(TimeCharacteristic.ProcessingTime);
env.socketTextStream("127.0.0.1", 6666)
.flatMap(new RichFlatMapFunction<String, Tuple2<String, Integer>>() {
@Override
public void flatMap(String s, Collector<Tuple2<String, Integer>> collector) throws Exception {
collector.collect(new Tuple2<String, Integer>(s + "@" + (ThreadLocalRandom.current().nextInt(100)), 1));
}
})
.keyBy(new KeySelector<Tuple2<String, Integer>, String>() {
@Override
public String getKey(Tuple2<String, Integer> record) throws Exception {
return record.getField(0);
}
}).sum(1)
.print("----");
env.execute();
Briefly: socketTextStream() reads the words. Here nc plays the server — nc -l 127.0.0.1 -p 6666 — and the Flink job connects to it as a client. Type a word into nc and press Enter; socketTextStream() reads the word, and flatMap turns it into a Tuple2.
Merely gathering records with the same key together achieves nothing by itself; it becomes meaningful only once you aggregate the grouped data — the sum in the previous example. Besides sum, which other operators are there? They fall into two categories; a sketch of the first follows the list.
1. Operators that process each record as it arrives and can be chained directly onto a KeyedStream.
2. Operators that buffer records and process them batch by batch — the window operators.
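As a hedged orientation for the first category (the class name and sample data below are illustrative, not from the original):
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
public class KeyedAggregationsSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        KeyedStream<Tuple2<String, Integer>, String> keyed = env
                .fromElements(Tuple2.of("ok", 1), Tuple2.of("ok", 3), Tuple2.of("no", 2))
                .keyBy(t -> t.f0);
        // record-at-a-time aggregations; each emits an updated result per incoming record
        keyed.sum(1).print("sum");     // running sum of field 1
        keyed.minBy(1).print("minBy"); // the whole record holding the minimum of field 1
        keyed.maxBy(1).print("maxBy"); // the whole record holding the maximum of field 1
        // reduce(...) and process(...) cover custom logic; both are discussed later
        env.execute();
    }
}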
Let's start simple: the min function takes either a field index or a field name. Look at the example below.
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(6);
Collection<String> list = new ArrayList<>();
list.add("ll aa c d e hello work word");
list.add("void snapshotState(FunctionSnapshotContext context) void");
env.fromCollection(list)
.flatMap(new RichFlatMapFunction<String, Tuple2<String, Integer>>() {
@Override
public void flatMap(String s, Collector<Tuple2<String, Integer>> collector) throws Exception {
Arrays.stream(s.split("\\s+")).forEach(x -> {
collector.collect(new Tuple2<String, Integer>(x, 1));
});
}
}).keyBy(x->x.getField(0))
.sum(1)
.keyBy(x->0)
.max(1).print("-------");
// Change max to maxBy(1) and see how the output differs.
env.execute();
With max, the result does contain the maximum value, but the word field is not necessarily the word with the highest count (void here) — it can be any of the words. Only with maxBy is the whole record that owns the maximum value emitted, word included.
min and minBy differ in exactly the same way; a hedged illustration follows.
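A hedged sketch of the difference, assuming the running counts reach ("void", 6) while ("ll", 2) arrived first on the downstream key:
// after keyBy(x -> 0) funnels every count to one key:
//   max(1)   -> can emit ("ll", 6): field 1 carries the running maximum,
//               but field 0 keeps the value from an earlier record
//   maxBy(1) -> emits ("void", 6): the entire record owning the maximum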
Look at how the aggregate-style functions are implemented and you will find they are all built on reduce. So now let's talk about reduce.
reduce can be written in two ways: implement the ReduceFunction interface, or extend the RichReduceFunction abstract class. The advantage of RichReduceFunction is that it has an open method plus the getRuntimeContext().getXXState() methods, through which keyed state can be obtained — so keyed state becomes usable inside the reduce function, as the sketch below shows.
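A minimal, hedged sketch of that claim; the state name "reduce-calls" and the sample data are illustrative:
import org.apache.flink.api.common.functions.RichReduceFunction;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
public class RichReduceSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.fromElements(Tuple2.of("ok", 1), Tuple2.of("ok", 3), Tuple2.of("ok", 2))
            .keyBy(t -> t.f0)
            .reduce(new RichReduceFunction<Tuple2<String, Integer>>() {
                // keyed state: how many times reduce has fired for the current key
                private transient ValueState<Integer> calls;
                @Override
                public void open(Configuration parameters) throws Exception {
                    calls = getRuntimeContext().getState(
                            new ValueStateDescriptor<>("reduce-calls", Integer.class));
                }
                @Override
                public Tuple2<String, Integer> reduce(Tuple2<String, Integer> agg,
                        Tuple2<String, Integer> in) throws Exception {
                    Integer n = calls.value();
                    calls.update(n == null ? 1 : n + 1);
                    return agg.f1 >= in.f1 ? agg : in; // keep the record with the larger count
                }
            })
            .print("rich-reduce");
        env.execute();
    }
}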
ReduceFunction requires one method, reduce(T agg, T input), where agg is the accumulated value. Flink stores this accumulated value for you and passes the latest one in on every call; input is the next record. With these two values we can reproduce sum, max, maxBy, min and minBy. Below, reduce simulates max:
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
DataStreamSource<String> source = env.socketTextStream("127.0.0.1", 6666);
source.map(new MapFunction<String,Tuple2<String,Integer>>(){
@Override
public Tuple2<String,Integer> map(String s) throws Exception {
String[] split = s.split(",");
return new Tuple2<String,Integer>(split[0] , Integer.parseInt(split[1]));
}
}).keyBy(new KeySelector<Tuple2<String, Integer>, String>() {
@Override
public String getKey(Tuple2<String, Integer> rd) throws Exception {
return rd.f0;
}
}).reduce(new ReduceFunction<Tuple2<String, Integer>>() {
@Override
public Tuple2<String, Integer> reduce(Tuple2<String, Integer> agg, Tuple2<String, Integer> rd) throws Exception {
return (agg.f1 > rd.f1? agg : rd);
}
}).print("-------");
env.execute();
agg always holds the current maximum for the key. Whenever input.f1 > agg.f1, input is returned, and Flink saves it to pass in as agg on the next call to reduce.
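A hedged trace for one key, with illustrative inputs ("a",1), ("a",3), ("a",2):
// input ("a",1) -> the first record of a key passes through as-is: emits ("a",1)
// input ("a",3) -> reduce(("a",1), ("a",3)) returns ("a",3): emits ("a",3)
// input ("a",2) -> reduce(("a",3), ("a",2)) returns ("a",3): emits ("a",3)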
Next, the process function — an all-rounder: it can reproduce both the aggregate-style functions and reduce. Its usage was already shown in the section above, so it is not repeated here.
Now for the window functions, which are functions that buffer data. Once a window operator is attached to a KeyedStream, the KeyedStream becomes a WindowedStream. There are several kinds of windows, sketched below:
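As a hedged orientation (keyed stands for any KeyedStream; the assigner classes live in org.apache.flink.streaming.api.windowing.assigners, Time in org.apache.flink.streaming.api.windowing.time):
keyed.countWindow(4L);                                                             // tumbling count window: fires every 4 elements per key
keyed.countWindow(4L, 2L);                                                         // sliding count window: size 4, slide 2
keyed.window(TumblingProcessingTimeWindows.of(Time.seconds(10)));                  // tumbling processing-time window
keyed.window(SlidingProcessingTimeWindows.of(Time.seconds(10), Time.seconds(5)));  // sliding processing-time window
keyed.window(ProcessingTimeSessionWindows.withGap(Time.seconds(30)));              // session window with a 30 s gap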
Let's set the windows themselves aside for now and first see which operators can follow a WindowedStream.
Start with process. Chained after a KeyedStream it receives records one at a time; chained after a WindowedStream it must extend ProcessWindowFunction, whose hallmark is an Iterable parameter — it receives all the data buffered in the window. Inside it, getRuntimeContext().getXXState() is also available, so Flink state can be used in the operator.
Here is an example:
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(10);
Collection<String> list = new ArrayList<>();
list.add("void hello word word void void void void");
list.add("void snapshotState(FunctionSnapshotContext context) void void");
env.fromCollection(list)
.flatMap(new RichFlatMapFunction<String, Tuple2<String, Integer>>() {
@Override
public void flatMap(String s, Collector<Tuple2<String, Integer>> collector) throws Exception {
Arrays.stream(s.split("\\s+")).forEach(x -> {
collector.collect(new Tuple2<String, Integer>(x, 1));
});
}
}).keyBy(new KeySelector<Tuple2<String, Integer>, String>() {
@Override
public String getKey(Tuple2<String, Integer> record) throws Exception {
return record.getField(0);
}
})
.countWindow(4L)
.process(new ProcessWindowFunction<Tuple2<String, Integer>, Tuple2<String, Integer>, String, GlobalWindow>() {
private MapState<String,Integer> ms = null ;
@Override
public void open(Configuration parameters) throws Exception {
super.open(parameters);
MapStateDescriptor<String,Integer> descriptor = new MapStateDescriptor("wc-ms"
, TypeInformation.of(String.class)
, TypeInformation.of(Integer.class)
);
ms = getRuntimeContext().getMapState(descriptor);
}
@Override
public void process(String key, ProcessWindowFunction<Tuple2<String, Integer>, Tuple2<String, Integer>, String, GlobalWindow>.Context context
, Iterable<Tuple2<String, Integer>> iterable
, Collector<Tuple2<String, Integer>> collector) throws Exception {
iterable.forEach(x->{
try {
Integer cnt = ms.get(x.getField(0));
if(Objects.isNull(cnt)){
cnt = 0 ;
}else{
cnt = cnt + 1 ;
}
ms.put(x.getField(0),x.getField(1));
collector.collect(new Tuple2<>(x.getField(0) , cnt));
} catch (Exception e) {
e.printStackTrace();
}
});
}
}).print("-------");
env.execute();
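Note that because ms is keyed state obtained from the runtime context, it outlives any single window: the count emitted for a word keeps growing across that word's successive count windows.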
The apply operator is similar to process but a little thinner in features. It can either implement the WindowFunction interface or extend RichWindowFunction. The context of a ProcessWindowFunction can reach all kinds of state, whereas a WindowFunction only receives window-related data (such as the window's start and end times); a RichWindowFunction can, however, still reach keyed state through getRuntimeContext(), which is exactly what the example below does.
Here is an example:
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(10);
Collection<String> list = new ArrayList<>();
list.add("void hello word word void void void void");
list.add("void snapshotState(FunctionSnapshotContext context) void void");
env.fromCollection(list)
.flatMap(new RichFlatMapFunction<String, Tuple2<String, Integer>>() {
@Override
public void flatMap(String s, Collector<Tuple2<String, Integer>> collector) throws Exception {
Arrays.stream(s.split("\\s+")).forEach(x -> {
collector.collect(new Tuple2<String, Integer>(x, 1));
});
}
}).keyBy(new KeySelector<Tuple2<String, Integer>, String>() {
@Override
public String getKey(Tuple2<String, Integer> record) throws Exception {
return record.getField(0);
}
})
.countWindow(4L)
.apply(new RichWindowFunction<Tuple2<String, Integer>, Tuple2<String,Integer> , String, GlobalWindow>() {
@Override
public void apply(String s, GlobalWindow globalWindow, Iterable<Tuple2<String, Integer>> iterable, Collector<Tuple2<String, Integer>> collector) throws Exception {
iterable.forEach(x->{
try {
Integer cnt = wc.get(x.getField(0));
cnt = (Objects.isNull(cnt)? 1 : cnt + 1);
wc.put(x.getField(0) , cnt);
} catch (Exception e) {
e.printStackTrace();
}
});
wc.entries().forEach(new Consumer<Map.Entry<String, Integer>>() {
@Override
public void accept(Map.Entry<String, Integer> entry) {
collector.collect(new Tuple2<String,Integer>(entry.getKey() , entry.getValue()) );
}
});
}
private transient MapState<String,Integer> wc = null ;
@Override
public void open(Configuration parameters) throws Exception {
super.open(parameters);
MapStateDescriptor<String,Integer> descriptor
= new MapStateDescriptor<String, Integer>("wc"
,TypeInformation.of(String.class)
,TypeInformation.of(Integer.class)
);
wc = getRuntimeContext().getMapState(descriptor);
}
}
)
.print("------");
env.execute();
The aggregate-style operators (sum, min, minBy, max, maxBy) work the same after a WindowedStream as after a KeyedStream, except that they compute over the data of a single window only. For example, after a WindowedStream, sum computes the sum within the window, whereas on a bare KeyedStream it is the running sum of all data seen for the key. Here is a code example; note the countWindow(2) before min, which scopes the min to each window of two incoming totals.
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
Collection<String> list = new ArrayList<>();
list.add("ll aa c d e hello work word");
list.add("void snapshotState(FunctionSnapshotContext context) void");
env.fromCollection(list)
.flatMap(new RichFlatMapFunction<String, Tuple2<String, Integer>>() {
@Override
public void flatMap(String s, Collector<Tuple2<String, Integer>> collector) throws Exception {
Arrays.stream(s.split("\\s+")).forEach(x -> {
collector.collect(new Tuple2<String, Integer>(x, 1));
});
}
}).keyBy(x->x.getField(0))
.sum(1)
.keyBy(x->0)
// 将 min 换成 max,则可以计算出来最大的 word count 值
.min(1)
.print("-------");
env.execute();
There is also the aggregate() function. It has two forms: aggregate(AggregateFunction, WindowFunction|RichWindowFunction) and aggregate(AggregateFunction). (Like window reduce, discussed below, the AggregateFunction itself cannot be a Rich function.)
First, how does aggregate execute? createAccumulator, add, merge, and getResult are the methods an AggregateFunction must implement: createAccumulator initializes the accumulator; add folds one record into the accumulated value; merge combines several partial accumulators; and getResult finally returns the merged result.
The result of the AggregateFunction is ultimately handed to the WindowFunction|RichWindowFunction, which performs the final processing of the window's output. Even if you call the aggregate(AggregateFunction) form, a WindowFunction is still present — the default one, PassThroughWindowFunction.
Here is an example:
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(10);
Collection<String> list = new ArrayList<>();
list.add("void hello word word void void void void");
list.add("void snapshotState(FunctionSnapshotContext context) void void");
env.fromCollection(list)
.flatMap(new RichFlatMapFunction<String, Tuple2<String, Integer>>() {
@Override
public void flatMap(String s, Collector<Tuple2<String, Integer>> collector) throws Exception {
Arrays.stream(s.split("\\s+")).forEach(x -> {
collector.collect(new Tuple2<String, Integer>(x, 1));
});
}
}).keyBy(new KeySelector<Tuple2<String, Integer>, String>() {
@Override
public String getKey(Tuple2<String, Integer> record) throws Exception {
return record.getField(0);
}
})
.countWindow(4L)
.aggregate(new AggregateFunction<Tuple2<String, Integer>, Tuple2<String, Integer>, Tuple2<String, Integer>>() {
@Override
public Tuple2<String, Integer> createAccumulator() {
return null;
}
@Override
public Tuple2<String, Integer> add(Tuple2<String, Integer> in, Tuple2<String, Integer> acc) {
Tuple2<String, Integer> rs = null;
if (Objects.isNull(acc)) {
rs = in;
} else {
Integer field = (Integer) in.getField(1);
Integer field1 = (Integer) acc.getField(1);
rs = new Tuple2<String, Integer>(in.getField(0), field1 + field);
}
return rs;
}
@Override
public Tuple2<String, Integer> getResult(Tuple2<String, Integer> rs) {
return rs;
}
@Override
public Tuple2<String, Integer> merge(Tuple2<String, Integer> k1, Tuple2<String, Integer> k2) {
    // null-safe merge: either partial accumulator may still be the initial null
    if (Objects.isNull(k1)) {
        return k2;
    }
    if (Objects.isNull(k2)) {
        return k1;
    }
    return new Tuple2<>(k1.getField(0), (Integer) k1.getField(1) + (Integer) k2.getField(1));
}
}, new WindowFunction<Tuple2<String, Integer>, String, String, GlobalWindow>() {
@Override
public void apply(String s, GlobalWindow globalWindow, Iterable<Tuple2<String, Integer>> iterable, Collector<String> collector) throws Exception {
iterable.forEach(x->{
collector.collect(s + " : " + x.getField(1));
});
}
}).print("-----");
env.execute();
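For completeness, a hedged sketch of the single-argument form of the same call, with aggFn standing in for the AggregateFunction above; the default PassThroughWindowFunction simply forwards the aggregate's result downstream:
.countWindow(4L)
.aggregate(aggFn) // the window's aggregated Tuple2<String, Integer> is emitted as-is
.print("-----");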
Finally, reduce. Chained after a WindowedStream it has the same logic as after a KeyedStream, but it computes only over the data of one window rather than all the data of the key. Window reduce cannot take a RichReduceFunction, only a ReduceFunction. Besides reduce(ReduceFunction) there is an overload, reduce(ReduceFunction, WindowFunction), analogous to aggregate. Here is an example.
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(10);
Collection<String> list = new ArrayList<>();
list.add("void hello word word void void void void");
list.add("void snapshotState(FunctionSnapshotContext context) void void");
env.fromCollection(list)
.flatMap(new RichFlatMapFunction<String, Tuple2<String, Integer>>() {
@Override
public void flatMap(String s, Collector<Tuple2<String, Integer>> collector) throws Exception {
Arrays.stream(s.split("\\s+")).forEach(x -> {
collector.collect(new Tuple2<String, Integer>(x, 1));
});
}
}).keyBy(new KeySelector<Tuple2<String, Integer>, String>() {
@Override
public String getKey(Tuple2<String, Integer> record) throws Exception {
return record.getField(0);
}
})
.countWindow(4L).reduce(new ReduceFunction<Tuple2<String, Integer>>() {
@Override
public Tuple2<String, Integer> reduce(Tuple2<String, Integer> in, Tuple2<String, Integer> acc) throws Exception {
if (Objects.isNull(acc)) {
return in;
} else {
Integer sum = (Integer) in.getField(1) + (Integer) acc.getField(1);
return new Tuple2<>(in.getField(0), sum);
}
}
}, new WindowFunction<Tuple2<String, Integer>, Tuple2<String,Integer>, String, GlobalWindow>() {
@Override
public void apply(String s, GlobalWindow globalWindow, Iterable<Tuple2<String, Integer>> iterable, Collector<Tuple2<String, Integer>> collector) throws Exception {
for (Tuple2<String, Integer> stringIntegerTuple2 : iterable) {
collector.collect(stringIntegerTuple2);
}
}
}).print("------------");
env.execute();
About the execution order of reduce(ReduceFunction, WindowFunction): the reduce method pre-aggregates records incrementally as they arrive, and when the window fires, the single reduced result is handed to the Iterable of WindowFunction#apply. That is why, in the example, apply only ever sees the outcome of reduce.
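A hedged trace, assuming countWindow(4) and four records v1..v4 of key "void" (each ("void", 1)):
// on arrival: reduce(v1, v2) = r2, reduce(r2, v3) = r3, reduce(r3, v4) = r4
// the window fires at the fourth element; apply() then receives an Iterable
// holding only r4, the single pre-aggregated result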