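DataStream is the basic data model for real-time stream processing in Flink, and DataSet is the data model for Flink batch processing. This article focuses on DataStream. In Flink's real-time stream processing, the stream objects are all built around the DataStream class: some extend it directly (for example KeyedStream), while others (for example WindowedStream and ConnectedStreams) wrap one or more DataStreams. During actual transformations (operators), a DataStream is also turned into the five stream objects covered below. Besides the methods they share, each of these stream objects has methods of its own. The rest of this article walks through how to use the APIs of DataStream and its subclasses one by one.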
// Map: takes one element and produces one element (here: doubles each value)
dataStream.map(new MapFunction<Integer, Integer>() {
    @Override
    public Integer map(Integer value) throws Exception {
        return 2 * value;
    }
});
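// FlatMap: takes one element and produces zero, one, or more elements (here: splits sentences into words)
dataStream.flatMap(new FlatMapFunction<String, String>() {
    @Override
    public void flatMap(String value, Collector<String> out)
            throws Exception {
        for (String word : value.split(" ")) {
            out.collect(word);
        }
    }
});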
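// Filter: keeps only the elements for which the function returns true (here: drops zeros)
dataStream.filter(new FilterFunction<Integer>() {
    @Override
    public boolean filter(Integer value) throws Exception {
        return value != 0;
    }
});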
dataStream.keyBy("someKey") // Key by field "someKey"
dataStream.keyBy(0) // Key by the first element of a Tuple
dataStream.windowAll(TumblingEventTimeWindows.of(Time.seconds(5))); // Last 5 seconds of data
dataStream.union(otherStream1, otherStream2, ...);
dataStream.join(otherStream)
    .where(<key selector>).equalTo(<key selector>)
.window(TumblingEventTimeWindows.of(Time.seconds(3)))
.apply (new JoinFunction () {...});
dataStream.coGroup(otherStream)
.where(0).equalTo(1)
.window(TumblingEventTimeWindows.of(Time.seconds(3)))
.apply (new CoGroupFunction () {...});
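Transformation: Flink creates a TaggedUnion type and a unionKeySelector, which hold the element types of the two streams and the KeySelectors of the two streams respectively. Each stream is mapped to a stream of TaggedUnion elements (see StreamMap for details on map), the two are unioned (see Union), and the merged stream is keyed with the unionKeySelector to produce a KeyedStream (see KeyBy). Finally, the KeyedStream's window method is called with the WindowAssigner to produce a WindowedStream, and the CoGroupFunction is applied to it (see the WindowedStream apply method). Overall, Flink performs many internal conversions for this method and ends up with two StreamMapTransformations, one PartitionTransformation, and one OneInputTransformation that contains a WindowOperator.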
CoGroupTransformation
Runtime: see the runtime behavior of each corresponding Transformation.
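The chain described above (tag, union, key, group, then hand both sides to the co-group function) can be sketched in plain Java. This is only an illustration under simplifying assumptions: it uses no Flink classes, it treats the whole input as a single window, and the names `CoGroupSketch`, `TaggedUnion`, `Group`, and `coGroup` are hypothetical stand-ins rather than Flink's internal implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Plain-Java illustration only; none of these types are Flink classes.
final class CoGroupSketch {

    // Wraps an element from either input and remembers which input it came from.
    static final class TaggedUnion<A, B> {
        final A one;          // set when the element came from the first input
        final B two;          // set when the element came from the second input
        final boolean isOne;

        private TaggedUnion(A one, B two, boolean isOne) {
            this.one = one;
            this.two = two;
            this.isOne = isOne;
        }

        static <A, B> TaggedUnion<A, B> one(A a) { return new TaggedUnion<>(a, null, true); }
        static <A, B> TaggedUnion<A, B> two(B b) { return new TaggedUnion<>(null, b, false); }
    }

    // Both sides of one key group, as a co-group function would receive them.
    static final class Group<A, B> {
        final List<A> left = new ArrayList<>();
        final List<B> right = new ArrayList<>();
    }

    static <A, B, K> Map<K, Group<A, B>> coGroup(
            List<A> first, Function<A, K> key1,
            List<B> second, Function<B, K> key2) {
        // Map both inputs to the common TaggedUnion type and union them.
        List<TaggedUnion<A, B>> unioned = new ArrayList<>();
        for (A a : first) unioned.add(TaggedUnion.one(a));
        for (B b : second) unioned.add(TaggedUnion.two(b));

        // The union key selector dispatches to the key selector of the originating input.
        Function<TaggedUnion<A, B>, K> unionKey =
                t -> t.isOne ? key1.apply(t.one) : key2.apply(t.two);

        // Key the merged stream and split each group back into its two sides.
        Map<K, Group<A, B>> groups = new LinkedHashMap<>();
        for (TaggedUnion<A, B> t : unioned) {
            Group<A, B> g = groups.computeIfAbsent(unionKey.apply(t), k -> new Group<>());
            if (t.isOne) g.left.add(t.one); else g.right.add(t.two);
        }
        return groups;
    }

    public static void main(String[] args) {
        Map<Integer, Group<String, String>> groups = coGroup(
                Arrays.asList("1:a", "2:b"), s -> Integer.parseInt(s.split(":")[0]),
                Arrays.asList("1:x", "1:y"), s -> Integer.parseInt(s.split(":")[0]));
        groups.forEach((k, g) -> System.out.println(k + " -> " + g.left + " | " + g.right));
    }
}
```

Note that a key with elements on only one side (key 2 here) still forms a group with an empty list on the other side, which is what distinguishes coGroup from an inner join.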
DataStream<Integer> someStream = //...
DataStream<String> otherStream = //...
ConnectedStreams<Integer, String> connectedStreams = someStream.connect(otherStream);
SplitStream<Integer> split = someDataStream.split(new OutputSelector<Integer>() {
    @Override
    public Iterable<String> select(Integer value) {
        // Tag each element with the name of the output it belongs to
        List<String> output = new ArrayList<String>();
        if (value % 2 == 0) {
            output.add("even");
        } else {
            output.add("odd");
        }
        return output;
    }
});
IterativeStream<Long> iteration = initialStream.iterate();
DataStream<Long> iterationBody = iteration.map (/*do something*/);
// Elements greater than 0 are fed back into the iteration
DataStream<Long> feedback = iterationBody.filter(new FilterFunction<Long>() {
    @Override
    public boolean filter(Long value) throws Exception {
        return value > 0;
    }
});
iteration.closeWith(feedback);
// The remaining elements are forwarded downstream
DataStream<Long> output = iterationBody.filter(new FilterFunction<Long>() {
    @Override
    public boolean filter(Long value) throws Exception {
        return value <= 0;
    }
});
stream.assignTimestamps (new TimeStampExtractor() {...});
DataStream<Tuple3<Integer, Double, String>> in = // [...]
DataStream<Tuple2<String, Integer>> out = in.project(2,0);
dataStream.partitionCustom(partitioner, "someKey");
dataStream.partitionCustom(partitioner, 0);
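Sends each element to a specific subtask using a user-defined Partitioner.
Transformation: partitionCustom is similar to KeyBy, except that the partitioner is user-defined and the result is not a KeyedStream. A CustomPartitionerWrapper (a StreamPartitioner) is first created from the KeySelector and the user-implemented Partitioner, and is then injected into a PartitionTransformation.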
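To make the wrapping step concrete, here is a minimal plain-Java sketch of how a key selector and a user partitioner might be combined to pick a target channel. It uses no Flink classes: `KeyPartitioner` only mirrors the shape of a partitioner (key plus partition count in, index out), and `PartitionerWrapper` and `selectChannel` are hypothetical names chosen for illustration, not Flink's actual wrapper.

```java
import java.util.function.Function;

// Plain-Java illustration only; none of these types are Flink classes.
final class CustomPartitionSketch {

    // Hypothetical stand-in for a user partitioner: maps a key to a partition index.
    interface KeyPartitioner<K> {
        int partition(K key, int numPartitions);
    }

    // Hypothetical wrapper: extracts the key from a record, then asks the partitioner for the channel.
    static final class PartitionerWrapper<T, K> {
        private final KeyPartitioner<K> partitioner;
        private final Function<T, K> keySelector;

        PartitionerWrapper(KeyPartitioner<K> partitioner, Function<T, K> keySelector) {
            this.partitioner = partitioner;
            this.keySelector = keySelector;
        }

        int selectChannel(T record, int numberOfChannels) {
            K key = keySelector.apply(record);
            return partitioner.partition(key, numberOfChannels);
        }
    }

    public static void main(String[] args) {
        // Assumption for the example: records look like "user:action" and are routed by user name length.
        KeyPartitioner<String> byLength = (key, n) -> key.length() % n;
        PartitionerWrapper<String, String> wrapper =
                new PartitionerWrapper<>(byLength, record -> record.split(":")[0]);

        for (String record : new String[] {"ab:click", "abcde:view"}) {
            System.out.println(record + " -> channel " + wrapper.selectChannel(record, 4));
        }
    }
}
```

Unlike keyBy, nothing here tracks per-key state: the wrapper only decides where a record goes, which matches the point that the result is not a KeyedStream.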
dataStream.shuffle();
dataStream.rebalance();
dataStream.rescale();
dataStream.broadcast();
keyedStream.reduce(new ReduceFunction<Integer>() {
    @Override
    public Integer reduce(Integer value1, Integer value2)
            throws Exception {
        return value1 + value2;
    }
});
Merges each element with the result of the previous reduce according to the ReduceFunction, and emits the merged result.
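The rolling behavior is easier to see on a small input. The following plain-Java sketch (no Flink classes; `RollingReduceSketch` and `rollingReduce` are hypothetical names) keeps one reduced value per key and emits the updated value for every incoming element, which is one way to picture what a keyed reduce produces.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BinaryOperator;
import java.util.function.Function;

// Plain-Java illustration only; this is not how Flink stores keyed state.
final class RollingReduceSketch {

    static <T, K> List<T> rollingReduce(
            List<T> input, Function<T, K> keySelector, BinaryOperator<T> reducer) {
        Map<K, T> state = new HashMap<>();   // last reduced value per key
        List<T> emitted = new ArrayList<>();
        for (T value : input) {
            K key = keySelector.apply(value);
            T previous = state.get(key);
            // The first element of a key is emitted as is; later ones are merged with the previous result.
            T current = (previous == null) ? value : reducer.apply(previous, value);
            state.put(key, current);
            emitted.add(current);
        }
        return emitted;
    }

    public static void main(String[] args) {
        // Key by parity and sum: odd values and even values are reduced independently.
        List<Integer> out = rollingReduce(
                Arrays.asList(1, 2, 3, 4, 5), v -> v % 2, Integer::sum);
        System.out.println(out); // prints [1, 2, 4, 6, 9]
    }
}
```

Every input element yields one output element, so downstream operators see the running result per key rather than a single final value.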
DataStream<String> result =
    keyedStream.fold("start", new FoldFunction<Integer, String>() {
        @Override
        public String fold(String current, Integer value) {
            // Appends each value to the accumulated string, starting from "start"
            return current + "-" + value;
        }
    });
keyedStream.sum(0);
keyedStream.sum("key");
keyedStream.min(0);
keyedStream.min("key");
keyedStream.max(0);
keyedStream.max("key");
keyedStream.minBy(0);
keyedStream.minBy("key");
keyedStream.maxBy(0);
keyedStream.maxBy("key");
dataStream.window(TumblingEventTimeWindows.of(Time.seconds(5))); // Last 5 seconds of data
Interval Join
// this will join the two streams so that
// key1 == key2 && leftTs - 2 < rightTs < leftTs + 2
keyedStream.intervalJoin(otherKeyedStream)
.between(Time.milliseconds(-2), Time.milliseconds(2)) // lower and upper bound
.upperBoundExclusive(true) // optional
.lowerBoundExclusive(true) // optional
.process(new ProcessJoinFunction() {...}); // process() takes a ProcessJoinFunction
windowedStream.apply(new WindowFunction<Tuple2<String, Integer>, Integer, Tuple, Window>() {
    @Override
    public void apply(Tuple tuple,
            Window window,
            Iterable<Tuple2<String, Integer>> values,
            Collector<Integer> out) throws Exception {
        // Sum the second field of every element in the window
        int sum = 0;
        for (Tuple2<String, Integer> t : values) {
            sum += t.f1;
        }
        out.collect(sum);
    }
});
windowedStream.reduce(new ReduceFunction<Tuple2<String, Integer>>() {
    @Override
    public Tuple2<String, Integer> reduce(Tuple2<String, Integer> value1, Tuple2<String, Integer> value2) throws Exception {
        return new Tuple2<>(value1.f0, value1.f1 + value2.f1);
    }
});
windowedStream.sum(0);
windowedStream.sum("key");
windowedStream.min(0);
windowedStream.min("key");
windowedStream.max(0);
windowedStream.max("key");
windowedStream.minBy(0);
windowedStream.minBy("key");
windowedStream.maxBy(0);
windowedStream.maxBy("key");
// applying an AllWindowFunction on non-keyed window stream
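allWindowedStream.apply(new AllWindowFunction<Tuple2<String, Integer>, Integer, Window>() {
    @Override
    public void apply(Window window,
            Iterable<Tuple2<String, Integer>> values,
            Collector<Integer> out) throws Exception {
        // Sum the second field of every element in the window
        int sum = 0;
        for (Tuple2<String, Integer> t : values) {
            sum += t.f1;
        }
        out.collect(sum);
    }
});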
Transformation: AllWindowedStream.apply() is essentially the same as WindowedStream.apply(), except that there is no KeySelector.
Runtime: same as WindowedStream.apply().
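connectedStreams.map(new CoMapFunction<Integer, String, Boolean>() {
    // Called for elements of the first connected stream
    @Override
    public Boolean map1(Integer value) {
        return true;
    }

    // Called for elements of the second connected stream
    @Override
    public Boolean map2(String value) {
        return false;
    }
});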
connectedStreams.flatMap(new CoFlatMapFunction<Integer, String, String>() {
    @Override
    public void flatMap1(Integer value, Collector<String> out) {
        out.collect(value.toString());
    }

    @Override
    public void flatMap2(String value, Collector<String> out) {
        for (String word : value.split(" ")) {
            out.collect(word);
        }
    }
});
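Transformation: ConnectedStreams does not produce a Transformation of its own. It only keeps the two input DataStreams, obtains the parent Transformations from those DataStreams, and creates a CoStream(Flat)Map operator. The KeySelectors are injected from the parent Transformation (if it is a PartitionTransformation).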
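As a rough mental model of a two-input map operator, the plain-Java sketch below feeds two differently typed inputs through one function object with two methods and collects a single output type. It uses no Flink classes: `TwoInputMapper` and `run` are hypothetical names, and the strict alternation between inputs is an assumption made only to keep the example deterministic (in a real job the arrival order of the two inputs is not fixed).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java illustration only; none of these types are Flink classes.
final class CoMapSketch {

    // Hypothetical stand-in with the same two-method shape as a co-map function.
    interface TwoInputMapper<A, B, R> {
        R map1(A value);   // handles elements of the first input
        R map2(B value);   // handles elements of the second input
    }

    // Alternates between the two inputs and routes each element to the matching method.
    static <A, B, R> List<R> run(List<A> first, List<B> second, TwoInputMapper<A, B, R> mapper) {
        List<R> out = new ArrayList<>();
        int i = 0;
        int j = 0;
        while (i < first.size() || j < second.size()) {
            if (i < first.size()) out.add(mapper.map1(first.get(i++)));
            if (j < second.size()) out.add(mapper.map2(second.get(j++)));
        }
        return out;
    }

    public static void main(String[] args) {
        List<Boolean> out = run(
                Arrays.asList(1, 2),
                Arrays.asList("a"),
                new TwoInputMapper<Integer, String, Boolean>() {
                    @Override
                    public Boolean map1(Integer value) { return true; }

                    @Override
                    public Boolean map2(String value) { return false; }
                });
        System.out.println(out); // prints [true, false, true]
    }
}
```

The two inputs keep their own element types all the way into the function, which is the main difference from union, where both streams must already share one type.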
SplitStream<Integer> split;
DataStream<Integer> even = split.select("even");
DataStream<Integer> odd = split.select("odd");
DataStream<Integer> all = split.select("even","odd");