Contents
3.1 Map
3.2 FlatMap
3.3 Filter
3.4 KeyBy
3.5 Reduce
3.6 Fold
3.7 Aggregations
3.8 Window
3.9 WindowAll
4.0 Aggregations on windows
4.1 Union
4.2 Split
4.3 Select
3.1 Map
DataStream → DataStream
A one-to-one transformation: each input record produces exactly one output record.
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class TransformationsMap {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        DataStream<String> dsSocket = env.socketTextStream("192.168.23.210", 9000);
        // Anonymous-class style
        DataStream<Tuple2<String, Integer>> dsMap1 = dsSocket.map(new MapFunction<String, Tuple2<String, Integer>>() {
            @Override
            public Tuple2<String, Integer> map(String value) throws Exception {
                return Tuple2.of(value, 1);
            }
        });
        //dsMap1.print();
        // Lambda style; .returns(...) is needed because type erasure hides the Tuple2 type
        DataStream<Tuple2<String, Integer>> dsMap2 = dsSocket
                .map(value -> Tuple2.of(value, 1))
                .returns(Types.TUPLE(Types.STRING, Types.INT));
        dsMap2.print();
        env.execute("TransformationsMap");
    }
}
3.2 FlatMap
DataStream → DataStream
One input record produces zero or more output records.
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;

public class TransformationsFlatmap {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        DataStream<String> dsSocket = env.socketTextStream("192.168.23.210", 9000);
        // Anonymous-class style
        // Example input line: spark,hive,hbase
        DataStream<Tuple2<String, Integer>> ds1 = dsSocket.flatMap(new FlatMapFunction<String, Tuple2<String, Integer>>() {
            @Override
            public void flatMap(String value, Collector<Tuple2<String, Integer>> collector) throws Exception {
                String[] words = value.split(",");
                for (String word : words) {
                    collector.collect(Tuple2.of(word, 1));
                }
            }
        });
        ds1.print();
        env.execute("TransformationsFlatmap");
    }
}
3.3 Filter
DataStream → DataStream
Evaluates a boolean function for each element and keeps only the elements for which the function returns true.
import org.apache.flink.api.common.functions.FilterFunction;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class TransformationsFilter {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        DataStream<String> dsSocket = env.socketTextStream("192.168.23.210", 9000);
        // Anonymous-class style
        DataStream<String> ds1 = dsSocket.filter(new FilterFunction<String>() {
            @Override
            public boolean filter(String value) throws Exception {
                return value.equalsIgnoreCase("spark");
            }
        });
        ds1.print();
        env.execute("TransformationsFilter");
    }
}
3.4 KeyBy
DataStream → KeyedStream
Logically partitions a stream into disjoint partitions; all records with the same key are assigned to the same partition. Internally, keyBy is implemented with hash partitioning.
Note: to key by a POJO field (as with sum("sum") below), the class needs a public no-argument constructor and getters/setters for that field.
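The hash-partitioning idea can be sketched without Flink. The helper below is hypothetical and only a conceptual model: Flink actually murmur-hashes keys into key groups before mapping them to operator instances, but the invariant it illustrates is the same, namely that equal keys always land in the same partition.

```java
import java.util.Arrays;
import java.util.List;

public class KeyBySketch {
    // Conceptual model only: map a key to one of `parallelism` partitions
    // by hashing. floorMod keeps the result non-negative.
    static int partitionFor(String key, int parallelism) {
        return Math.floorMod(key.hashCode(), parallelism);
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("a", "b", "a", "c", "a");
        for (String k : keys) {
            // every occurrence of "a" prints the same partition index
            System.out.println(k + " -> partition " + partitionFor(k, 4));
        }
    }
}
```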
public class Person {
private String id;
private int sum;
public Person(){}
public Person(String id, int sum) {
this.id = id;
this.sum = sum;
}
public String getId() {
return id;
}
public void setId(String id) {
this.id = id;
}
public int getSum() {
return sum;
}
public void setSum(int sum) {
this.sum = sum;
}
}
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import java.util.Arrays;

public class TransformationsKeyBy {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        DataStream<Person> dsCollection = env.fromCollection(Arrays.asList(
                new Person("a", 1),
                new Person("a", 2),
                new Person("b", 1),
                new Person("a", 2)
        ));
        KeyedStream<Person, String> ksCollection1 = dsCollection.keyBy(new KeySelector<Person, String>() {
            @Override
            public String getKey(Person p) throws Exception {
                return p.getId();
            }
        });
        DataStream<Person> dsSum = ksCollection1.sum("sum");
        dsSum.print();
        env.execute("TransformationsKeyBy");
    }
}
3.5 Reduce
KeyedStream → DataStream
A stateful operator that combines the current element with the last reduced value and emits the new value: a rolling aggregation based on a ReduceFunction that sends each intermediate result downstream.
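The rolling behaviour can be illustrated without Flink at all. The sketch below uses a hypothetical `rollingSum` helper to show that one result is emitted per input record, each combining the new value with the previously reduced value for that key:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class RollingReduceSketch {
    // Plain-Java sketch of keyed rolling reduce: for each incoming
    // (key, value) pair, combine it with the previously reduced value
    // for that key and emit the new running result.
    static List<String> rollingSum(List<String[]> records) {
        Map<String, Integer> state = new HashMap<>(); // per-key state
        List<String> emitted = new ArrayList<>();
        for (String[] r : records) {
            String key = r[0];
            int value = Integer.parseInt(r[1]);
            int reduced = state.merge(key, value, Integer::sum);
            emitted.add(key + "=" + reduced); // one output per input record
        }
        return emitted;
    }

    public static void main(String[] args) {
        List<String[]> input = List.of(
                new String[]{"a", "1"}, new String[]{"a", "2"},
                new String[]{"b", "1"}, new String[]{"a", "2"});
        System.out.println(rollingSum(input)); // [a=1, a=3, b=1, a=5]
    }
}
```

This mirrors the Person example below: key "a" emits running sums 1, 3, 5 and key "b" emits 1.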
public class Person {
private String id;
private int sum;
public Person(){}
public Person(String id, int sum) {
this.id = id;
this.sum = sum;
}
public String getId() {
return id;
}
public void setId(String id) {
this.id = id;
}
public int getSum() {
return sum;
}
public void setSum(int sum) {
this.sum = sum;
}
@Override
public String toString() {
return "Person{" +
"id='" + id + '\'' +
", sum=" + sum +
'}';
}
}
import org.apache.flink.api.common.functions.ReduceFunction;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import java.util.Arrays;

public class TransformationsReduce {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        DataStream<Person> dsCollection = env.fromCollection(Arrays.asList(
                new Person("a", 1),
                new Person("a", 2),
                new Person("b", 1),
                new Person("a", 2)
        ));
        KeyedStream<Person, String> keyStream01 = dsCollection.keyBy(new KeySelector<Person, String>() {
            @Override
            public String getKey(Person p) throws Exception {
                return p.getId();
            }
        });
        DataStream<Person> dsReduce = keyStream01.reduce(new ReduceFunction<Person>() {
            @Override
            public Person reduce(Person p1, Person p2) throws Exception {
                return new Person(p1.getId(), p1.getSum() + p2.getSum());
            }
        });
        dsReduce.print();
        env.execute("TransformationsReduce");
    }
}
3.6 Fold
KeyedStream → DataStream
A "rolling" fold on a keyed stream with an initial value: combines the current element with the last folded value and emits the new value.
How to understand it: fold combines each element of a key group into an accumulator (initialized with initialValue) and emits the accumulator after every element. The accumulator type may differ from the element type. It is similar to reduce, except for the extra initial value. Note that fold has been deprecated in newer Flink versions in favor of aggregate.
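A plain-Java sketch of these semantics (using a hypothetical `foldEmissions` helper) reproduces the per-element emissions for one key group, with a String accumulator over Person-like String elements:

```java
import java.util.ArrayList;
import java.util.List;

public class FoldSketch {
    // fold in plain Java: start from an initial accumulator, combine each
    // element of one key group into it, and emit the accumulator after every
    // element. Unlike reduce, the accumulator type can differ from the
    // element type.
    static List<String> foldEmissions(String initial, List<String> elements) {
        List<String> emitted = new ArrayList<>();
        String acc = initial;
        for (String e : elements) {
            acc = acc + "-" + e;
            emitted.add(acc); // one emission per element
        }
        return emitted;
    }

    public static void main(String[] args) {
        System.out.println(foldEmissions("start", List.of("a", "a", "a")));
        // [start-a, start-a-a, start-a-a-a]
    }
}
```

These are exactly the three results printed for key "a" in the Flink example output below.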
public class Person {
private String id;
private int sum;
public Person(){}
public Person(String id, int sum) {
this.id = id;
this.sum = sum;
}
public String getId() {
return id;
}
public void setId(String id) {
this.id = id;
}
public int getSum() {
return sum;
}
public void setSum(int sum) {
this.sum = sum;
}
@Override
public String toString() {
return "Person{" +
"id='" + id + '\'' +
", sum=" + sum +
'}';
}
}
import org.apache.flink.api.common.functions.FoldFunction;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import java.util.Arrays;

public class TransformationsFold {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        DataStream<Person> dsCollection = env.fromCollection(Arrays.asList(
                new Person("a", 1),
                new Person("a", 2),
                new Person("b", 1),
                new Person("a", 2)
        ));
        // Specify the key via a KeySelector
        KeyedStream<Person, String> keyStream01 = dsCollection.keyBy(new KeySelector<Person, String>() {
            @Override
            public String getKey(Person p) throws Exception {
                return p.getId();
            }
        });
        DataStream<String> dsFold = keyStream01.fold("start", new FoldFunction<Person, String>() {
            @Override
            public String fold(String current, Person p) throws Exception {
                return current + "-" + p.getId();
            }
        });
        dsFold.print();
        env.execute("TransformationsFold");
    }
}
Output:
6> start-a
6> start-a-a
2> start-b
6> start-a-a-a
3.7 Aggregations
KeyedStream → DataStream
Rolling aggregations on a KeyedStream. The difference between min and minBy is that min returns the minimum value, while minBy returns the element that has the minimum value in the given field (and likewise for max and maxBy).
In other words, min takes the minimum of the specified field and keeps that value in position, while the remaining fields keep the values first seen, so the emitted record may not match any actual input element (max behaves the same way). minBy returns the element whose specified field is the minimum, replacing the current result whenever a new element has a smaller value in that field (maxBy likewise).
Common aggregation methods:
keyedStream.sum(0);
keyedStream.sum("key");
keyedStream.min(0);
keyedStream.min("key");
keyedStream.max(0);
keyedStream.max("key");
keyedStream.minBy(0);
keyedStream.minBy("key");
keyedStream.maxBy(0);
keyedStream.maxBy("key");
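The min-vs-minBy distinction can be modeled in plain Java, without Flink. The sketch below (hypothetical `rollingStates` helper, tuples as int arrays [f0, f1, f2]) processes the key-0 group from the example that follows and shows why min can emit a "mixed" record:

```java
import java.util.Arrays;
import java.util.List;

public class MinVsMinBySketch {
    // Plain-Java model of rolling min(2) vs minBy(2) over one key group.
    // Returns {minState, minByState} after processing all records.
    static int[][] rollingStates(List<int[]> group) {
        int[] minState = null;   // min: only field 2 tracks the minimum
        int[] minByState = null; // minBy: the whole winning record
        for (int[] t : group) {
            if (minState == null) {
                minState = t.clone();
                minByState = t;
            } else {
                // min: update field 2 only; field 1 keeps its first-seen
                // value, so the tuple may match no actual input record
                minState[2] = Math.min(minState[2], t[2]);
                // minBy: the entire record with the smaller field 2 wins
                if (t[2] < minByState[2]) {
                    minByState = t;
                }
            }
        }
        return new int[][]{minState, minByState};
    }

    public static void main(String[] args) {
        List<int[]> group = Arrays.asList(
                new int[]{0, 2, 2}, new int[]{0, 1, 1},
                new int[]{0, 5, 6}, new int[]{0, 3, 5});
        int[][] states = rollingStates(group);
        System.out.println("min:   " + Arrays.toString(states[0])); // [0, 2, 1]
        System.out.println("minBy: " + Arrays.toString(states[1])); // [0, 1, 1]
    }
}
```

Note that min ends at (0,2,1), a tuple that never appeared in the input, while minBy ends at the real element (0,1,1), matching the Flink outputs shown below.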
Example:
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import java.util.ArrayList;
import java.util.List;

public class TransformationsAggregations {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Build the data source
        List<Tuple3<Integer, Integer, Integer>> data = new ArrayList<>();
        data.add(new Tuple3<>(0, 2, 2));
        data.add(new Tuple3<>(0, 1, 1));
        data.add(new Tuple3<>(0, 5, 6));
        data.add(new Tuple3<>(0, 3, 5));
        data.add(new Tuple3<>(1, 1, 9));
        data.add(new Tuple3<>(1, 2, 8));
        data.add(new Tuple3<>(1, 3, 10));
        data.add(new Tuple3<>(1, 2, 9));
        DataStreamSource<Tuple3<Integer, Integer, Integer>> items = env.fromCollection(data);
        //items.keyBy(0).min(2).print();
        /* min output
        6> (0,2,2)
        6> (0,2,1)
        6> (0,2,1)
        6> (0,2,1)
        6> (1,1,9)
        6> (1,1,8)
        6> (1,1,8)
        6> (1,1,8)*/
        items.keyBy(0).minBy(2).print();
        /* minBy output
        6> (0,2,2)
        6> (0,1,1)
        6> (0,1,1)
        6> (0,1,1)
        6> (1,1,9)
        6> (1,2,8)
        6> (1,2,8)
        6> (1,2,8)*/
        env.execute("TransformationsAggregations");
    }
}
3.8 Window
KeyedStream → WindowedStream
Groups an already key-partitioned KeyedStream into windows as defined by the window assigner; the windowed operator can run with parallelism greater than 1.
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.tuple.Tuple;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.WindowedStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;

public class TransformationsWindow {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        DataStream<String> text = env.socketTextStream("192.168.23.210", 9000);
        DataStream<Tuple2<String, Integer>> ds1 = text.map(new MapFunction<String, Tuple2<String, Integer>>() {
            @Override
            public Tuple2<String, Integer> map(String s) throws Exception {
                return Tuple2.of(s, 1);
            }
        });
        WindowedStream<Tuple2<String, Integer>, Tuple, TimeWindow> ds2 = ds1.keyBy(0)
                .window(TumblingProcessingTimeWindows.of(Time.seconds(5)));
        ds2.sum(1).print();
        env.execute();
    }
}
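How a tumbling window decides which window an element belongs to can be sketched without Flink. The helper below is hypothetical and assumes non-negative timestamps and a zero window offset; it mirrors the rounding that Flink's TimeWindow start computation performs:

```java
public class TumblingWindowSketch {
    // Window start = timestamp rounded down to a multiple of the window
    // size. Assumes non-negative timestamps and no window offset.
    static long windowStart(long timestampMillis, long windowSizeMillis) {
        return timestampMillis - (timestampMillis % windowSizeMillis);
    }

    public static void main(String[] args) {
        long size = 5000; // 5-second tumbling windows, as in the example above
        for (long ts : new long[]{1200, 4999, 5000, 9999, 12345}) {
            long start = windowStart(ts, size);
            // windows are half-open: [start, start + size)
            System.out.println("ts=" + ts + " -> window [" + start + ", " + (start + size) + ")");
        }
    }
}
```

Because each element falls into exactly one such interval, a tumbling window never overlaps its neighbours.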
3.9 WindowAll
DataStream → AllWindowedStream
Groups all elements of the stream into windows of the specified size. It can be applied to a plain DataStream or to a KeyedStream, but the parallelism of the windowAll operator is always 1 and cannot be changed.
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.AllWindowedStream;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;

public class TransformationsWindowAll {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        DataStream<String> text = env.socketTextStream("192.168.23.210", 9000);
        //text.print();
        DataStream<Tuple2<String, Integer>> ds1 = text.map(new MapFunction<String, Tuple2<String, Integer>>() {
            @Override
            public Tuple2<String, Integer> map(String s) throws Exception {
                return Tuple2.of(s, 1);
            }
        });
        // windowAll on a plain DataStream
        AllWindowedStream<Tuple2<String, Integer>, TimeWindow> ts = ds1.timeWindowAll(Time.seconds(5));
        ts.sum(1).print();
        // windowAll after keyBy (the keying is ignored; parallelism stays 1)
        AllWindowedStream<Tuple2<String, Integer>, TimeWindow> ds2 = ds1.keyBy(0)
                .windowAll(TumblingProcessingTimeWindows.of(Time.seconds(5)));
        ds2.sum(1).print();
        env.execute();
    }
}
4.0 Aggregations on windows
WindowedStream → DataStream
Window aggregation functions work like the ordinary aggregations above, except that they only aggregate the data inside each window.
Common window aggregation functions:
windowedStream.sum(0);
windowedStream.sum("key");
windowedStream.min(0);
windowedStream.min("key");
windowedStream.max(0);
windowedStream.max("key");
windowedStream.minBy(0);
windowedStream.minBy("key");
windowedStream.maxBy(0);
windowedStream.maxBy("key");
4.1 Union
DataStream* → DataStream
Merges two or more data streams into one stream containing all elements from all inputs:
DataStream<String> dsUnion = ds1.union(ds2, ds3);
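In plain-Java terms, union simply lets every element of every input flow into one stream. The hypothetical `union` helper below shows that duplicates are kept; in a real Flink job the interleaving order across streams is additionally not deterministic:

```java
import java.util.ArrayList;
import java.util.List;

public class UnionSketch {
    // Union keeps all elements from all inputs, including duplicates;
    // no deduplication happens.
    static List<String> union(List<String> s1, List<String> s2) {
        List<String> out = new ArrayList<>(s1);
        out.addAll(s2);
        return out;
    }

    public static void main(String[] args) {
        List<String> merged = union(List.of("a", "b"), List.of("b", "c"));
        System.out.println(merged); // [a, b, b, c] -- "b" appears twice
    }
}
```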
4.2 Split
DataStream → SplitStream
Splits one data stream into two or more streams (deprecated in newer Flink versions in favor of side outputs):
SplitStream<Integer> split = someDataStream.split(new OutputSelector<Integer>() {
    @Override
    public Iterable<String> select(Integer value) {
        List<String> output = new ArrayList<>();
        if (value % 2 == 0) {
            output.add("even");
        } else {
            output.add("odd");
        }
        return output;
    }
});
4.3 Select
SplitStream → DataStream
Selects one or more named streams from a SplitStream:
DataStream<Integer> even = split.select("even");
DataStream<Integer> odd = split.select("odd");
DataStream<Integer> all = split.select("even", "odd");