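1. Links to the Flink column and related series articles
This article covers 10 commonly used Flink operators (window, distinct, join, and others), each illustrated with a concrete, runnable example.
For more, see my Flink column, which covers these topics more systematically.
The examples need nothing beyond the Maven dependencies.
This series consists of five articles:
[Flink Side Notes] 1. 23 Common Flink Operators with Detailed Examples (1): map, flatMap, and filter
[Flink Side Notes] 1. 23 Common Flink Operators with Detailed Examples (2): keyBy, reduce, and Aggregations
[Flink Side Notes] 1. 23 Common Flink Operators with Detailed Examples (3): window, distinct, join, and others
[Flink Side Notes] 1. 23 Common Flink Operators with Detailed Examples (4): union, window join, connect, OutputTag, cache, iterator, project
[Flink Side Notes] 1. 23 Common Flink Operators with Detailed Examples (complete edition)
The examples in this article use the Maven dependencies and the Java bean from the first article of this series.
See the examples and their output below for details.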
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.common.functions.JoinFunction;
import org.apache.flink.api.common.operators.Order;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.api.java.operators.DataSource;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;
import org.datastreamapi.User;
/**
* @author alanchan
*
*/
public class TestFirst_Join_Distinct_OutJoin_CrossDemo {
public static void main(String[] args) throws Exception {
ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
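// joinFunction and outerJoinFunction write through the lazy print(String) sink, so env.execute() is required.
// The other DataSet examples call print() without arguments, which triggers execution by itself;
// remove the env.execute() call when running them, because no new sink is defined after print().
joinFunction(env);
env.execute();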
}
public static void unionFunction(StreamExecutionEnvironment env) throws Exception {
List<String> info1 = new ArrayList<>();
info1.add("team A");
info1.add("team B");
List<String> info2 = new ArrayList<>();
info2.add("team C");
info2.add("team D");
List<String> info3 = new ArrayList<>();
info3.add("team E");
info3.add("team F");
List<String> info4 = new ArrayList<>();
info4.add("team G");
info4.add("team H");
DataStream<String> source1 = env.fromCollection(info1);
DataStream<String> source2 = env.fromCollection(info2);
DataStream<String> source3 = env.fromCollection(info3);
DataStream<String> source4 = env.fromCollection(info4);
source1.union(source2).union(source3).union(source4).print();
// team A
// team C
// team E
// team G
// team B
// team D
// team F
// team H
}
public static void crossFunction(ExecutionEnvironment env) throws Exception {
// cross: computes the Cartesian product of two data sets; the result has (size of set 1) x (size of set 2) records
List<String> info1 = new ArrayList<>();
info1.add("team A");
info1.add("team B");
List<Tuple2<String, Integer>> info2 = new ArrayList<>();
info2.add(new Tuple2<>("W", 3));
info2.add(new Tuple2<>("D", 1));
info2.add(new Tuple2<>("L", 0));
DataSource<String> data1 = env.fromCollection(info1);
DataSource<Tuple2<String, Integer>> data2 = env.fromCollection(info2);
data1.cross(data2).print();
// (team A,(W,3))
// (team A,(D,1))
// (team A,(L,0))
// (team B,(W,3))
// (team B,(D,1))
// (team B,(L,0))
}
public static void outerJoinFunction(ExecutionEnvironment env) throws Exception {
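// Outer joins behave like LEFT JOIN, RIGHT JOIN, and FULL JOIN in SQL
// leftOuterJoin: like join, but left-side records without a match are also emitted; their right side is null
// rightOuterJoin: like join, but right-side records without a match are also emitted; their left side is null
// fullOuterJoin: like join, but unmatched records from both sides are also emitted; the missing side is null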
List<Tuple2<Integer, String>> info1 = new ArrayList<>();
info1.add(new Tuple2<>(1, "shenzhen"));
info1.add(new Tuple2<>(2, "guangzhou"));
info1.add(new Tuple2<>(3, "shanghai"));
info1.add(new Tuple2<>(4, "chengdu"));
List<Tuple2<Integer, String>> info2 = new ArrayList<>();
info2.add(new Tuple2<>(1, "深圳"));
info2.add(new Tuple2<>(2, "广州"));
info2.add(new Tuple2<>(3, "上海"));
info2.add(new Tuple2<>(5, "杭州"));
DataSource<Tuple2<Integer, String>> data1 = env.fromCollection(info1);
DataSource<Tuple2<Integer, String>> data2 = env.fromCollection(info2);
// left join
// left join:7> (1,shenzhen,深圳)
// left join:2> (3,shanghai,上海)
// left join:8> (4,chengdu,未知)
// left join:16> (2,guangzhou,广州)
data1.leftOuterJoin(data2).where(0).equalTo(0).with(new JoinFunction<Tuple2<Integer, String>, Tuple2<Integer, String>, Tuple3<Integer, String, String>>() {
@Override
public Tuple3<Integer, String, String> join(Tuple2<Integer, String> first, Tuple2<Integer, String> second) throws Exception {
Tuple3<Integer, String, String> tuple = new Tuple3<>();
if (second == null) {
tuple.setField(first.f0, 0);
tuple.setField(first.f1, 1);
tuple.setField("未知", 2);
} else {
// Another way to assign the fields; equivalent to passing them to the constructor
tuple.setField(first.f0, 0);
tuple.setField(first.f1, 1);
tuple.setField(second.f1, 2);
}
return tuple;
}
}).print("left join");
// right join
// right join:2> (3,shanghai,上海)
// right join:7> (1,shenzhen,深圳)
// right join:15> (5,--,杭州)
// right join:16> (2,guangzhou,广州)
data1.rightOuterJoin(data2).where(0).equalTo(0).with(new JoinFunction<Tuple2<Integer, String>, Tuple2<Integer, String>, Tuple3<Integer, String, String>>() {
@Override
public Tuple3<Integer, String, String> join(Tuple2<Integer, String> first, Tuple2<Integer, String> second) throws Exception {
Tuple3<Integer, String, String> tuple = new Tuple3<>();
if (first == null) {
tuple.setField(second.f0, 0);
tuple.setField("--", 1);
tuple.setField(second.f1, 2);
} else {
// Another way to assign the fields; equivalent to passing them to the constructor
tuple.setField(first.f0, 0);
tuple.setField(first.f1, 1);
tuple.setField(second.f1, 2);
}
return tuple;
}
}).print("right join");
// fullOuterJoin
// fullOuterJoin:2> (3,shanghai,上海)
// fullOuterJoin:8> (4,chengdu,--)
// fullOuterJoin:15> (5,--,杭州)
// fullOuterJoin:16> (2,guangzhou,广州)
// fullOuterJoin:7> (1,shenzhen,深圳)
data1.fullOuterJoin(data2).where(0).equalTo(0).with(new JoinFunction<Tuple2<Integer, String>, Tuple2<Integer, String>, Tuple3<Integer, String, String>>() {
@Override
public Tuple3<Integer, String, String> join(Tuple2<Integer, String> first, Tuple2<Integer, String> second) throws Exception {
Tuple3<Integer, String, String> tuple = new Tuple3<>();
if (second == null) {
tuple.setField(first.f0, 0);
tuple.setField(first.f1, 1);
tuple.setField("--", 2);
} else if (first == null) {
tuple.setField(second.f0, 0);
tuple.setField("--", 1);
tuple.setField(second.f1, 2);
} else {
// Another way to assign the fields; equivalent to passing them to the constructor
tuple.setField(first.f0, 0);
tuple.setField(first.f1, 1);
tuple.setField(second.f1, 2);
}
return tuple;
}
}).print("fullOuterJoin");
}
public static void joinFunction(ExecutionEnvironment env) throws Exception {
List<Tuple2<Integer, String>> info1 = new ArrayList<>();
info1.add(new Tuple2<>(1, "shenzhen"));
info1.add(new Tuple2<>(2, "guangzhou"));
info1.add(new Tuple2<>(3, "shanghai"));
info1.add(new Tuple2<>(4, "chengdu"));
List<Tuple2<Integer, String>> info2 = new ArrayList<>();
info2.add(new Tuple2<>(1, "深圳"));
info2.add(new Tuple2<>(2, "广州"));
info2.add(new Tuple2<>(3, "上海"));
info2.add(new Tuple2<>(5, "杭州"));
DataSource<Tuple2<Integer, String>> data1 = env.fromCollection(info1);
DataSource<Tuple2<Integer, String>> data2 = env.fromCollection(info2);
// join without a JoinFunction: each result is a pair of the matching tuples
// join:2> ((3,shanghai),(3,上海))
// join:16> ((2,guangzhou),(2,广州))
// join:7> ((1,shenzhen),(1,深圳))
data1.join(data2).where(0).equalTo(0).print("join");
// join2:2> (3,上海,shanghai)
// join2:7> (1,深圳,shenzhen)
// join2:16> (2,广州,guangzhou)
DataSet<Tuple3<Integer, String, String>> data3 = data1.join(data2).where(0).equalTo(0)
.with(new JoinFunction<Tuple2<Integer, String>, Tuple2<Integer, String>, Tuple3<Integer, String, String>>() {
@Override
public Tuple3<Integer, String, String> join(Tuple2<Integer, String> first, Tuple2<Integer, String> second) throws Exception {
return new Tuple3<Integer, String, String>(first.f0, second.f1, first.f1);
}
});
data3.print("join2");
}
public static void firstFunction(ExecutionEnvironment env) throws Exception {
List<Tuple2<Integer, String>> info = new ArrayList<>();
info.add(new Tuple2<>(1, "Hadoop"));
info.add(new Tuple2<>(1, "Spark"));
info.add(new Tuple2<>(1, "Flink"));
info.add(new Tuple2<>(2, "Scala"));
info.add(new Tuple2<>(2, "Java"));
info.add(new Tuple2<>(2, "Python"));
info.add(new Tuple2<>(3, "Linux"));
info.add(new Tuple2<>(3, "Window"));
info.add(new Tuple2<>(3, "MacOS"));
DataSet<Tuple2<Integer, String>> dataSet = env.fromCollection(info);
// Take the first 4 elements
// dataSet.first(4).print();
// (1,Hadoop)
// (1,Spark)
// (1,Flink)
// (2,Scala)
// Group by the first field of the Tuple2 and take the first 2 elements of each group
// dataSet.groupBy(0).first(2).print();
// (3,Linux)
// (3,Window)
// (1,Hadoop)
// (1,Spark)
// (2,Scala)
// (2,Java)
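// Group by the first field of the Tuple2, sort each group by the second field in descending order, and take the first 2 elements of each group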
dataSet.groupBy(0).sortGroup(1, Order.DESCENDING).first(2).print();
// (3,Window)
// (3,MacOS)
// (1,Spark)
// (1,Hadoop)
// (2,Scala)
// (2,Python)
}
public static void distinctFunction(ExecutionEnvironment env) throws Exception {
List<Tuple3<Integer, Integer, Integer>> list = new ArrayList<>();
list.add(new Tuple3<>(0, 3, 6));
list.add(new Tuple3<>(0, 2, 5));
list.add(new Tuple3<>(0, 3, 6));
list.add(new Tuple3<>(1, 1, 9));
list.add(new Tuple3<>(1, 2, 8));
list.add(new Tuple3<>(1, 2, 8));
list.add(new Tuple3<>(1, 3, 9));
DataSet<Tuple3<Integer, Integer, Integer>> source = env.fromCollection(list);
// Remove tuples whose fields are all identical
source.distinct().print();
// (1,3,9)
// (0,3,6)
// (1,1,9)
// (1,2,8)
// (0,2,5)
// Deduplicate on the first field, keeping one tuple per distinct value
// source.distinct(0).print();
// (1,1,9)
// (0,3,6)
// Deduplicate on the combination of the first and third fields, keeping one tuple per combination
// source.distinct(0,2).print();
// (0,3,6)
// (1,1,9)
// (1,2,8)
// (0,2,5)
}
public static void distinctFunction2(ExecutionEnvironment env) throws Exception {
DataSet<User> source = env.fromCollection(Arrays.asList(new User(1, "alan1", "1", "[email protected]", 18, 3000), new User(2, "alan2", "2", "[email protected]", 19, 200),
new User(3, "alan1", "3", "[email protected]", 18, 1000), new User(5, "alan1", "5", "[email protected]", 28, 1500), new User(4, "alan2", "4", "[email protected]", 20, 300)));
// source.distinct("name").print();
// User(id=2, name=alan2, pwd=2, [email protected], age=19, balance=200.0)
// User(id=1, name=alan1, pwd=1, [email protected], age=18, balance=3000.0)
source.distinct("name", "age").print();
// User(id=1, name=alan1, pwd=1, [email protected], age=18, balance=3000.0)
// User(id=2, name=alan2, pwd=2, [email protected], age=19, balance=200.0)
// User(id=5, name=alan1, pwd=5, [email protected], age=28, balance=1500.0)
// User(id=4, name=alan2, pwd=4, [email protected], age=20, balance=300.0)
}
public static void distinctFunction3(ExecutionEnvironment env) throws Exception {
DataSet<User> source = env.fromCollection(Arrays.asList(new User(1, "alan1", "1", "[email protected]", 18, -1000), new User(2, "alan2", "2", "[email protected]", 19, 200),
new User(3, "alan1", "3", "[email protected]", 18, -1000), new User(5, "alan1", "5", "[email protected]", 28, 1500), new User(4, "alan2", "4", "[email protected]", 20, -300)));
// Deduplicate on the absolute value of balance
source.distinct(new KeySelector<User, Double>() {
@Override
public Double getKey(User value) throws Exception {
return Math.abs(value.getBalance());
}
}).print();
// User(id=5, name=alan1, pwd=5, [email protected], age=28, balance=1500.0)
// User(id=2, name=alan2, pwd=2, [email protected], age=19, balance=200.0)
// User(id=1, name=alan1, pwd=1, [email protected], age=18, balance=-1000.0)
// User(id=4, name=alan2, pwd=4, [email protected], age=20, balance=-300.0)
}
public static void distinctFunction4(ExecutionEnvironment env) throws Exception {
List<String> info = new ArrayList<>();
info.add("Hadoop,Spark");
info.add("Spark,Flink");
info.add("Hadoop,Flink");
info.add("Hadoop,Flink");
DataSet<String> source = env.fromCollection(info);
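// Split each line into tokens, then deduplicate the tokens.
// The flatMap result must be chained: a transformation returns a new DataSet and leaves source unchanged.
source.flatMap(new FlatMapFunction<String, String>() {
@Override
public void flatMap(String value, Collector<String> out) throws Exception {
System.err.print("come in ");
for (String token : value.split(",")) {
out.collect(token);
}
}
}).distinct().print();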
}
}
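The outer-join behavior exercised by outerJoinFunction above can also be pictured without Flink. The following is a minimal plain-Java sketch of full outer join semantics, not Flink API: the class name OuterJoinSketch, the method fullOuterJoin, and the sample values are hypothetical, and it assumes each key appears at most once per side so that a Map can stand in for a data set.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Plain-Java sketch of full outer join semantics; not Flink API.
// Assumption: each key appears at most once per side, so a Map can stand in for a data set.
class OuterJoinSketch {

    // Emits one "(key,left,right)" row per key found on either side;
    // a side without a match is filled with the placeholder instead of null.
    static List<String> fullOuterJoin(Map<Integer, String> left,
                                      Map<Integer, String> right,
                                      String placeholder) {
        Set<Integer> keys = new LinkedHashSet<>(left.keySet());
        keys.addAll(right.keySet());
        List<String> rows = new ArrayList<>();
        for (Integer key : keys) {
            String l = left.containsKey(key) ? left.get(key) : placeholder;
            String r = right.containsKey(key) ? right.get(key) : placeholder;
            rows.add("(" + key + "," + l + "," + r + ")");
        }
        return rows;
    }

    public static void main(String[] args) {
        Map<Integer, String> left = new LinkedHashMap<>();
        left.put(1, "shenzhen");
        left.put(4, "chengdu");
        Map<Integer, String> right = new LinkedHashMap<>();
        right.put(1, "SZ");
        right.put(5, "HZ");
        // prints [(1,shenzhen,SZ), (4,chengdu,--), (5,--,HZ)]
        System.out.println(fullOuterJoin(left, right, "--"));
    }
}
```

Iterating over only the left keys would give the left outer join, and iterating over only the right keys would give the right outer join.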
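KeyedStream → WindowedStream
The window function groups an existing KeyedStream by time or other criteria. The following groups a keyed stream into 10-second tumbling time windows:
inputStream.keyBy(0).window(TumblingEventTimeWindows.of(Time.seconds(10)));
Flink slices a (potentially) unbounded data stream into finite pieces called windows, so that transformations can be applied to one chunk of data at a time. To window a stream, you assign a key by which the stream is partitioned and a function that describes the transformation to perform on the windowed stream. To slice the stream into windows, use one of Flink's built-in window assigners: tumbling windows, sliding windows, global windows, or session windows.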
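To make the idea of slicing a stream into tumbling windows concrete, here is a minimal plain-Java sketch of how a timestamp could be mapped to its window. It is not Flink code: the class TumblingWindowSketch and its methods are hypothetical, and it assumes non-negative millisecond timestamps and windows aligned to zero with no offset.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java sketch of tumbling window assignment; not Flink API.
// Assumption: non-negative timestamps in milliseconds, windows aligned to 0 (no offset).
class TumblingWindowSketch {

    // Returns the start of the fixed-size window that contains the timestamp.
    static long windowStart(long timestampMillis, long sizeMillis) {
        return timestampMillis - (timestampMillis % sizeMillis);
    }

    // Groups timestamps by the start of their window, ordered by window start.
    static Map<Long, List<Long>> assign(List<Long> timestamps, long sizeMillis) {
        Map<Long, List<Long>> windows = new TreeMap<>();
        for (long ts : timestamps) {
            windows.computeIfAbsent(windowStart(ts, sizeMillis), k -> new ArrayList<>()).add(ts);
        }
        return windows;
    }

    public static void main(String[] args) {
        // 10-second windows: [0, 10000), [10000, 20000), [20000, 30000)
        // prints {0=[1000, 9999], 10000=[10000], 20000=[25000]}
        System.out.println(assign(Arrays.asList(1_000L, 9_999L, 10_000L, 25_000L), 10_000L));
    }
}
```

Every element lands in exactly one window, which is what distinguishes tumbling windows from sliding windows, where windows overlap.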
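For details, see these articles in the series:
6. Flink's Four Cornerstones: Window Explained with Detailed Examples (Part 1)
6. Flink's Four Cornerstones: Window Explained with Detailed Examples (Part 2)
7. Flink's Four Cornerstones: Time and Watermark Explained with Detailed Examples (basic watermark usage, a watermark example with Kafka as the data source, and receiving data that arrives after the maximum allowed lateness)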
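DataStream → AllWindowedStream
The windowAll function groups a regular (non-keyed) data stream into windows. It is usually a non-parallel transformation, because it runs on a non-partitioned stream.
Windowed streams offer functions analogous to those of regular data streams; the only difference is that they operate on windowed streams. Window reduce works like the reduce function, window fold works like the fold function, and aggregations are available as well.
dataStream.windowAll(TumblingEventTimeWindows.of(Time.seconds(5))); // Last 5 seconds of data
In many cases this is a non-parallel transformation: all records are gathered into a single task for the windowAll operator.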
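For details, see these articles in the series:
6. Flink's Four Cornerstones: Window Explained with Detailed Examples (Part 1)
6. Flink's Four Cornerstones: Window Explained with Detailed Examples (Part 2)
7. Flink's Four Cornerstones: Time and Watermark Explained with Detailed Examples (basic watermark usage, a watermark example with Kafka as the data source, and receiving data that arrives after the maximum allowed lateness)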
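WindowedStream → DataStream
AllWindowedStream → DataStream
Applies a general function to the window as a whole. Below is a function that manually sums the elements of a window.
If you use the windowAll transformation, use an AllWindowFunction instead.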
// Apply a WindowFunction to a keyed windowed stream
windowedStream.apply(new WindowFunction<Tuple2<String,Integer>, Integer, Tuple, Window>() {
public void apply (Tuple tuple,
Window window,
Iterable<Tuple2<String, Integer>> values,
Collector<Integer> out) throws Exception {
int sum = 0;
for (Tuple2<String, Integer> t: values) {
sum += t.f1;
}
out.collect(sum);
}
});
// Apply an AllWindowFunction to a non-keyed windowed stream
allWindowedStream.apply (new AllWindowFunction<Tuple2<String,Integer>, Integer, Window>() {
public void apply (Window window,
Iterable<Tuple2<String, Integer>> values,
Collector<Integer> out) throws Exception {
int sum = 0;
for (Tuple2<String, Integer> t: values) {
sum += t.f1;
}
out.collect(sum);
}
});
WindowedStream → DataStream
Applies a reduce function to the window and returns the reduced value.
windowedStream.reduce (new ReduceFunction<Tuple2<String,Integer>>() {
public Tuple2<String, Integer> reduce(Tuple2<String, Integer> value1, Tuple2<String, Integer> value2) throws Exception {
return new Tuple2<String,Integer>(value1.f0, value1.f1 + value2.f1);
}
});
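The difference between handing the whole window to a function (apply) and combining its elements pairwise (reduce) can be sketched in plain Java. This is only an illustration under stated assumptions: the class WindowReduceSketch and its helpers are hypothetical, Map.Entry stands in for Tuple2<String, Integer>, and a List stands in for the contents of one window.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.function.BinaryOperator;

// Plain-Java sketch contrasting "apply" and "reduce" on one window; not Flink API.
// Assumption: a List holds the window's elements and Map.Entry replaces Tuple2.
class WindowReduceSketch {

    // Small helper that builds a (key, value) pair.
    static Map.Entry<String, Integer> entry(String key, int value) {
        return new SimpleEntry<>(key, value);
    }

    // "apply" style: the function receives every element of the window at once.
    static int sumAll(Iterable<Map.Entry<String, Integer>> window) {
        int sum = 0;
        for (Map.Entry<String, Integer> e : window) {
            sum += e.getValue();
        }
        return sum;
    }

    // "reduce" style: elements are combined pairwise, so only a running result is kept.
    // Assumes the window is not empty.
    static Map.Entry<String, Integer> reduce(List<Map.Entry<String, Integer>> window,
                                             BinaryOperator<Map.Entry<String, Integer>> fn) {
        Map.Entry<String, Integer> acc = window.get(0);
        for (int i = 1; i < window.size(); i++) {
            acc = fn.apply(acc, window.get(i));
        }
        return acc;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> window =
                Arrays.asList(entry("a", 1), entry("a", 2), entry("a", 3));
        System.out.println(sumAll(window)); // prints 6
        // prints a=6
        System.out.println(reduce(window, (x, y) -> entry(x.getKey(), x.getValue() + y.getValue())));
    }
}
```

Both styles produce the same sum here; the pairwise form needs only the running result, while the apply form needs access to all elements of the window.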
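WindowedStream → DataStream
Aggregates the contents of a window. The difference between min and minBy is that min returns the minimum value, whereas minBy returns the element that has the minimum value in that field (the same holds for max and maxBy).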
windowedStream.sum(0);
windowedStream.sum("key");
windowedStream.min(0);
windowedStream.min("key");
windowedStream.max(0);
windowedStream.max("key");
windowedStream.minBy(0);
windowedStream.minBy("key");
windowedStream.maxBy(0);
windowedStream.maxBy("key");
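The distinction between min and minBy is easy to miss, so here is a minimal plain-Java sketch of it. It is not Flink code: the class MinVsMinBySketch and the Reading type are hypothetical. It also rests on an assumption about Flink's behavior, namely that min only guarantees the aggregated field and carries the remaining fields over from the first element seen, whereas minBy returns the complete element that holds the minimum.

```java
import java.util.Arrays;
import java.util.List;

// Plain-Java sketch of min versus minBy over one window; not Flink API.
// Assumption: min keeps the other fields of the first element and replaces only the
// aggregated field, while minBy returns the whole element that holds the minimum.
class MinVsMinBySketch {

    // Hypothetical two-field element: a label plus the numeric field being aggregated.
    static final class Reading {
        final String label;
        final int value;

        Reading(String label, int value) {
            this.label = label;
            this.value = value;
        }

        @Override
        public String toString() {
            return "(" + label + "," + value + ")";
        }
    }

    // min style: smallest value, but the label still comes from the first element.
    static Reading min(List<Reading> window) {
        Reading first = window.get(0);
        int smallest = first.value;
        for (Reading r : window) {
            smallest = Math.min(smallest, r.value);
        }
        return new Reading(first.label, smallest);
    }

    // minBy style: the complete element with the smallest value (first one on ties).
    static Reading minBy(List<Reading> window) {
        Reading best = window.get(0);
        for (Reading r : window) {
            if (r.value < best.value) {
                best = r;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<Reading> window = Arrays.asList(
                new Reading("a", 5), new Reading("b", 2), new Reading("c", 7));
        System.out.println(min(window));   // prints (a,2)
        System.out.println(minBy(window)); // prints (b,2)
    }
}
```

When the other fields of the result matter, minBy and maxBy are the safer choice under this assumption.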
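In summary, this article covered 10 commonly used Flink operators (window, distinct, join, and others), each illustrated with a concrete, runnable example.
For more, see my Flink column, which covers these topics more systematically.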