Flink provides many operators, which fall roughly into two groups: DataStream and DataSet (plus the higher-level Table & SQL API), i.e. stream processing and batch processing. This article covers the use of CoGroup, Join, and Connect. They are discussed together because they are similar yet subtly different: CoGroup and Join exist as operators on both DataStream and DataSet, while Connect applies only to DataStream. Each is described in turn below.
The CoGroup operation groups two streams/data sets by key and processes the elements that share the same key together. It differs slightly from Join: even when an element finds no match in the other stream/data set, it is still emitted.
1. On DataStream
The simple example below reads data from two different ports to simulate two streams, processes them with CoGroup, and observes the output:
public class CogroupFunctionDemo02 {
    public static void main(String[] args) throws Exception {
        final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        DataStream<Tuple2<String, String>> input1 = env.socketTextStream("192.168.217.110", 9002)
                .map(new MapFunction<String, Tuple2<String, String>>() {
                    @Override
                    public Tuple2<String, String> map(String s) throws Exception {
                        return Tuple2.of(s.split(" ")[0], s.split(" ")[1]);
                    }
                });
        DataStream<Tuple2<String, String>> input2 = env.socketTextStream("192.168.217.110", 9001)
                .map(new MapFunction<String, Tuple2<String, String>>() {
                    @Override
                    public Tuple2<String, String> map(String s) throws Exception {
                        return Tuple2.of(s.split(" ")[0], s.split(" ")[1]);
                    }
                });
        input1.coGroup(input2)
                .where(new KeySelector<Tuple2<String, String>, Object>() {
                    @Override
                    public Object getKey(Tuple2<String, String> value) throws Exception {
                        return value.f0;
                    }
                }).equalTo(new KeySelector<Tuple2<String, String>, Object>() {
                    @Override
                    public Object getKey(Tuple2<String, String> value) throws Exception {
                        return value.f0;
                    }
                }).window(ProcessingTimeSessionWindows.withGap(Time.seconds(3)))
                .trigger(CountTrigger.of(1))
                .apply(new CoGroupFunction<Tuple2<String, String>, Tuple2<String, String>, Object>() {
                    @Override
                    public void coGroup(Iterable<Tuple2<String, String>> iterable,
                                        Iterable<Tuple2<String, String>> iterable1,
                                        Collector<Object> collector) throws Exception {
                        // emit, for this key, the elements collected from both streams as one string
                        StringBuilder sb = new StringBuilder("DataStream frist:");
                        for (Tuple2<String, String> value : iterable) {
                            sb.append("\n").append(value.f0).append("=>").append(value.f1);
                        }
                        sb.append("\nDataStream second:");
                        for (Tuple2<String, String> value : iterable1) {
                            sb.append("\n").append(value.f0).append("=>").append(value.f1);
                        }
                        collector.collect(sb.toString());
                    }
                }).print();
        env.execute();
    }
}
First open two terminal windows, use the nc tool to listen on the two ports, then run the program above:
[shinelon@hadoop-senior Desktop]$ nc -lk 9001
1 lj
1 al
2 af
[shinelon@hadoop-senior Desktop]$ nc -lk 9002
2 ac
1 ao
2 14
The output is as follows:
2> DataStream frist:
2=>ac
DataStream second:
4> DataStream frist:
DataStream second:
1=>lj
4> DataStream frist:
1=>ao
DataStream second:
4> DataStream frist:
DataStream second:
1=>al
2> DataStream frist:
2=>14
DataStream second:
2> DataStream frist:
2=>14
DataStream second:
2=>af
2. On DataSet
In the following example, the key is a student's class ID and the value is the student's name; coGroup merges the records that share the same key across the two data sets:
public class CoGourpDemo {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        DataSet<Tuple2<Long, String>> source1 = env.fromElements(
                Tuple2.of(1L, "xiaoming"),
                Tuple2.of(2L, "xiaowang"));
        DataSet<Tuple2<Long, String>> source2 = env.fromElements(
                Tuple2.of(2L, "xiaoli"),
                Tuple2.of(1L, "shinelon"),
                Tuple2.of(3L, "hhhhhh"));
        source1.coGroup(source2)
                .where(0).equalTo(0)
                .with(new CoGroupFunction<Tuple2<Long, String>, Tuple2<Long, String>, Object>() {
                    @Override
                    public void coGroup(Iterable<Tuple2<Long, String>> iterable,
                                        Iterable<Tuple2<Long, String>> iterable1,
                                        Collector<Object> collector) throws Exception {
                        // collect all names that share the same key into one map entry
                        Map<Long, String> result = new HashMap<>();
                        for (Tuple2<Long, String> t : iterable) {
                            result.merge(t.f0, t.f1, (a, b) -> a + " " + b);
                        }
                        for (Tuple2<Long, String> t : iterable1) {
                            result.merge(t.f0, t.f1, (a, b) -> a + " " + b);
                        }
                        collector.collect(result);
                    }
                }).print();
    }
}
The output is as follows:
{3=hhhhhh}
{1=xiaoming shinelon}
{2=xiaowang xiaoli}
The join operation is very common and resembles an inner join in a database. It is pair-oriented: it matches elements from two streams or data sets according to a condition and passes the matched pairs downstream for processing or output.
Like coGroup, join on a DataStream can only be applied within a window.
1. On DataStream
Windows come in three common types, so there are three corresponding window joins; in addition, there is the interval join:
The programming model is as follows:
stream.join(otherStream)
    .where(<KeySelector>)
    .equalTo(<KeySelector>)
    .window(<WindowAssigner>)
    .apply(<JoinFunction>)
Tumbling Window Join
A tumbling window join works window by window: within each tumbling window, elements of the two streams that share the same key are paired in inner-join fashion, and the JoinFunction or FlatJoinFunction passed to apply() processes each pair and sends the result downstream.
Sliding Window Join
This works the same way as the tumbling window join above; only the window type differs. Again a JoinFunction or FlatJoinFunction is implemented to process the matched pairs and send them downstream.
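A minimal sketch (not part of the original article) of a sliding window join, reusing the two Tuple2<String, String> socket streams input1 and input2 defined in the tumbling window example further below; only the window assigner differs:

input1.join(input2)
        .where(new KeySelector<Tuple2<String, String>, String>() {
            @Override
            public String getKey(Tuple2<String, String> value) throws Exception {
                return value.f0;
            }
        }).equalTo(new KeySelector<Tuple2<String, String>, String>() {
            @Override
            public String getKey(Tuple2<String, String> value) throws Exception {
                return value.f0;
            }
        })
        // 10-second windows evaluated every 5 seconds
        .window(SlidingEventTimeWindows.of(Time.seconds(10), Time.seconds(5)))
        .apply(new JoinFunction<Tuple2<String, String>, Tuple2<String, String>, String>() {
            @Override
            public String join(Tuple2<String, String> first, Tuple2<String, String> second) {
                return first.f1 + " " + second.f1;
            }
        }).print();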
Session Window Join
A session window join pairs elements with the same key whose session windows overlap; the session for a key closes once no element has arrived for the configured gap.
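A sketch of the session variant (again not from the original article, with the same input1 and input2), this time using a FlatJoinFunction, which emits results through a Collector:

input1.join(input2)
        .where(new KeySelector<Tuple2<String, String>, String>() {
            @Override
            public String getKey(Tuple2<String, String> value) throws Exception {
                return value.f0;
            }
        }).equalTo(new KeySelector<Tuple2<String, String>, String>() {
            @Override
            public String getKey(Tuple2<String, String> value) throws Exception {
                return value.f0;
            }
        })
        // the session for a key closes after 3 seconds without new elements
        .window(EventTimeSessionWindows.withGap(Time.seconds(3)))
        .apply(new FlatJoinFunction<Tuple2<String, String>, Tuple2<String, String>, String>() {
            @Override
            public void join(Tuple2<String, String> first, Tuple2<String, String> second, Collector<String> out) {
                out.collect(first.f1 + " " + second.f1);
            }
        }).print();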
Below is a Tumbling Window Join example:
public class TumblingWindowJoinDemo {
    public static void main(String[] args) throws Exception {
        final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
        DataStream<Tuple2<String, String>> input1 = env.socketTextStream("192.168.217.110", 9002)
                .map(new MapFunction<String, Tuple2<String, String>>() {
                    @Override
                    public Tuple2<String, String> map(String s) throws Exception {
                        return Tuple2.of(s.split(" ")[0], s.split(" ")[1]);
                    }
                }).assignTimestampsAndWatermarks(new AssignerWithPeriodicWatermarks<Tuple2<String, String>>() {
                    private long max = 2000;
                    private long currentTime;
                    @Nullable
                    @Override
                    public Watermark getCurrentWatermark() {
                        return new Watermark(currentTime - max);
                    }
                    @Override
                    public long extractTimestamp(Tuple2<String, String> element, long event) {
                        long timestamp = event;
                        currentTime = Math.max(timestamp, currentTime);
                        return currentTime;
                    }
                });
        DataStream<Tuple2<String, String>> input2 = env.socketTextStream("192.168.217.110", 9001)
                .map(new MapFunction<String, Tuple2<String, String>>() {
                    @Override
                    public Tuple2<String, String> map(String s) throws Exception {
                        return Tuple2.of(s.split(" ")[0], s.split(" ")[1]);
                    }
                }).assignTimestampsAndWatermarks(new AssignerWithPeriodicWatermarks<Tuple2<String, String>>() {
                    private long max = 5000;
                    private long currentTime;
                    @Nullable
                    @Override
                    public Watermark getCurrentWatermark() {
                        return new Watermark(System.currentTimeMillis() - max);
                    }
                    @Override
                    public long extractTimestamp(Tuple2<String, String> element, long event) {
                        long timestamp = event;
                        currentTime = Math.max(timestamp, currentTime);
                        return currentTime;
                    }
                });
        input1.join(input2)
                .where(new KeySelector<Tuple2<String, String>, Object>() {
                    @Override
                    public Object getKey(Tuple2<String, String> t) throws Exception {
                        return t.f0;
                    }
                }).equalTo(new KeySelector<Tuple2<String, String>, Object>() {
                    @Override
                    public Object getKey(Tuple2<String, String> t) throws Exception {
                        return t.f0;
                    }
                })
                .window(TumblingEventTimeWindows.of(Time.seconds(5)))
                .trigger(CountTrigger.of(1))
                .apply(new JoinFunction<Tuple2<String, String>, Tuple2<String, String>, Object>() {
                    @Override
                    public Object join(Tuple2<String, String> tuple1, Tuple2<String, String> tuple2) throws Exception {
                        if (tuple1.f0.equals(tuple2.f0)) {
                            return tuple1.f1 + " " + tuple2.f1;
                        }
                        return null;
                    }
                }).print();
        env.execute();
    }
}
Open two terminals and use the nc tool to listen on the two ports:
[shinelon@hadoop-senior Desktop]$ nc -lk 9001
1 hello
2 world
3 shinelon
5 lllll
4 sssss
[shinelon@hadoop-senior Desktop]$ nc -lk 9002
1 hello
2 limig
3 nihao
4 oooo
The output is as follows:
4> hello hello
2> limig world
3> nihao shinelon
1> oooo sssss
Interval Join
An interval join joins elements of two streams that share the same key and whose timestamps lie within a time interval relative to each other. It is typically used to pull related, keyed data from a bounded time range into one wide record. The join condition can be written as an expression like the following:
key1 == key2 && e1.timestamp + lowerBound <= e2.timestamp <= e1.timestamp + upperBound
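For example, with lowerBound = -2 ms and upperBound = 1 ms, an element of the first stream with timestamp 5 is joined with every element of the second stream that has the same key and a timestamp in [3, 6]. The documentation snippet below uses exactly these bounds: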
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.streaming.api.functions.co.ProcessJoinFunction;
import org.apache.flink.streaming.api.windowing.time.Time;
...
DataStream<Integer> orangeStream = ...
DataStream<Integer> greenStream = ...

orangeStream
    .keyBy(<KeySelector>)
    .intervalJoin(greenStream.keyBy(<KeySelector>))
    .between(Time.milliseconds(-2), Time.milliseconds(1))
    .process(new ProcessJoinFunction<Integer, Integer, String>() {
        @Override
        public void processElement(Integer first, Integer second, Context ctx, Collector<String> out) {
            out.collect(first + "," + second);
        }
    });
2. On DataSet
An example:
public class JoinDemo {
    public static void main(String[] args) throws Exception {
        final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        DataSet<Tuple2<String, Integer>> data1 = env.fromElements(
                Tuple2.of("class1", 100),
                Tuple2.of("class1", 400),
                Tuple2.of("class2", 200),
                Tuple2.of("class2", 400)
        );
        DataSet<Tuple2<String, Integer>> data2 = env.fromElements(
                Tuple2.of("class1", 300),
                Tuple2.of("class1", 600),
                Tuple2.of("class2", 200),
                Tuple2.of("class3", 200)
        );
        data1.join(data2)
                .where(0).equalTo(0)
                .with(new JoinFunction<Tuple2<String, Integer>, Tuple2<String, Integer>, Object>() {
                    @Override
                    public Object join(Tuple2<String, Integer> tuple1,
                                       Tuple2<String, Integer> tuple2) throws Exception {
                        return tuple1.f0 + " : " + tuple1.f1 + " " + tuple2.f1;
                    }
                }).print();
    }
}
Output:
class1 : 100 300
class1 : 400 300
class1 : 100 600
class1 : 400 600
class2 : 200 200
class2 : 400 200
Beyond this, DataSet offers several more join variants, such as Outer Join and Flat Join; see the official documentation for details:
https://ci.apache.org/projects/flink/flink-docs-release-1.6/dev/batch/dataset_transformations.html#join
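For instance, an outer join emits an element even when one side has no match. As a sketch (not from the article, reusing data1 and data2 from the example above), a right outer join passes null for the left side of "class3", which exists only in data2:

data1.rightOuterJoin(data2)
        .where(0).equalTo(0)
        .with(new JoinFunction<Tuple2<String, Integer>, Tuple2<String, Integer>, String>() {
            @Override
            public String join(Tuple2<String, Integer> tuple1, Tuple2<String, Integer> tuple2) {
                // tuple1 is null for "class3", which has no match in data1
                return tuple2.f0 + " : " + (tuple1 == null ? "-" : tuple1.f1) + " " + tuple2.f1;
            }
        }).print();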
Connect differs from the two operations above and applies only to DataStream. It does not pair elements by key: the two connected streams keep their own types and are handled by two separate functions that can share state, which makes it more flexible than CoGroup and Join.
It is commonly used to combine a BroadcastStream with a regular DataStream. The broadcast stream usually carries data that rarely changes, such as global configuration. For example, when reading data from Kafka topics for processing, we can wrap a configuration topic as a BroadcastStream and connect it to the event stream, so that the global configuration reaches the operator that processes the Kafka events.
A typical setup looks like this:
// event stream
final FlinkKafkaConsumer010<UserEvent> kafkaUserEventSource = new FlinkKafkaConsumer010<>(
        params.get(INPUT_EVENT_TOPIC),
        new UserEventDeserializationSchema(), consumerProps);
// (userEvent, userId)
KeyedStream<UserEvent, String> customerUserEventStream = env
        .addSource(kafkaUserEventSource)
        .assignTimestampsAndWatermarks(new CustomWatermarkExtractor(Time.hours(24)))
        .keyBy(new KeySelector<UserEvent, String>() {
            @Override
            public String getKey(UserEvent userEvent) throws Exception {
                return userEvent.getUserId();
            }
        });
//customerUserEventStream.print();

// config stream
final FlinkKafkaConsumer010<Config> kafkaConfigEventSource = new FlinkKafkaConsumer010<>(
        params.get(INPUT_CONFIG_TOPIC),
        new ConfigDeserializationSchema(), consumerProps);
final BroadcastStream<Config> configBroadcastStream = env
        .addSource(kafkaConfigEventSource)
        .broadcast(configStateDescriptor);

// connect the two streams
/* Kafka producer */
Properties producerProps = new Properties();
producerProps.setProperty(BOOTSTRAP_SERVERS, params.get(BOOTSTRAP_SERVERS));
producerProps.setProperty(RETRIES, "3");
final FlinkKafkaProducer010<EvaluatedResult> kafkaProducer = new FlinkKafkaProducer010<>(
        params.get(OUTPUT_TOPIC),
        new EvaluatedResultSerializationSchema(),
        producerProps);
/* at_least_once settings */
kafkaProducer.setLogFailuresOnly(false);
kafkaProducer.setFlushOnCheckpoint(true);
DataStream<EvaluatedResult> connectedStream = customerUserEventStream
        .connect(configBroadcastStream)
        .process(new ConnectedBroadcastProcessFuntion());
We also need to implement a ConnectedBroadcastProcessFuntion class that extends KeyedBroadcastProcessFunction. We override processElement and processBroadcastElement with our own logic; they handle the keyed event stream and the broadcast stream respectively:
public class ConnectedBroadcastProcessFuntion extends KeyedBroadcastProcessFunction<String, UserEvent, Config, EvaluatedResult> {

    @Override
    public void processElement(UserEvent value, ReadOnlyContext ctx, Collector<EvaluatedResult> out) throws Exception {
        ......
    }

    @Override
    public void processBroadcastElement(Config value, Context ctx, Collector<EvaluatedResult> out) throws Exception {
        // When handling the broadcast stream we do not forward its elements downstream;
        // unlike the regular stream, they are kept in a broadcast state maintained by
        // the state backend (e.g. RocksDB), roughly as follows:
        String channel = value.getChannel();
        // look up the broadcast state
        BroadcastState<String, Config> state = ctx.getBroadcastState(Launcher.configStateDescriptor);
        final Config oldConfig = ctx.getBroadcastState(Launcher.configStateDescriptor).get(channel);
        if (state.contains(channel)) {
            log.info("Configured channel exists: channel=" + channel);
            log.info("Config detail: oldConfig=" + oldConfig + ", newConfig=" + value);
        } else {
            log.info("Config detail: defaultConfig=" + defaultConfig + ", newConfig=" + value);
        }
        // update config value for configKey
        state.put(channel, value);
        .....
    }
}
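The configStateDescriptor referenced above (as Launcher.configStateDescriptor) is a MapStateDescriptor. Its definition is not shown in the article, but it would look roughly like the sketch below, assuming the Config POJO is keyed by its channel name:

// Hypothetical definition (not shown in the original code): names the broadcast state
// and declares its key/value types, one Config per channel.
public static final MapStateDescriptor<String, Config> configStateDescriptor =
        new MapStateDescriptor<>(
                "configBroadcastState",
                BasicTypeInfo.STRING_TYPE_INFO,
                TypeInformation.of(new TypeHint<Config>() {}));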
That concludes this article. If you have any questions, feel free to leave a comment.
You are welcome to join the Java big data discussion group: 731423890
References:
https://ci.apache.org/projects/flink/flink-docs-release-1.6/dev/stream/operators/
https://ci.apache.org/projects/flink/flink-docs-release-1.6/dev/stream/operators/joining.html
https://ci.apache.org/projects/flink/flink-docs-release-1.6/dev/batch/dataset_transformations.html#join