The DataStream API is the core of Flink programming. To make programs more expressive and easier to write, Flink offers several layers of API, and the DataStream API is only the middle layer.
At a lower level, instead of defining concrete operators (such as map, filter, or window), we can distill a single, generic "process" operation. It is a generalization of all transformation operators, with fully user-defined processing logic, and is called a "process function" (process function).
In a process function we work directly with the most basic ingredients of a data stream: events, state, and time. This amounts to having complete control over the stream.
Process functions are usually contrasted with rich functions (RichFunction); common transformation operators such as MapFunction and FlatMapFunction all have rich-function counterparts.
AbstractRichFunction provides getRuntimeContext() for accessing the runtime context, plus lifecycle methods; through it we can reach state as well as runtime information such as the parallelism and the task name.
public abstract class AbstractRichFunction implements RichFunction, Serializable {
// runtime context
private transient RuntimeContext runtimeContext;
public void setRuntimeContext(RuntimeContext t) { this.runtimeContext = t; }
public RuntimeContext getRuntimeContext() { ... }
public IterationRuntimeContext getIterationRuntimeContext() { ... }
// lifecycle methods
public void open(Configuration parameters) throws Exception {}
public void close() throws Exception {}
}
// runtime context
public interface RuntimeContext {
// basic runtime information such as the parallelism and the task name
/** returned ID should NOT be used for any job management tasks. */
JobID getJobId();
/** The name of the task in which the UDF runs. */
String getTaskName();
/** The metric group for this parallel subtask. */
MetricGroup getMetricGroup();
/** The parallelism with which the parallel task runs. */
int getNumberOfParallelSubtasks();
/** The max-parallelism with which the parallel task runs. */
int getMaxNumberOfParallelSubtasks();
/** The index of the parallel subtask. */
int getIndexOfThisSubtask();
/** Attempt number of the subtask. */
int getAttemptNumber();
/** The name of the task, with subtask indicator. */
String getTaskNameWithSubtasks();
ExecutionConfig getExecutionConfig();
/** The ClassLoader for user code classes. */
ClassLoader getUserCodeClassLoader();
/** Registers a custom hook for the user code class loader release. */
void registerUserCodeClassLoaderReleaseHookIfAbsent( String releaseHookName, Runnable releaseHook);
...
// state access
<T> ValueState<T> getState(ValueStateDescriptor<T> stateProperties);
<T> ListState<T> getListState(ListStateDescriptor<T> stateProperties);
<T> ReducingState<T> getReducingState(ReducingStateDescriptor<T> stateProperties);
<IN, ACC, OUT> AggregatingState<IN, OUT> getAggregatingState(
AggregatingStateDescriptor<IN, ACC, OUT> stateProperties);
<UK, UV> MapState<UK, UV> getMapState(MapStateDescriptor<UK, UV> stateProperties);
}
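To make the rich-function side of the comparison concrete, here is a minimal sketch (not taken from the text; the Event type and its user field are assumed) of a RichFlatMapFunction that uses the open() lifecycle method and the runtime context to obtain keyed state:
// sketch only: counts elements per key using ValueState obtained from the runtime context
public class CountingFlatMap extends RichFlatMapFunction<Event, String> {
    private transient ValueState<Long> countState;
    @Override
    public void open(Configuration parameters) throws Exception {
        // the runtime context is only available once the task is actually running
        countState = getRuntimeContext().getState(
                new ValueStateDescriptor<>("count", Long.class));
    }
    @Override
    public void flatMap(Event value, Collector<String> out) throws Exception {
        Long count = countState.value();
        count = (count == null) ? 1L : count + 1;
        countState.update(count);
        out.collect(getRuntimeContext().getTaskNameWithSubtasks() + " count: " + count);
    }
}
// usage (keyed state requires a KeyedStream): stream.keyBy(e -> e.user).flatMap(new CountingFlatMap());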
ProcessFunction extends the AbstractRichFunction abstract class, so it has every feature of a rich function.
In addition, its runtime context can emit data directly to a side output, and it provides a "timer service" for reading the element timestamp and the current watermark, and even for registering timers.
public abstract class ProcessFunction<I, O> extends AbstractRichFunction {
public abstract void processElement(I value, Context ctx, Collector<O> out) throws Exception;
// only a KeyedStream supports registering timers
public void onTimer(long timestamp, OnTimerContext ctx, Collector<O> out) throws Exception {}
// the context can emit data directly to a side output and exposes a timer service
public abstract class Context {
/** Timestamp of the current element; null under TimeCharacteristic#ProcessingTime. */
public abstract Long timestamp();
public abstract TimerService timerService();
public abstract <X> void output(OutputTag<X> outputTag, X value);
}
public abstract class OnTimerContext extends Context {
/** The {@link TimeDomain} of the firing timer. */
public abstract TimeDomain timeDomain();
}
}
// the timer service exposes the current processing time and watermark, and registers/deletes timers
public interface TimerService {
long currentProcessingTime(); //processing time
long currentWatermark(); //event-time watermark
void registerProcessingTimeTimer(long time);
void registerEventTimeTimer(long time);
void deleteProcessingTimeTimer(long time);
void deleteEventTimeTimer(long time);
}
Only a KeyedStream supports registering timers through the TimerService.
Timers are the main mechanism for time-based operations in process functions. The logic to run when a timer fires goes into the .onTimer() method, which is only invoked if a timer was registered earlier and its trigger time has now been reached. Timers are registered through the "timer service" (TimerService) exposed by the context.
For both processing-time and event-time timers, the TimerService internally keeps the registered timestamps in a priority queue, waiting to be fired. A timer can be seen as a piece of state of the operator on the KeyedStream, distinguished by its timestamp; the TimerService therefore deduplicates timers by key and timestamp. In other words, for each key and timestamp there is at most one timer, and even if the same one is registered several times, .onTimer() is called only once.
Flink calls .onTimer() and .processElement() synchronously, so state is never modified concurrently.
Timers are also fault-tolerant: they are saved in consistency checkpoints together with the state. After a failure, Flink restarts, reads the state back from the checkpoint, and restores the timers. A processing-time timer may then already be "overdue", in which case it fires immediately on restart.
KeyedProcessFunction is used on a KeyedStream. It works much like the basic ProcessFunction, but unlike ProcessFunction it can register timers.
stream.keyBy(data -> true) // register an event-time timer on a KeyedStream
.process(new KeyedProcessFunction<Boolean, Event, String>() {
@Override
public void processElement(Event value, Context ctx,
Collector<String> out) throws Exception {
out.collect("Element arrived, timestamp: " + ctx.timestamp());
out.collect("Element arrived, current watermark: " + ctx.timerService().currentWatermark());
// register a timer 10 seconds after this element's timestamp
ctx.timerService().registerEventTimeTimer(ctx.timestamp()+10*1000L);
}
@Override
public void onTimer(long timestamp, OnTimerContext ctx,
Collector<String> out) throws Exception {
out.collect("定时器触发,触发时间:" + timestamp);
}
})
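For comparison, a minimal sketch (assumed pipeline, not from the text) of the same pattern with a processing-time timer; as noted above, such a timer that is already overdue after a restore fires immediately:
stream.keyBy(data -> true)
    .process(new KeyedProcessFunction<Boolean, Event, String>() {
        @Override
        public void processElement(Event value, Context ctx, Collector<String> out) throws Exception {
            long currentTime = ctx.timerService().currentProcessingTime();
            out.collect("Element arrived, processing time: " + currentTime);
            // register a timer 10 seconds of processing time from now
            ctx.timerService().registerProcessingTimeTimer(currentTime + 10 * 1000L);
        }
        @Override
        public void onTimer(long timestamp, OnTimerContext ctx, Collector<String> out) throws Exception {
            out.collect("Processing-time timer fired at: " + timestamp);
        }
    });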
Compared with the basic ProcessFunction, ProcessWindowFunction changes in functionality and usage: its process() method receives the key and an Iterable over all elements of the window at once, and its Context no longer holds a TimerService, so no timers can be registered; it only exposes currentProcessingTime() and currentWatermark(), plus per-window state, global state, and side outputs.
This design keeps the processing flow clear: a timed operation is just another kind of "trigger", so everything that triggers the window belongs to the trigger, while everything that processes data belongs to the window function.
stream.keyBy( t -> t.f0 )
.window( TumblingEventTimeWindows.of(Time.seconds(10)) )
.process(new ProcessWindowFunction(){...})
public abstract class ProcessWindowFunction<IN, OUT, KEY, W extends Window>
extends AbstractRichFunction {
public abstract void process(KEY key, Context context, Iterable<IN> elements,
Collector<OUT> out) throws Exception;
public void clear(Context context) throws Exception {}
/** The context holding window metadata. */
public abstract class Context implements java.io.Serializable {
public abstract W window();
public abstract long currentProcessingTime();
public abstract long currentWatermark();
public abstract KeyedStateStore windowState();
public abstract KeyedStateStore globalState();
public abstract <X> void output(OutputTag<X> outputTag, X value);
}
}
ProcessAllWindowFunction is applied to a non-keyed window (windowAll). Compared with ProcessWindowFunction, its Context no longer has currentProcessingTime() and currentWatermark().
source.windowAll(TumblingEventTimeWindows.of(Time.seconds(5)))
.process(new ProcessAllWindowFunction(){...})
public abstract class ProcessAllWindowFunction<IN, OUT, W extends Window>
extends AbstractRichFunction {
public abstract void process(Context context, Iterable<IN> elements,
Collector<OUT> out) throws Exception;
public void clear(Context context) throws Exception {}
public abstract class Context {
public abstract W window();
public abstract KeyedStateStore windowState();
public abstract KeyedStateStore globalState();
public abstract <X> void output(OutputTag<X> outputTag, X value);
}
}
CoProcessFunction works on two connected streams. Compared with the basic ProcessFunction, processElement changes: it is split into two methods, processElement1 and processElement2, one for each input. See the sketch below.
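An abridged sketch of the CoProcessFunction signature (based on the Flink 1.13 API; details may differ slightly from the actual source), together with how it is applied to connected streams:
public abstract class CoProcessFunction<IN1, IN2, OUT> extends AbstractRichFunction {
    public abstract void processElement1(IN1 value, Context ctx, Collector<OUT> out) throws Exception;
    public abstract void processElement2(IN2 value, Context ctx, Collector<OUT> out) throws Exception;
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<OUT> out) throws Exception {}
    public abstract class Context {
        public abstract Long timestamp();
        public abstract TimerService timerService();
        public abstract <X> void output(OutputTag<X> outputTag, X value);
    }
    public abstract class OnTimerContext extends Context {
        public abstract TimeDomain timeDomain();
    }
}
// usage: stream1.connect(stream2).process(new CoProcessFunction<IN1, IN2, OUT>() {...})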
KeyedCoProcessFunction: compared with the basic CoProcessFunction, its Context and OnTimerContext both gain a getCurrentKey() method.
ProcessJoinFunction is closer to a RichJoinFunction, with side outputs and the timestamps of the joined elements added.
public abstract class ProcessJoinFunction<IN1, IN2, OUT> extends AbstractRichFunction {
public abstract void processElement(IN1 left, IN2 right, Context ctx, Collector<OUT> out) throws Exception;
public abstract class Context {
public abstract long getLeftTimestamp();
public abstract long getRightTimestamp();
/** @return The timestamp of the joined pair. */
public abstract long getTimestamp();
public abstract <X> void output(OutputTag<X> outputTag, X value);
}
}
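A minimal usage sketch (the Order and Click types and their userId field are hypothetical, for illustration only): ProcessJoinFunction is applied through an interval join between two keyed streams:
orderStream.keyBy(o -> o.userId)
    .intervalJoin(clickStream.keyBy(c -> c.userId))
    .between(Time.seconds(-5), Time.seconds(10)) // bounds relative to the left element's timestamp
    .process(new ProcessJoinFunction<Order, Click, String>() {
        @Override
        public void processElement(Order left, Click right, Context ctx, Collector<String> out) throws Exception {
            out.collect(left + " => " + right + " @ " + ctx.getTimestamp());
        }
    });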
Compared with the ProcessFunction, BroadcastProcessFunction differs as follows:
it adds a processBroadcastElement() method and has no .onTimer() method;
its Context gains a getBroadcastState() method and no longer holds a TimerService, so the current time is only available through currentProcessingTime() and currentWatermark();
ReadOnlyContext exposes the same methods as Context, but the broadcast state obtained through it is read-only.
public abstract class BroadcastProcessFunction<IN1, IN2, OUT> extends
BaseBroadcastProcessFunction {
...
public abstract void processElement(IN1 value, ReadOnlyContext ctx,
Collector<OUT> out) throws Exception;
public abstract void processBroadcastElement(IN2 value, Context ctx,
Collector<OUT> out) throws Exception;
...
}
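A minimal usage sketch (the Rule and Event types, their fields, and the descriptor name are assumed): the broadcast side distributes rules, while the main stream matches against the broadcast state:
// Rule and Event are assumed POJOs with name/url fields
MapStateDescriptor<String, Rule> ruleDescriptor =
        new MapStateDescriptor<>("rules", Types.STRING, Types.POJO(Rule.class));
BroadcastStream<Rule> ruleBroadcast = ruleStream.broadcast(ruleDescriptor);
eventStream.connect(ruleBroadcast)
    .process(new BroadcastProcessFunction<Event, Rule, String>() {
        @Override
        public void processElement(Event value, ReadOnlyContext ctx, Collector<String> out) throws Exception {
            // the non-broadcast side can only read the broadcast state
            ReadOnlyBroadcastState<String, Rule> rules = ctx.getBroadcastState(ruleDescriptor);
            if (rules.contains(value.url)) {
                out.collect("matched: " + value);
            }
        }
        @Override
        public void processBroadcastElement(Rule value, Context ctx, Collector<String> out) throws Exception {
            // the broadcast side may update the broadcast state
            ctx.getBroadcastState(ruleDescriptor).put(value.name, value);
        }
    });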
Compared with BroadcastProcessFunction, KeyedBroadcastProcessFunction (used when the non-broadcast side is a KeyedStream) differs as follows: timers come back, with an .onTimer() method and a TimerService available on its ReadOnlyContext, and the broadcast-side Context additionally provides applyToKeyedState() for operating on the keyed state of every key. An abridged sketch of its signature follows.
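(Abridged from the Flink 1.13 API as I recall it; details may differ slightly from the actual source.)
public abstract class KeyedBroadcastProcessFunction<KS, IN1, IN2, OUT> extends BaseBroadcastProcessFunction {
    public abstract void processElement(IN1 value, ReadOnlyContext ctx, Collector<OUT> out) throws Exception;
    public abstract void processBroadcastElement(IN2 value, Context ctx, Collector<OUT> out) throws Exception;
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<OUT> out) throws Exception {}
    public abstract class Context extends BaseBroadcastProcessFunction.Context {
        // apply a function to the state of every key registered under the given descriptor
        public abstract <VS, S extends State> void applyToKeyedState(
                StateDescriptor<S, VS> stateDescriptor, KeyedStateFunction<KS, S> function) throws Exception;
    }
    public abstract class ReadOnlyContext extends BaseBroadcastProcessFunction.ReadOnlyContext {
        public abstract TimerService timerService();
        public abstract KS getCurrentKey();
    }
    public abstract class OnTimerContext extends ReadOnlyContext {
        public abstract TimeDomain timeDomain();
    }
}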
The side output mechanism splits the stream: a side stream is split off from the main stream, and its data type may differ from that of the main stream.
OutputTag<String> outputTag = new OutputTag<String>("side-output") {};
SingleOutputStreamOperator<Long> longStream = stream.process(
new ProcessFunction<Integer, Long>() {
@Override
public void processElement(Integer value, Context ctx,
Collector<Long> out) throws Exception {
out.collect(Long.valueOf(value)); // convert to Long and emit to the main stream
// convert to String and emit to the side output
ctx.output(outputTag, "side-output: " + String.valueOf(value));
}
});
DataStream<String> stringStream = longStream.getSideOutput(outputTag);
This approach is not recommended: windowAll() forces the parallelism down to 1, and there is no incremental pre-aggregation, so a whole window's worth of data is collected and processed in one go, much like batch processing. A concrete example:
stream.windowAll(SlidingEventTimeWindows.of(Time.seconds(10), Time.seconds(5)))
.process(new ProcessAllWindowFunction<String, String, TimeWindow>(){
@Override
public void process(Context context, Iterable<String> elements,
Collector<String> out) throws Exception {
HashMap<String, Long> urlCountMap = new HashMap<>();
// iterate over the window's data and count views per url in a HashMap
for (String url : elements){
if(urlCountMap.containsKey(url)) {
urlCountMap.put(url, urlCountMap.get(url) + 1L);
} else {
urlCountMap.put(url, 1L);
}
}
ArrayList<Tuple2<String,Long>> mapList=new ArrayList<>();
// copy the counts into an ArrayList for sorting
for (String key : urlCountMap.keySet()) {
mapList.add(Tuple2.of(key, urlCountMap.get(key)));
}
mapList.sort(new Comparator<Tuple2<String, Long>>() {
@Override
public int compare(Tuple2<String,Long> o1,Tuple2<String,Long> o2) {
return o2.f1.intValue() - o1.f1.intValue();
}
});
// take the top two after sorting and build the output
StringBuilder result = new StringBuilder();
result.append("========================================\n");
for (int i = 0; i < 2; i++) {
Tuple2<String, Long> temp = mapList.get(i);
String info = "No." + (i + 1) + " url: " + temp.f0 + " views: " + temp.f1 + " window end: " + new Timestamp(context.window().getEnd()) + "\n";
result.append(info);
}
result.append("========================================\n");
out.collect(result.toString());
}
});
// Step 1: key by the first field and count occurrences of each key within the time window
SingleOutputStreamOperator<Tuple3<String, Long, Long>> aggregate = source.keyBy(data -> data.f0)
.window(TumblingEventTimeWindows.of(Time.seconds(10)))
.aggregate(new AggregateFunction<Tuple2<String, Long>, Long, Long>() {
@Override
public Long createAccumulator() { return 0L; }
@Override
public Long add(Tuple2<String, Long> value, Long accumulator) {
return accumulator + 1;
}
@Override
public Long getResult(Long accumulator) { return accumulator; }
@Override
public Long merge(Long a, Long b) { return a + b;}
},
new ProcessWindowFunction<Long, Tuple3<String, Long, Long>,
String, TimeWindow>() {
@Override
public void process(String s, Context context, Iterable<Long> elements,
Collector<Tuple3<String, Long, Long>> out) throws Exception {
Long num = elements.iterator().next();
long end = context.window().getEnd();
out.collect(Tuple3.of(s, end, num));
}
});
// Step 2: key by the window end timestamp and compute the Top N within each window
aggregate.keyBy(data -> data.f1)
.process(new KeyedProcessFunction<Long, Tuple3<String, Long, Long>, String>() {
private final int topN = 2; // number of Top N entries to emit
private ListState<Tuple3<String,Long,Long>> listState;
@Override
public void open(Configuration parameters) throws Exception {
listState = getRuntimeContext().getListState(new ListStateDescriptor<>
("count", Types.TUPLE(Types.STRING, Types.LONG, Types.LONG)));
}
@Override
public void processElement(Tuple3<String, Long,Long> value, Context ctx,
Collector<String> out) throws Exception {
listState.add(value);
ctx.timerService().registerEventTimeTimer(ctx.getCurrentKey() + 1); // the key is the window end; fire just after it
}
@Override
public void onTimer(long timestamp, OnTimerContext ctx,
Collector<String> out) throws Exception {
ArrayList<Tuple3<String, Long, Long>> tuple3s = new ArrayList<>();
for (Tuple3<String, Long, Long> element : listState.get()) {
tuple3s.add(element);
}
tuple3s.sort(new Comparator<Tuple3<String, Long, Long>>() {
@Override
public int compare(Tuple3<String, Long, Long> o1,
Tuple3<String, Long, Long> o2) {
return Long.compare(o2.f2, o1.f2); // descending by count, avoids int overflow
}
});
StringBuilder stringBuilder = new StringBuilder();
stringBuilder.append("----------------------------\nWindow end: ");
stringBuilder.append(new Timestamp(ctx.getCurrentKey()) + "\n");
for (int i = 0; i < topN; i++) {
Tuple3<String, Long, Long> stringLongLongTuple3 = tuple3s.get(i);
String info = "No." + (i + 1) + " key: " + stringLongLongTuple3.f0 +
" count: " + stringLongLongTuple3.f2 + "\n";
stringBuilder.append(info);
}
stringBuilder.append("----------------------------\n");
out.collect(stringBuilder.toString());
}
});
Next chapter: Flink 1.13 multi-stream transformations