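- CEP and state machines
- How a state machine is represented and how it is applied to a stream
The general structure of CEP code in Flink looks like this: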
val input = env.fromElements(
new Event(1, "barfoo", 1.0),
new Event(2, "start", 2.0),
new Event(3, "foobar", 3.0),
new SubEvent(4, "foo", 4.0, 1.0),
new Event(5, "middle", 5.0),
new SubEvent(6, "middle", 6.0, 2.0),
new SubEvent(7, "bar", 3.0, 3.0),
new Event(42, "42", 42.0),
new Event(8, "end", 1.0)
)
val pattern: Pattern[Event, Event] = Pattern.begin[Event]("start")
.where(new SimpleCondition[Event] {
override def filter(e: Event): Boolean = {
e.name.equals("start")
}
})
.followedByAny("middle").subtype[SubEvent](classOf[SubEvent])
.where(new SimpleCondition[SubEvent] {
override def filter(e: SubEvent): Boolean = {
e.name.equals("middle")
}
})
.followedByAny("end")
.where(new SimpleCondition[Event] {
override def filter(e: Event): Boolean = {
e.name.equals("end")
}
})
val patternStream = CEP.pattern(input, pattern)
val result = patternStream.process(
new PatternProcessFunction[Event, String] {
// Note: the matched events arrive in a map, so the order between pattern stages is lost here
override def processMatch(matchResult: util.Map[String, util.List[Event]],
ctx: PatternProcessFunction.Context, out: Collector[String]): Unit = {
val info = matchResult.asScala.map{ case (k, v) =>
(k, v.asScala.mkString(","))
}.mkString(";")
out.collect(info)
}
}
)
result.print()
env.execute("cep demo")
As the code above shows, the flow is:
- start from an ordinary DataStream
- apply a Pattern to it, which yields a PatternStream
- finally call PatternStream#process, which turns it back into an ordinary DataStream
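The comment in processMatch warns that the match arrives as a map, so iterating it may not follow the pattern order. One way around this, shown here as a minimal plain-Java sketch rather than Flink code, is to look each stage up by name in declaration order. The class MatchOrderDemo, the STAGES list, and the string stand-ins for events are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class MatchOrderDemo {
    // Stage names in the order the pattern declares them (copied from the example above).
    static final List<String> STAGES = Arrays.asList("start", "middle", "end");

    // Renders a match by looking up each stage by name instead of iterating the map,
    // so the output order does not depend on the map implementation.
    static String render(Map<String, List<String>> match) {
        List<String> parts = new ArrayList<>();
        for (String stage : STAGES) {
            List<String> events = match.get(stage);
            if (events != null) {
                parts.add(stage + "=" + String.join(",", events));
            }
        }
        return String.join(";", parts);
    }

    public static void main(String[] args) {
        // Strings stand in for Event objects; insertion order is deliberately scrambled.
        Map<String, List<String>> match = new HashMap<>();
        match.put("end", Arrays.asList("Event(8)"));
        match.put("start", Arrays.asList("Event(2)"));
        match.put("middle", Arrays.asList("SubEvent(6)"));
        System.out.println(render(match));
    }
}
```

The same idea carries over to the Scala processMatch above: read matchResult.get("start"), then "middle", then "end", instead of mapping over the whole map.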
1. PatternStream#process
Now let's look at what process actually does:
public SingleOutputStreamOperator process(
final PatternProcessFunction patternProcessFunction,
final TypeInformation outTypeInfo) {
return builder.build(
outTypeInfo,
builder.clean(patternProcessFunction));
}
SingleOutputStreamOperator build(
final TypeInformation outTypeInfo,
final PatternProcessFunction processFunction) {
final TypeSerializer inputSerializer = inputStream.getType().createSerializer(inputStream.getExecutionConfig());
final boolean isProcessingTime = inputStream.getExecutionEnvironment().getStreamTimeCharacteristic() == TimeCharacteristic.ProcessingTime;
final boolean timeoutHandling = processFunction instanceof TimedOutPartialMatchHandler;
final NFACompiler.NFAFactory nfaFactory = NFACompiler.compileFactory(pattern, timeoutHandling);
final CepOperator operator = new CepOperator<>(
inputSerializer,
isProcessingTime,
nfaFactory,
comparator,
pattern.getAfterMatchSkipStrategy(),
processFunction,
lateDataOutputTag);
final SingleOutputStreamOperator patternStream;
if (inputStream instanceof KeyedStream) {
KeyedStream keyedStream = (KeyedStream) inputStream;
patternStream = keyedStream.transform(
"CepOperator",
outTypeInfo,
operator);
} else {
KeySelector keySelector = new NullByteKeySelector<>();
patternStream = inputStream.keyBy(keySelector).transform(
"GlobalCepOperator",
outTypeInfo,
operator
).forceNonParallel();
}
return patternStream;
}
As shown above, the actual computation is encapsulated in CepOperator.
2. CepOperator
The state objects it uses for storage:
private transient ValueState<NFAState> computationStates;
private transient MapState<Long, List<IN>> elementQueueState;
private transient SharedBuffer<IN> partialMatches;
How each element is processed:
@Override
public void processElement(StreamRecord element) throws Exception {
if (isProcessingTime) {
if (comparator == null) {
// there can be no out of order elements in processing time
NFAState nfaState = getNFAState();
long timestamp = getProcessingTimeService().getCurrentProcessingTime();
advanceTime(nfaState, timestamp);
processEvent(nfaState, element.getValue(), timestamp);
updateNFA(nfaState);
} else {
long currentTime = timerService.currentProcessingTime();
bufferEvent(element.getValue(), currentTime);
// register a timer for the next millisecond to sort and emit buffered data
timerService.registerProcessingTimeTimer(VoidNamespace.INSTANCE, currentTime + 1);
}
} else {
long timestamp = element.getTimestamp();
IN value = element.getValue();
// In event-time processing we assume correctness of the watermark.
// Events with timestamp smaller than or equal with the last seen watermark are considered late.
// Late events are put in a dedicated side output, if the user has specified one.
if (timestamp > lastWatermark) {
// we have an event with a valid timestamp, so
// we buffer it until we receive the proper watermark.
saveRegisterWatermarkTimer();
bufferEvent(value, timestamp);
} else if (lateDataOutputTag != null) {
output.collect(lateDataOutputTag, element);
}
}
}
As shown above, when isProcessingTime && comparator == null, the element is processed immediately:
// find the partial matches that have timed out
private void advanceTime(NFAState nfaState, long timestamp) throws Exception {
try (SharedBufferAccessor sharedBufferAccessor = partialMatches.getAccessor()) {
Collection<Tuple2<Map<String, List<IN>>, Long>> timedOut =
nfa.advanceTime(sharedBufferAccessor, nfaState, timestamp);
if (!timedOut.isEmpty()) {
processTimedOutSequences(timedOut);
}
}
}
// process a single event
private void processEvent(NFAState nfaState, IN event, long timestamp) throws Exception {
try (SharedBufferAccessor sharedBufferAccessor = partialMatches.getAccessor()) {
Collection
In every other case the operator calls bufferEvent and registers a timer to process the buffered data later.
bufferEvent puts the data into elementQueueState:
private void bufferEvent(IN event, long currentTime) throws Exception {
List elementsForTimestamp = elementQueueState.get(currentTime);
if (elementsForTimestamp == null) {
elementsForTimestamp = new ArrayList<>();
}
if (getExecutionConfig().isObjectReuseEnabled()) {
// copy the StreamRecord so that it cannot be changed
elementsForTimestamp.add(inputSerializer.copy(event));
} else {
elementsForTimestamp.add(event);
}
elementQueueState.put(currentTime, elementsForTimestamp);
}
CepOperator also implements Triggerable, and with it onEventTime and onProcessingTime, so when the timers registered above fire, these two methods are called to process the buffered data:
private PriorityQueue getSortedTimestamps() throws Exception {
PriorityQueue sortedTimestamps = new PriorityQueue<>();
for (Long timestamp : elementQueueState.keys()) {
sortedTimestamps.offer(timestamp);
}
return sortedTimestamps;
}
@Override
public void onEventTime(InternalTimer timer) throws Exception {
// 1) get the queue of pending elements for the key and the corresponding NFA,
// 2) process the pending elements in event time order and custom comparator if exists
// by feeding them in the NFA
// 3) advance the time to the current watermark, so that expired patterns are discarded.
// 4) update the stored state for the key, by only storing the new NFA and MapState iff they
// have state to be used later.
// 5) update the last seen watermark.
// STEP 1
PriorityQueue sortedTimestamps = getSortedTimestamps();
NFAState nfaState = getNFAState();
// STEP 2
while (!sortedTimestamps.isEmpty() && sortedTimestamps.peek() <= timerService.currentWatermark()) {
long timestamp = sortedTimestamps.poll();
advanceTime(nfaState, timestamp);
try (Stream elements = sort(elementQueueState.get(timestamp))) {
elements.forEachOrdered(
event -> {
try {
processEvent(nfaState, event, timestamp);
} catch (Exception e) {
throw new RuntimeException(e);
}
}
);
}
elementQueueState.remove(timestamp);
}
// STEP 3
advanceTime(nfaState, timerService.currentWatermark());
// STEP 4
updateNFA(nfaState);
if (!sortedTimestamps.isEmpty() || !partialMatches.isEmpty()) {
saveRegisterWatermarkTimer();
}
// STEP 5
updateLastSeenWatermark(timerService.currentWatermark());
}
@Override
public void onProcessingTime(InternalTimer timer) throws Exception {
// 1) get the queue of pending elements for the key and the corresponding NFA,
// 2) process the pending elements in process time order and custom comparator if exists
// by feeding them in the NFA
// 3) update the stored state for the key, by only storing the new NFA and MapState iff they
// have state to be used later.
// STEP 1
PriorityQueue sortedTimestamps = getSortedTimestamps();
NFAState nfa = getNFAState();
// STEP 2
while (!sortedTimestamps.isEmpty()) {
long timestamp = sortedTimestamps.poll();
advanceTime(nfa, timestamp);
try (Stream elements = sort(elementQueueState.get(timestamp))) {
elements.forEachOrdered(
event -> {
try {
processEvent(nfa, event, timestamp);
} catch (Exception e) {
throw new RuntimeException(e);
}
}
);
}
elementQueueState.remove(timestamp);
}
// STEP 3
updateNFA(nfa);
}
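The event-time path above (buffer by timestamp, then drain in ascending order once the watermark passes) can be reduced to a small plain-Java sketch. It is not the Flink implementation: EventTimeBufferSketch, offer, and onWatermark are hypothetical names, a HashMap stands in for elementQueueState, and strings stand in for events.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

final class EventTimeBufferSketch {
    // Stands in for elementQueueState: timestamp -> events that carry that timestamp.
    private final Map<Long, List<String>> queue = new HashMap<>();
    private long lastWatermark = Long.MIN_VALUE;

    // Mirrors the event-time branch of processElement: buffer on-time events, reject late ones.
    boolean offer(long timestamp, String event) {
        if (timestamp <= lastWatermark) {
            return false; // late; the real operator emits it to the side output if one is configured
        }
        queue.computeIfAbsent(timestamp, t -> new ArrayList<>()).add(event);
        return true;
    }

    // Mirrors onEventTime: release every buffered timestamp <= watermark in ascending order.
    List<String> onWatermark(long watermark) {
        PriorityQueue<Long> sorted = new PriorityQueue<>(queue.keySet());
        List<String> released = new ArrayList<>();
        while (!sorted.isEmpty() && sorted.peek() <= watermark) {
            long ts = sorted.poll();
            released.addAll(queue.remove(ts)); // here the real operator feeds each event to the NFA
        }
        lastWatermark = watermark;
        return released;
    }

    public static void main(String[] args) {
        EventTimeBufferSketch buffer = new EventTimeBufferSketch();
        buffer.offer(5L, "e5");
        buffer.offer(3L, "e3");
        buffer.offer(9L, "e9");
        System.out.println(buffer.onWatermark(6L)); // events with timestamps 3 and 5, in that order
        System.out.println(buffer.offer(4L, "late")); // rejected: 4 is not after the watermark 6
    }
}
```

Sorting the keys on every timer firing is simple rather than cheap; it matches the getSortedTimestamps approach shown above, where the map state itself carries no ordering.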
3. NFA
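The code above shows that the core processing all lives in the NFA.
For the paper behind the NFA, see Efficient Pattern Matching over Event Streams.
As developers, it is enough to understand the rough implementation logic of the NFA and the core problem it solves.
The code above calls two NFA methods:
- advanceTime
- process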
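To make the role of advanceTime concrete, here is a minimal sketch of window-based pruning in plain Java. It assumes a partial match times out once the current timestamp is at least windowTime past the timestamp of its first event, and that a windowTime of 0 means no time limit; both rules are assumptions for illustration, as are the names AdvanceTimeSketch and PartialMatch.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

final class AdvanceTimeSketch {
    // A partial match reduced to a label and the timestamp of its first event (hypothetical type).
    static final class PartialMatch {
        final String name;
        final long startTimestamp;

        PartialMatch(String name, long startTimestamp) {
            this.name = name;
            this.startTimestamp = startTimestamp;
        }
    }

    // Removes from `active` and returns the partial matches whose window has elapsed at `timestamp`.
    static List<PartialMatch> advanceTime(List<PartialMatch> active, long windowTime, long timestamp) {
        List<PartialMatch> timedOut = new ArrayList<>();
        if (windowTime <= 0) {
            return timedOut; // assumed convention: no window configured, nothing expires
        }
        Iterator<PartialMatch> it = active.iterator();
        while (it.hasNext()) {
            PartialMatch pm = it.next();
            if (timestamp - pm.startTimestamp >= windowTime) {
                timedOut.add(pm);
                it.remove();
            }
        }
        return timedOut;
    }

    public static void main(String[] args) {
        List<PartialMatch> active = new ArrayList<>();
        active.add(new PartialMatch("a", 0L));
        active.add(new PartialMatch("b", 8L));
        List<PartialMatch> timedOut = advanceTime(active, 10L, 12L);
        System.out.println(timedOut.size() + " timed out, " + active.size() + " still active");
    }
}
```

In the operator, the timed-out collection is what gets handed to processTimedOutSequences.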
3.1 NFACompiler
Initializing the NFA goes through NFACompiler:
final NFACompiler.NFAFactory nfaFactory = NFACompiler.compileFactory(pattern, timeoutHandling);
final CepOperator operator = new CepOperator<>(
inputSerializer,
isProcessingTime,
nfaFactory,
comparator,
pattern.getAfterMatchSkipStrategy(),
processFunction,
lateDataOutputTag);
This class compiles the pattern into an NFAFactory, and it is the factory, not the pattern itself, that is passed to CepOperator:
public static NFAFactory compileFactory(
final Pattern pattern,
boolean timeoutHandling) {
if (pattern == null) {
// return a factory for empty NFAs
return new NFAFactoryImpl<>(0, Collections.<State<T>>emptyList(), timeoutHandling);
} else {
final NFAFactoryCompiler nfaFactoryCompiler = new NFAFactoryCompiler<>(pattern);
nfaFactoryCompiler.compileFactory();
return new NFAFactoryImpl<>(nfaFactoryCompiler.getWindowTime(), nfaFactoryCompiler.getStates(), timeoutHandling);
}
}
The inner compileFactory method is where the pattern is actually turned into states. These states are used later to initialize the NFA and do not change afterwards:
void compileFactory() {
if (currentPattern.getQuantifier().getConsumingStrategy() == Quantifier.ConsumingStrategy.NOT_FOLLOW) {
throw new MalformedPatternException("NotFollowedBy is not supported as a last part of a Pattern!");
}
checkPatternNameUniqueness();
checkPatternSkipStrategy();
// we're traversing the pattern from the end to the beginning --> the first state is the final state
State sinkState = createEndingState();
// add all the normal states
sinkState = createMiddleStates(sinkState);
// add the beginning state
createStartState(sinkState);
}
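The back-to-front traversal in compileFactory can be illustrated with a small plain-Java sketch: building from the last stage means each new state can point at an already constructed successor. This is a simplification that ignores conditions, quantifiers, and contiguity; BackwardCompileSketch, State, and the ending-state name used here are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class BackwardCompileSketch {
    // An immutable state that knows only its name and its successor.
    static final class State {
        final String name;
        final State next; // null for the final state

        State(String name, State next) {
            this.name = name;
            this.next = next;
        }
    }

    static final String ENDING_STATE_NAME = "$end$"; // placeholder name chosen for this sketch

    // Walks the stage names from last to first and returns the start state.
    static State compile(List<String> stageNames) {
        State sink = new State(ENDING_STATE_NAME, null); // the final state is created first
        for (int i = stageNames.size() - 1; i >= 0; i--) {
            sink = new State(stageNames.get(i), sink);
        }
        return sink;
    }

    // Follows the successor links from the start state and collects the state names.
    static List<String> path(State start) {
        List<String> names = new ArrayList<>();
        for (State s = start; s != null; s = s.next) {
            names.add(s.name);
        }
        return names;
    }

    public static void main(String[] args) {
        State start = compile(Arrays.asList("start", "middle", "end"));
        System.out.println(path(start));
    }
}
```

Because every State is built with final fields after its successor exists, the resulting chain never needs to be mutated, which fits the observation that the states no longer change once compiled.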
The NFA itself is created in CepOperator#open:
@Override
public NFA createNFA() {
return new NFA<>(states, windowTime, timeoutHandling);
}
3.2 NFA#process
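Inside the NFA, the member variable states (the one mentioned above) is static and never changes. As data keeps arriving, however, the buffered data moves through different states, and those changes are tracked by NFAState.
A new event produces different successor states depending on where it enters the state machine, so the current event has to drive every active state of the machine. The actual events are kept in SharedBuffer and are read and modified through sharedBufferAccessor.
The implementation below looks complicated; the paper mentioned above explains the detailed logic and is enough to get a general picture.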
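The split between fixed states and changing runtime state can be shown with a toy state machine in plain Java. It is a sketch under strong assumptions: a stage matches when the event name equals the stage name, every step behaves like followedByAny (old partial matches are kept and can branch again), and there is no shared buffer, timeout, or skip strategy. TinyNfaSketch and its members are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

final class TinyNfaSketch {
    // Fixed after construction, like the NFA's states.
    private final List<String> stages;
    // Changes with every event, like NFAState; each entry lists the event ids matched so far.
    private final List<List<String>> partialMatches = new ArrayList<>();

    TinyNfaSketch(List<String> stages) {
        if (stages.isEmpty()) {
            throw new IllegalArgumentException("at least one stage is required");
        }
        this.stages = Collections.unmodifiableList(new ArrayList<>(stages));
    }

    // Feeds one event to every partial match and returns the matches it completes.
    List<List<String>> process(String eventName, String eventId) {
        List<List<String>> completed = new ArrayList<>();
        List<List<String>> branched = new ArrayList<>();
        List<List<String>> candidates = new ArrayList<>(partialMatches);
        candidates.add(new ArrayList<>()); // the empty match stands for the always-active start state
        for (List<String> partial : candidates) {
            String expected = stages.get(partial.size());
            if (expected.equals(eventName)) {
                List<String> extended = new ArrayList<>(partial);
                extended.add(eventId);
                if (extended.size() == stages.size()) {
                    completed.add(extended);
                } else {
                    branched.add(extended);
                }
            }
        }
        partialMatches.addAll(branched); // existing partial matches stay, so later events can extend them too
        return completed;
    }

    public static void main(String[] args) {
        TinyNfaSketch nfa = new TinyNfaSketch(Arrays.asList("start", "middle", "end"));
        nfa.process("start", "e2");
        nfa.process("middle", "m1");
        nfa.process("middle", "m2");
        System.out.println(nfa.process("end", "e8")); // two matches, one per "middle" event
    }
}
```

Keeping each partial match as its own copied list is what a shared buffer is meant to avoid: both matches here repeat e2, whereas a shared structure would store it once.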
private Collection