This article is based on a translation of the official Flink documentation; since the translation may be imperfect, the original text is quoted wherever possible for verification. Original: Dataflow Programming Model
Levels of Abstraction
Flink offers different levels of abstraction for developing streaming/batch applications.
The lowest-level abstraction simply offers stateful streaming. It is embedded into the DataStream API via the Process Function. It allows users to freely process events from one or more streams, and to use consistent, fault-tolerant state. In addition, users can register event-time and processing-time callbacks, allowing programs to realize sophisticated computations.
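To make this concrete, below is a minimal sketch of a Process Function that combines fault-tolerant state with a processing-time timer callback. The tuple input type, class name, and one-minute timeout are assumptions for illustration, not taken from the original documentation; note that keyed state and timers require the stream to be keyed first.

```java
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.ProcessFunction;
import org.apache.flink.util.Collector;

// Counts events per key and emits the count one minute of processing time
// after an element arrives. Usage: stream.keyBy(0).process(new CountWithTimeout())
public class CountWithTimeout extends ProcessFunction<Tuple2<String, Long>, Long> {

    private transient ValueState<Long> count; // fault-tolerant, scoped to the current key

    @Override
    public void open(Configuration parameters) {
        count = getRuntimeContext().getState(
                new ValueStateDescriptor<>("count", Long.class));
    }

    @Override
    public void processElement(Tuple2<String, Long> value, Context ctx, Collector<Long> out)
            throws Exception {
        Long current = count.value();
        count.update(current == null ? 1L : current + 1);
        // register a processing-time callback (only possible on keyed streams)
        ctx.timerService().registerProcessingTimeTimer(
                ctx.timerService().currentProcessingTime() + 60_000);
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<Long> out)
            throws Exception {
        out.collect(count.value()); // fires when the registered timer expires
    }
}
```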
In practice, most applications do not need the low-level abstraction described above, and instead program against the Core APIs: the DataStream API (bounded/unbounded streams) and the DataSet API (bounded data sets). These fluent APIs offer common building blocks for data processing, such as various forms of user-specified transformations, joins, aggregations, windows, state, etc. Data types processed in these APIs are represented as classes in the respective programming languages.
The Table API is a declarative DSL centered around tables, which may be dynamically changing tables (when representing streams). The Table API follows the (extended) relational model: tables have a schema attached (similar to tables in relational databases) and the API offers comparable operations, such as select, project, join, group-by, aggregate, etc. Table API programs declaratively define what logical operation should be done, rather than specifying exactly how the code for the operation looks. Though the Table API is extensible by various types of user-defined functions, it is less expressive than the Core APIs, but more concise to use (less code to write). Additionally, Table API programs go through an optimizer that applies optimization rules before execution.
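A small sketch of this declarative style, assuming a TableEnvironment named tableEnv and a hypothetical registered table "Orders" with columns userId and amount (the string-expression syntax shown is the classic 1.x Table API form):

```java
// Declarative Table API program: it states what to compute, not how.
Table orders = tableEnv.scan("Orders");        // hypothetical registered table
Table totals = orders
    .groupBy("userId")                         // group-by on the userId column
    .select("userId, amount.sum as total");    // relational-style aggregation
```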
Tables and DataStream/DataSet can be converted seamlessly, allowing programs to mix the Table API with the DataStream and DataSet APIs.
The highest-level abstraction offered by Flink is SQL. This abstraction is similar to the Table API both in semantics and expressiveness, but represents programs as SQL query expressions. The SQL abstraction closely interacts with the Table API, and SQL queries can be executed over tables defined in the Table API.
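The same aggregation expressed as a SQL query might look like the following sketch, again over the hypothetical "Orders" table (the method is named sqlQuery in recent 1.x releases; older releases used sql):

```java
// SQL query over a table registered via the Table API.
Table totals = tableEnv.sqlQuery(
    "SELECT userId, SUM(amount) AS total FROM Orders GROUP BY userId");
```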
Programs and Dataflows
The basic building blocks of Flink programs are streams and transformations. (Note that the DataSets used in Flink’s DataSet API are also streams internally – more about that later.) Conceptually a stream is a (potentially never-ending) flow of data records, and a transformation is an operation that takes one or more streams as input, and produces one or more output streams as a result.
When executed, Flink programs are mapped to streaming dataflows, consisting of streams and transformation operators. Each dataflow starts with one or more sources and ends in one or more sinks. The dataflows resemble arbitrary directed acyclic graphs (DAGs). Although special forms of cycles are permitted via iteration constructs, for the most part we will gloss over this for simplicity.
Often there is a one-to-one correspondence between the transformations in the programs and the operators in the dataflow. Sometimes, however, one transformation may consist of multiple transformation operators.
Sources and sinks are documented in the streaming connectors and batch connectors docs. Transformations are documented in DataStream operators and DataSet transformations.
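Putting these pieces together, a minimal sketch of a complete dataflow with one source, a few transformations, and one sink could look as follows (the socket source, five-second window, and word-count logic are arbitrary choices for illustration):

```java
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.Collector;

public class StreamingWordCount {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        DataStream<String> lines = env.socketTextStream("localhost", 9999); // source

        DataStream<Tuple2<String, Integer>> counts = lines
            .flatMap(new Tokenizer())            // transformation: parse into (word, 1)
            .keyBy(0)                            // partition the stream by word
            .timeWindow(Time.seconds(5))         // window the unbounded stream
            .sum(1);                             // transformation: aggregate per window

        counts.print();                          // sink

        env.execute("Streaming WordCount");
    }

    public static final class Tokenizer implements FlatMapFunction<String, Tuple2<String, Integer>> {
        @Override
        public void flatMap(String line, Collector<Tuple2<String, Integer>> out) {
            for (String word : line.toLowerCase().split("\\W+")) {
                if (!word.isEmpty()) {
                    out.collect(Tuple2.of(word, 1));
                }
            }
        }
    }
}
```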
Parallel Dataflows
Programs in Flink are inherently parallel and distributed. During execution, a stream has one or more stream partitions, and each operator has one or more operator subtasks. The operator subtasks are independent of one another, and execute in different threads and possibly on different machines or containers.
The number of operator subtasks is the parallelism of that particular operator. The parallelism of a stream is always that of its producing operator. Different operators of the same program may have different levels of parallelism.
Streams can transport data between two operators in a one-to-one (or forwarding) pattern, or in a redistributing pattern:
- One-to-one streams (for example between the Source and the map() operators in the figure above) preserve the partitioning and ordering of the elements. That means that subtask[1] of the map() operator will see the same elements in the same order as they were produced by subtask[1] of the Source operator.
- Redistributing streams (as between map() and keyBy/window above, as well as between keyBy/window and Sink) change the partitioning of streams. Each operator subtask sends data to different target subtasks, depending on the selected transformation. Examples are keyBy() (which re-partitions by hashing the key), broadcast(), or rebalance() (which re-partitions randomly). In a redistributing exchange the ordering among the elements is only preserved within each pair of sending and receiving subtasks (for example, subtask[1] of map() and subtask[2] of keyBy/window). So in this example, the ordering within each key is preserved, but the parallelism does introduce non-determinism regarding the order in which the aggregated results for different keys arrive at the sink.
Details about configuring and controlling parallelism can be found in the docs on parallel execution.
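As a brief sketch, parallelism can be set for the whole job or overridden per operator (the concrete values below are arbitrary):

```java
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(4);                           // default parallelism for all operators

env.socketTextStream("localhost", 9999)          // this source runs with one subtask
    .map(String::toUpperCase).setParallelism(2)  // per-operator override: 2 subtasks
    .rebalance()                                 // redistributing exchange (round-robin)
    .print().setParallelism(1);                  // sink with a single subtask
```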
Windows
Aggregating events (e.g., counts, sums) works differently on streams than in batch processing. For example, it is impossible to count all elements in a stream, because streams are in general infinite (unbounded). Instead, aggregates on streams (counts, sums, etc), are scoped by windows, such as “count over the last 5 minutes”, or “sum of the last 100 elements”.
Windows can be time driven (example: every 30 seconds) or data driven (example: every 100 elements). One typically distinguishes different types of windows, such as tumbling windows (no overlap), sliding windows (with overlap), and session windows (punctuated by a gap of inactivity).
More window examples can be found in this blog post. More details are in the window docs.
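Sketches of the window types mentioned above, assuming a KeyedStream named keyed whose elements are (key, count) tuples (the concrete sizes are arbitrary; timeWindow is the classic shorthand for time windows):

```java
// time-driven, tumbling: one result per key every 30 seconds, no overlap
keyed.timeWindow(Time.seconds(30)).sum(1);

// time-driven, sliding: 1-minute windows evaluated every 30 seconds (overlapping)
keyed.timeWindow(Time.minutes(1), Time.seconds(30)).sum(1);

// data-driven, tumbling: one result per key for every 100 elements
keyed.countWindow(100).sum(1);

// session windows: a window closes after 10 seconds of inactivity per key
keyed.window(EventTimeSessionWindows.withGap(Time.seconds(10))).sum(1);
```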
Time
When referring to time in a streaming program (for example to define windows), one can refer to different notions of time:
- Event Time is the time when an event was created. It is usually described by a timestamp in the events, for example attached by the producing sensor, or the producing service. Flink accesses event timestamps via timestamp assigners.
- Ingestion time is the time when an event enters the Flink dataflow at the source operator.
- Processing Time is the local time at each operator that performs a time-based operation.
More details on how to handle time are in the event time docs.
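A sketch of working with event time: select the time characteristic and attach a timestamp assigner. The Event class with a long timestamp field and the five-second out-of-orderness bound are assumptions for illustration:

```java
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime); // default is processing time

// Event is a hypothetical POJO with a `timestamp` field set by the producer.
DataStream<Event> withTimestamps = events.assignTimestampsAndWatermarks(
    new BoundedOutOfOrdernessTimestampExtractor<Event>(Time.seconds(5)) {
        @Override
        public long extractTimestamp(Event event) {
            return event.timestamp; // event time, as attached at the source
        }
    });
```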
Stateful Operations
While many operations in a dataflow simply look at one individual event at a time (for example an event parser), some operations remember information across multiple events (for example window operators). These operations are called stateful.
The state of stateful operations is maintained in what can be thought of as an embedded key/value store. The state is partitioned and distributed strictly together with the streams that are read by the stateful operators. Hence, access to the key/value state is only possible on keyed streams, after a keyBy() function, and is restricted to the values associated with the current event’s key. Aligning the keys of streams and state makes sure that all state updates are local operations, guaranteeing consistency without transaction overhead. This alignment also allows Flink to redistribute the state and adjust the stream partitioning transparently.
For more information, see the documentation on state.
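A sketch of keyed state in a user function: a RichFlatMapFunction that keeps a running count per key in a ValueState (the tuple input type and class name are assumptions for illustration):

```java
import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.Collector;

// Usage: stream.keyBy(0).flatMap(new CountPerKey())
public class CountPerKey extends RichFlatMapFunction<Tuple2<String, Long>, Tuple2<String, Long>> {

    private transient ValueState<Long> count; // automatically scoped to the current key

    @Override
    public void open(Configuration parameters) {
        count = getRuntimeContext().getState(
                new ValueStateDescriptor<>("count", Long.class));
    }

    @Override
    public void flatMap(Tuple2<String, Long> in, Collector<Tuple2<String, Long>> out)
            throws Exception {
        Long current = count.value();                    // read: only this key's value is visible
        long updated = (current == null ? 0L : current) + 1;
        count.update(updated);                           // local update, no transaction needed
        out.collect(Tuple2.of(in.f0, updated));
    }
}
```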
Checkpoints for Fault Tolerance
Flink implements fault tolerance using a combination of stream replay and checkpointing. A checkpoint is related to a specific point in each of the input streams along with the corresponding state for each of the operators. A streaming dataflow can be resumed from a checkpoint while maintaining consistency (exactly-once processing semantics) by restoring the state of the operators and replaying the events from the point of the checkpoint.
The checkpoint interval is a means of trading off the overhead of fault tolerance during execution with the recovery time (the number of events that need to be replayed).
The description of the fault tolerance internals provides more information about how Flink manages checkpoints and related topics. Details about enabling and configuring checkpointing are in the checkpointing API docs.
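Enabling and tuning checkpointing happens on the execution environment; a minimal sketch (the 10-second interval is an arbitrary trade-off value):

```java
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

// Checkpoint every 10 seconds: a longer interval means less overhead during
// normal processing, but more events to replay on recovery.
env.enableCheckpointing(10_000);
env.getCheckpointConfig().setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);
```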
Batch on Streaming
Flink executes batch programs as a special case of streaming programs, where the streams are bounded (finite number of elements). A DataSet is treated internally as a stream of data. The concepts above thus apply to batch programs in the same way as well as they apply to streaming programs, with minor exceptions:
- Fault tolerance for batch programs does not use checkpointing. Recovery happens by fully replaying the streams. That is possible, because inputs are bounded. This pushes the cost more towards the recovery, but makes the regular processing cheaper, because it avoids checkpoints.
- Stateful operations in the DataSet API use simplified in-memory/out-of-core data structures, rather than key/value indexes.
- The DataSet API introduces special synchronized (superstep-based) iterations, which are only possible on bounded streams. For details, check out the iteration docs.
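A sketch of a superstep-based bulk iteration in the DataSet API (the trivial increment step and the count of 10 supersteps are arbitrary choices):

```java
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.operators.IterativeDataSet;

public class BulkIterationExample {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // start value 0, iterate for 10 synchronized supersteps
        IterativeDataSet<Long> initial = env.fromElements(0L).iterate(10);

        // the step function runs once per superstep over the bounded data set
        DataSet<Long> incremented = initial.map(new MapFunction<Long, Long>() {
            @Override
            public Long map(Long value) {
                return value + 1;
            }
        });

        // feed the result back as the input of the next superstep
        DataSet<Long> result = initial.closeWith(incremented);

        result.print(); // prints 10
    }
}
```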
Next Steps
Continue with the basic concepts of Flink's Distributed Runtime.