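Background: A few readers recently asked me about state backends. I hesitated for a long time over whether to write this post on Flink state backends, because the official documentation already explains them in detail and a little time spent there is enough to understand them. In the end I wrote it mainly to organize and review my own understanding, and hopefully to give newcomers a little help.
Preface (Flink version 1.12): This post assumes you already have some understanding of Flink state. If not, read "What is State?" first to get familiar with the concept.
A state backend is the mechanism that, during stateful stream computation, keeps the values computed in windows, transformation functions, and so on either in memory or in persistent storage, so that they are available to the next operator or function and their local variables become fault tolerant.
The following explanation comes from the official documentation: Apache Flink 1.12 Documentation: Stateful Stream Processing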
State Backends
The exact data structures in which the key/values indexes are stored depends on the chosen state backend. One state backend stores data in an in-memory hash map, another state backend uses RocksDB as the key/value store. In addition to defining the data structure that holds the state, the state backends also implement the logic to take a point-in-time snapshot of the key/value state and store that snapshot as part of a checkpoint. State backends can be configured without changing your application logic.
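2.1 Avoiding data loss
2.2 Data recovery
3.1 Windows gather elements or aggregates
3.2 Transformation functions can use key/value state to store values
3.3 Transformation functions can implement the CheckpointedFunction interface to make their local variables fault tolerant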
State Backends
Programs written in the Data Stream API often hold state in various forms:
- Windows gather elements or aggregates until they are triggered
- Transformation functions may use the key/value state interface to store values
- Transformation functions may implement the CheckpointedFunction interface to make their local variables fault tolerant
See also state section in the streaming API guide.
When checkpointing is activated, such state is persisted upon checkpoints to guard against data loss and recover consistently. How the state is represented internally, and how and where it is persisted upon checkpoints depends on the chosen State Backend.
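The idea of persisting state at a checkpoint and recovering from it can be sketched in plain Java. This is a toy model rather than Flink code: the class ToyKeyedState and its methods are hypothetical names, and the "checkpoint" is only an in-memory copy of a hash map.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of keyed state with point-in-time snapshots; not Flink code. */
class ToyKeyedState {
    // Working state, analogous to the in-memory hash map a heap backend keeps.
    private Map<String, Long> counts = new HashMap<>();

    void add(String key, long delta) {
        counts.merge(key, delta, Long::sum);
    }

    long get(String key) {
        return counts.getOrDefault(key, 0L);
    }

    /** Copies the current state; the copy stands in for a checkpoint. */
    Map<String, Long> snapshot() {
        return new HashMap<>(counts);
    }

    /** Replaces the working state with a previously taken snapshot. */
    void restore(Map<String, Long> checkpoint) {
        counts = new HashMap<>(checkpoint);
    }

    public static void main(String[] args) {
        ToyKeyedState state = new ToyKeyedState();
        state.add("a", 2);
        Map<String, Long> checkpoint = state.snapshot();
        state.add("a", 5);                  // update made after the checkpoint
        state.restore(checkpoint);          // simulate recovery after a failure
        System.out.println(state.get("a")); // prints 2
    }
}
```

A real state backend additionally decides where the copy is stored (JobManager heap, a file system, or RocksDB files), which is what the rest of this post covers.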
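The corresponding section of the official documentation is: Available State Backends
Flink ships with three state backends out of the box: MemoryStateBackend, FsStateBackend, and RocksDBStateBackend.
By default, if no configuration has been changed, a program uses the MemoryStateBackend, that is, in-memory storage.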
The MemoryStateBackend holds data internally as objects on the Java heap. Key/value state and window operators hold hash tables that store the values, triggers, etc.
Upon checkpoints, this state backend will snapshot the state and send it as part of the checkpoint acknowledgement messages to the JobManager, which stores it on its heap as well.
The MemoryStateBackend can be configured to use asynchronous snapshots. While we strongly encourage the use of asynchronous snapshots to avoid blocking pipelines, please note that this is currently enabled by default. To disable this feature, users can instantiate a MemoryStateBackend with the corresponding boolean flag in the constructor set to false (this should only be used for debugging), e.g.: new MemoryStateBackend(MAX_MEM_STATE_SIZE, false);
The FsStateBackend is configured with a file system URL (type, address, path), such as “hdfs://namenode:40010/flink/checkpoints” or “file:///data/flink/checkpoints”.
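The three URL parts mentioned above (type, address, path) can be inspected with the JDK's java.net.URI. This is a small sketch using only the standard library; the class name CheckpointUriParts is hypothetical and nothing here calls Flink.

```java
import java.net.URI;

/** Splits a checkpoint URL into the parts the text calls type, address, and path. */
class CheckpointUriParts {
    static String describe(String url) {
        URI uri = URI.create(url);
        String host = uri.getHost();   // null when the URL has no authority part
        int port = uri.getPort();      // -1 when no port is given
        String address = host == null ? "(none)" : (port == -1 ? host : host + ":" + port);
        return "type=" + uri.getScheme() + " address=" + address + " path=" + uri.getPath();
    }

    public static void main(String[] args) {
        System.out.println(describe("hdfs://namenode:40010/flink/checkpoints"));
        System.out.println(describe("file:///data/flink/checkpoints"));
    }
}
```

For the HDFS URL the address is the NameNode host and port, while the file URL has no address and points at a path on the local file system.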
The FsStateBackend holds in-flight data in the TaskManager’s memory. Upon checkpointing, it writes state snapshots into files in the configured file system and directory. Minimal metadata is stored in the JobManager’s memory (or, in high-availability mode, in the metadata checkpoint).
The FsStateBackend uses asynchronous snapshots by default to avoid blocking the processing pipeline while writing state checkpoints. To disable this feature, users can instantiate a FsStateBackend with the corresponding boolean flag in the constructor set to false, e.g.: new FsStateBackend(path, false);
// Java
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setStateBackend(new FsStateBackend("hdfs://namenode:40010/flink/checkpoints"));
//Scala
val env = StreamExecutionEnvironment.getExecutionEnvironment()
env.setStateBackend(new FsStateBackend("hdfs://namenode:40010/flink/checkpoints"))
1. Configure flink-conf.yaml
A default state backend can be configured in the flink-conf.yaml, using the configuration key state.backend.
Possible values for the config entry are jobmanager (MemoryStateBackend), filesystem (FsStateBackend), rocksdb (RocksDBStateBackend), or the fully qualified class name of the class that implements the state backend factory StateBackendFactory, such as org.apache.flink.contrib.streaming.state.RocksDBStateBackendFactory for RocksDBStateBackend.
The state.checkpoints.dir option defines the directory to which all backends write checkpoint data and meta data files. More details about the checkpoint directory structure are in the checkpoints documentation.
# The backend that will be used to store operator state checkpoints
state.backend: filesystem
# Directory for storing checkpoints
state.checkpoints.dir: hdfs://namenode:40010/flink/checkpoints
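The mapping from state.backend values to backends can be written down as a small lookup. This sketch only mirrors the rule described in the documentation text; StateBackendNames and resolve are hypothetical names and this is not how Flink itself loads a backend.

```java
import java.util.Map;

/** Illustrative lookup for state.backend values; not Flink's actual loader. */
class StateBackendNames {
    // Shortcut names and the backends the documentation says they select.
    private static final Map<String, String> SHORTCUTS = Map.of(
            "jobmanager", "MemoryStateBackend",
            "filesystem", "FsStateBackend",
            "rocksdb", "RocksDBStateBackend");

    /** Returns the backend for a shortcut, or treats the value as a factory class name. */
    static String resolve(String configValue) {
        String backend = SHORTCUTS.get(configValue);
        return backend != null ? backend : "factory class " + configValue;
    }

    public static void main(String[] args) {
        System.out.println(resolve("filesystem"));
        System.out.println(resolve(
                "org.apache.flink.contrib.streaming.state.RocksDBStateBackendFactory"));
    }
}
```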
The RocksDBStateBackend is configured with a file system URL (type, address, path), such as “hdfs://namenode:40010/flink/checkpoints” or “file:///data/flink/checkpoints”.
The RocksDBStateBackend holds in-flight data in a RocksDB database that is (per default) stored in the TaskManager data directories. Upon checkpointing, the whole RocksDB database will be checkpointed into the configured file system and directory. Minimal metadata is stored in the JobManager’s memory (or, in high-availability mode, in the metadata checkpoint).
The RocksDBStateBackend always performs asynchronous snapshots.
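Note that the amount of state you can keep is limited only by the amount of available disk space. This allows much larger state than the FsStateBackend, which keeps state in memory. It also means, however, that the maximum achievable throughput is lower with this state backend. All reads and writes go through (de-)serialization to retrieve or store the state objects, which is considerably more expensive than working with the on-heap representation as the heap-based backends do.
See also the recommendations on task executor memory configuration for the RocksDBStateBackend.
The RocksDBStateBackend is currently the only backend that offers incremental checkpoints.
Certain RocksDB native metrics are available but disabled by default; the configuration documentation lists them in full.
The total memory of the RocksDB instance(s) per slot can also be bounded; see the memory management documentation for details.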
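Add the dependency to pom.xml:
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-statebackend-rocksdb_2.11</artifactId>
    <version>1.12.0</version>
    <scope>provided</scope>
</dependency>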
// Java
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setStateBackend(new RocksDBStateBackend("hdfs://namenode:40010/flink/checkpoints"));
//Scala
val env = StreamExecutionEnvironment.getExecutionEnvironment()
env.setStateBackend(new RocksDBStateBackend("hdfs://namenode:40010/flink/checkpoints"))
1. Configure flink-conf.yaml
# The backend that will be used to store operator state checkpoints
state.backend: rocksdb
# Directory for storing checkpoints
state.checkpoints.dir: hdfs://namenode:40010/flink/checkpoints
Currently, Flink’s savepoint binary format is state backend specific. A savepoint taken with one state backend cannot be restored using another, and you should carefully consider which backend you use before going to production.
In general, we recommend avoiding MemoryStateBackend in production because it stores its snapshots inside the JobManager as opposed to persistent disk. When deciding between FsStateBackend and RocksDB, it is a choice between performance and scalability. FsStateBackend is very fast as each state access and update operates on objects on the Java heap; however, state size is limited by available memory within the cluster. On the other hand, RocksDB can scale based on available disk space and is the only state backend to support incremental snapshots. However, each state access and update requires (de-)serialization and potentially reading from disk which leads to average performance that is an order of magnitude slower than the memory state backends.
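The rule of thumb in the paragraph above can be condensed into a few lines. This is only one possible reading of that guidance, not an official decision procedure; BackendChoice, recommend, and the three boolean inputs are hypothetical simplifications.

```java
/** Encodes the rule of thumb from the paragraph above; illustrative only. */
class BackendChoice {
    static String recommend(boolean production, boolean stateFitsInMemory,
                            boolean needsIncrementalSnapshots) {
        // RocksDB is the option for state beyond memory or for incremental snapshots.
        if (!stateFitsInMemory || needsIncrementalSnapshots) {
            return "RocksDBStateBackend";
        }
        // Heap-based backends are faster; MemoryStateBackend is kept out of production.
        return production ? "FsStateBackend" : "MemoryStateBackend";
    }

    public static void main(String[] args) {
        System.out.println(recommend(true, true, false));   // FsStateBackend
        System.out.println(recommend(true, false, false));  // RocksDBStateBackend
        System.out.println(recommend(false, true, false));  // MemoryStateBackend
    }
}
```

Because a savepoint taken with one backend cannot be restored with another, a choice like this is best settled before the job goes to production.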
Limitations of the MemoryStateBackend:
- The size of each individual state is by default limited to 5 MB. This value can be increased in the constructor of the MemoryStateBackend.
- Irrespective of the configured maximal state size, the state cannot be larger than the akka frame size (see Configuration).
- The aggregate state must fit into the JobManager memory.
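The first two limits can be expressed as a simple size check. This is a sketch under stated assumptions: the 5 MB default comes from the list above, while the 10 MB frame size used in main is an assumed value for illustration, and MemoryBackendLimits is a hypothetical helper rather than Flink code. The third limit (aggregate state fitting into JobManager memory) is not modeled.

```java
/** Checks one state's size against the two per-state limits listed above. */
class MemoryBackendLimits {
    // Default per-state limit stated in the text (5 MB).
    static final long DEFAULT_MAX_STATE_SIZE = 5L * 1024 * 1024;

    /** True if the serialized state passes both the configured maximum and the frame size. */
    static boolean fits(long stateBytes, long maxStateSize, long akkaFrameSize) {
        return stateBytes <= maxStateSize && stateBytes <= akkaFrameSize;
    }

    public static void main(String[] args) {
        long assumedFrameSize = 10L * 1024 * 1024; // assumption, not taken from the text
        long fourMb = 4L * 1024 * 1024;
        long twelveMb = 12L * 1024 * 1024;
        System.out.println(fits(fourMb, DEFAULT_MAX_STATE_SIZE, assumedFrameSize));     // true
        // Raising the constructor limit to 64 MB does not help: the frame size still caps it.
        System.out.println(fits(twelveMb, 64L * 1024 * 1024, assumedFrameSize));         // false
    }
}
```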
The MemoryStateBackend is encouraged for:
- Local development and debugging
- Jobs that hold little state, such as jobs that consist only of record-at-a-time functions (Map, FlatMap, Filter, …). The Kafka Consumer requires very little state.
It is also recommended to set managed memory to zero. This will ensure that the maximum amount of memory is allocated for user code on the JVM.
The FsStateBackend is encouraged for:
- Jobs with large state, long windows, large key/value states.
- All high-availability setups.
It is also recommended to set managed memory to zero. This will ensure that the maximum amount of memory is allocated for user code on the JVM.
Limitations of the RocksDBStateBackend:
As RocksDB’s JNI bridge API is based on byte[], the maximum supported size per key and per value is 2^31 bytes each. IMPORTANT: states that use merge operations in RocksDB (e.g. ListState) can silently accumulate value sizes > 2^31 bytes and will then fail on their next retrieval. This is currently a limitation of RocksDB JNI
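A rough way to reason about this limit is to estimate how large a merged list value can grow. The sketch below is back-of-the-envelope arithmetic only: RocksDbValueSizeCheck is a hypothetical helper, and it ignores any per-element overhead the serialized list may carry.

```java
/** Estimates whether an accumulated list value would pass the 2^31-byte limit. */
class RocksDbValueSizeCheck {
    // Limit stated in the text: 2^31 bytes per key and per value.
    static final long JNI_LIMIT_BYTES = 1L << 31;

    static boolean exceedsLimit(long elementCount, long bytesPerElement) {
        // Math.multiplyExact throws instead of silently overflowing a long.
        return Math.multiplyExact(elementCount, bytesPerElement) > JNI_LIMIT_BYTES;
    }

    public static void main(String[] args) {
        System.out.println(exceedsLimit(30_000_000L, 64)); // 1.92e9 bytes: false
        System.out.println(exceedsLimit(40_000_000L, 64)); // 2.56e9 bytes: true
    }
}
```

Since the failure only shows up on the next retrieval, an estimate like this is more useful at design time than at runtime.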
More details on the RocksDBStateBackend implementation:
RocksDB State Backend Details
This section describes the RocksDB state backend in more detail.
Incremental Checkpoints
RocksDB supports Incremental Checkpoints, which can dramatically reduce the checkpointing time in comparison to full checkpoints. Instead of producing a full, self-contained backup of the state backend, incremental checkpoints only record the changes that happened since the latest completed checkpoint.
An incremental checkpoint builds upon (typically multiple) previous checkpoints. Flink leverages RocksDB’s internal compaction mechanism in a way that is self-consolidating over time. As a result, the incremental checkpoint history in Flink does not grow indefinitely, and old checkpoints are eventually subsumed and pruned automatically.
Recovery time of incremental checkpoints may be longer or shorter compared to full checkpoints. If your network bandwidth is the bottleneck, it may take a bit longer to restore from an incremental checkpoint, because it implies fetching more data (more deltas). Restoring from an incremental checkpoint is faster, if the bottleneck is your CPU or IOPs, because restoring from an incremental checkpoint means not re-building the local RocksDB tables from Flink’s canonical key/value snapshot format (used in savepoints and full checkpoints).
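The "only record the changes" idea can be pictured as a set difference over files. This is a toy model, assuming state is stored as a set of immutable files identified by name; IncrementalDelta and the file names are hypothetical, and the real mechanism also tracks which checkpoints still reference each file.

```java
import java.util.Set;
import java.util.TreeSet;

/** Toy model: an incremental checkpoint uploads only files absent from the previous one. */
class IncrementalDelta {
    static Set<String> filesToUpload(Set<String> previousCheckpoint, Set<String> currentFiles) {
        Set<String> delta = new TreeSet<>(currentFiles); // sorted for stable output
        delta.removeAll(previousCheckpoint);
        return delta;
    }

    public static void main(String[] args) {
        Set<String> previous = Set.of("001.sst", "002.sst");
        // 001.sst is gone (for example merged away by compaction); 003 and 004 are new.
        Set<String> current = Set.of("002.sst", "003.sst", "004.sst");
        System.out.println(filesToUpload(previous, current)); // prints [003.sst, 004.sst]
    }
}
```

In this picture a restore has to fetch every file the checkpoint references, which is why recovery can involve more data than a single full snapshot.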
While we encourage the use of incremental checkpoints for large state, you need to enable this feature manually:
- Setting a default in your flink-conf.yaml: state.backend.incremental: true will enable incremental checkpoints, unless the application overrides this setting in the code.
- You can alternatively configure this directly in the code (overrides the config default): RocksDBStateBackend backend = new RocksDBStateBackend(checkpointDirURI, true);
Notice that once incremental checkpointing is enabled, the Checkpointed Data Size shown in the web UI only represents the delta checkpointed data size of that checkpoint instead of the full state size.
Memory Management
Flink aims to control the total process memory consumption to make sure that the Flink TaskManagers have a well-behaved memory footprint. That means staying within the limits enforced by the environment (Docker/Kubernetes, Yarn, etc) to not get killed for consuming too much memory, but also to not under-utilize memory (unnecessary spilling to disk, wasted caching opportunities, reduced performance).
To achieve that, Flink by default configures RocksDB’s memory allocation to the amount of managed memory of the TaskManager (or, more precisely, task slot). This should give good out-of-the-box experience for most applications, meaning most applications should not need to tune any of the detailed RocksDB settings. The primary mechanism for improving memory-related performance issues would be to simply increase Flink’s managed memory.
Users can choose to deactivate that feature and let RocksDB allocate memory independently per ColumnFamily (one per state per operator). This offers expert users ultimately more fine grained control over RocksDB, but means that users need to take care themselves that the overall memory consumption does not exceed the limits of the environment. See large state tuning for some guideline about large state performance tuning.
Managed Memory for RocksDB
This feature is active by default and can be (de)activated via the state.backend.rocksdb.memory.managed configuration key.
Flink does not directly manage RocksDB’s native memory allocations, but configures RocksDB in a certain way to ensure it uses exactly as much memory as Flink has for its managed memory budget. This is done on a per-slot level (managed memory is accounted per slot).
To set the total memory usage of RocksDB instance(s), Flink leverages a shared cache and write buffer manager among all instances in a single slot. The shared cache will place an upper limit on the three components that use the majority of memory in RocksDB: block cache, index and bloom filters, and MemTables.
For advanced tuning, Flink also provides two parameters to control the division of memory between the write path (MemTable) and read path (index & filters, remaining cache). When you see that RocksDB performs badly due to lack of write buffer memory (frequent flushes) or cache misses, you can use these parameters to redistribute the memory.
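To get a feel for the numbers, the two ratios (write-buffer-ratio, default 0.5, and high-prio-pool-ratio, default 0.1) can be applied to a per-slot budget. This sketch reads both ratios literally as fractions of the given memory, as the documentation text does; Flink's internal calculation may differ in detail, and RocksDbMemorySplit and the 1 GiB budget are hypothetical.

```java
/** Applies the two documented ratios to a per-slot managed memory budget. */
class RocksDbMemorySplit {
    /** Bytes available to the write buffer manager (MemTables). */
    static long writeBufferBytes(long slotBudgetBytes, double writeBufferRatio) {
        return (long) (slotBudgetBytes * writeBufferRatio);
    }

    /** Bytes of the shared cache reserved with high priority for index and filters. */
    static long highPrioPoolBytes(long slotBudgetBytes, double highPrioPoolRatio) {
        return (long) (slotBudgetBytes * highPrioPoolRatio);
    }

    public static void main(String[] args) {
        long budget = 1L << 30; // assumed 1 GiB of managed memory per slot
        System.out.println(writeBufferBytes(budget, 0.5));  // 536870912
        System.out.println(highPrioPoolBytes(budget, 0.1)); // 107374182
    }
}
```

Raising the write buffer ratio trades read cache for fewer flushes, and lowering it does the opposite.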
- state.backend.rocksdb.memory.write-buffer-ratio, by default 0.5, which means 50% of the given memory would be used by write buffer manager.
- state.backend.rocksdb.memory.high-prio-pool-ratio, by default 0.1, which means 10% of the given memory would be set as high priority for index and filters in shared block cache. We strongly suggest not to set this to zero, to prevent index and filters from competing against data blocks for staying in cache and causing performance issues. Moreover, the L0 level filter and index are pinned into the cache by default to mitigate performance problems; for more details, refer to the RocksDB documentation.
Note: When the above described mechanism (cache and write buffer manager) is enabled, it will override any customized settings for block caches and write buffers done via PredefinedOptions and RocksDBOptionsFactory.
Note (Expert Mode): To control memory manually, you can set state.backend.rocksdb.memory.managed to false and configure RocksDB via ColumnFamilyOptions. Alternatively, you can use the above mentioned cache/buffer-manager mechanism, but set the memory size to a fixed amount independent of Flink’s managed memory size (state.backend.rocksdb.memory.fixed-per-slot option). Note that in both cases, users need to ensure on their own that enough memory is available outside the JVM for RocksDB.
Timers (Heap vs. RocksDB)
Timers are used to schedule actions for later (event-time or processing-time), such as firing a window, or calling back a ProcessFunction.
When selecting the RocksDB State Backend, timers are by default also stored in RocksDB. That is a robust and scalable way that lets applications scale to many timers. However, maintaining timers in RocksDB can have a certain cost, which is why Flink provides the option to store timers on the JVM heap instead, even when RocksDB is used to store other states. Heap-based timers can have a better performance when there is a smaller number of timers.
Set the configuration option state.backend.rocksdb.timer-service.factory to heap (rather than the default, rocksdb) to store timers on heap.
Note: The combination RocksDB state backend with heap-based timers currently does NOT support asynchronous snapshots for the timers state. Other state like keyed state is still snapshotted asynchronously.
Note: When using RocksDB state backend with heap-based timers, checkpointing and taking savepoints is expected to fail if there are operators in application that write to raw keyed state.
Enabling RocksDB Native Metrics
You can optionally access RocksDB’s native metrics through Flink’s metrics system, by enabling certain metrics selectively. See configuration docs for details.
Note: Enabling RocksDB's native metrics may have a negative performance impact on your application.
Predefined Per-ColumnFamily Options
Note: With the introduction of memory management for RocksDB this mechanism should be mainly used for expert tuning or trouble shooting.
With Predefined Options, users can apply some predefined config profiles on each RocksDB Column Family, configuring for example memory use, thread, compaction settings, etc. There is currently one Column Family per each state in each operator.
There are two ways to select predefined options to be applied:
- Set the option’s name in flink-conf.yaml via state.backend.rocksdb.predefined-options.
- Set the predefined options programmatically: RocksDBStateBackend.setPredefinedOptions(PredefinedOptions.SPINNING_DISK_OPTIMIZED_HIGH_MEM).
The default value for this option is DEFAULT which translates to PredefinedOptions.DEFAULT.
Note: Predefined options set programmatically would override the ones configured via flink-conf.yaml.
Passing Options Factory to RocksDB
Note: With the introduction of memory management for RocksDB this mechanism should be mainly used for expert tuning or trouble shooting.
To manually control RocksDB’s options, you need to configure a RocksDBOptionsFactory. This mechanism gives you fine-grained control over the settings of the Column Families, for example memory use, thread, compaction settings, etc. There is currently one Column Family per each state in each operator.
There are two ways to pass a RocksDBOptionsFactory to the RocksDB State Backend:
- Configure options factory class name in the flink-conf.yaml via state.backend.rocksdb.options-factory.
- Set the options factory programmatically, e.g. RocksDBStateBackend.setRocksDBOptions(new MyOptionsFactory());
Note: An options factory set programmatically overrides the one configured via flink-conf.yaml, and an options factory has a higher priority than the predefined options if ever configured or set.
Note: RocksDB is a native library that allocates memory directly from the process, and not from the JVM. Any memory you assign to RocksDB will have to be accounted for, typically by decreasing the JVM heap size of the TaskManagers by the same amount. Not doing that may result in YARN/Mesos/etc terminating the JVM processes for allocating more memory than configured.
Reading Column Family Options from flink-conf.yaml
When a RocksDBOptionsFactory implements the ConfigurableRocksDBOptionsFactory interface, it can directly read settings from the configuration (flink-conf.yaml).
The default value for state.backend.rocksdb.options-factory is in fact org.apache.flink.contrib.streaming.state.DefaultConfigurableOptionsFactory which picks up all config options defined here by default. Hence, you can configure low-level Column Family options simply by turning off managed memory for RocksDB and putting the relevant entries in the configuration.
Below is an example of how to define a custom ConfigurableOptionsFactory (set the class name under state.backend.rocksdb.options-factory).
import java.util.Collection;

import org.apache.flink.configuration.Configuration;
import org.apache.flink.contrib.streaming.state.ConfigurableRocksDBOptionsFactory;
import org.apache.flink.contrib.streaming.state.RocksDBOptionsFactory;
import org.rocksdb.BlockBasedTableConfig;
import org.rocksdb.ColumnFamilyOptions;
import org.rocksdb.DBOptions;

public class MyOptionsFactory implements ConfigurableRocksDBOptionsFactory {

    private static final long DEFAULT_SIZE = 256 * 1024 * 1024; // 256 MB
    private long blockCacheSize = DEFAULT_SIZE;

    // Database-wide options: background parallelism and fsync behavior.
    @Override
    public DBOptions createDBOptions(DBOptions currentOptions, Collection<AutoCloseable> handlesToClose) {
        return currentOptions.setIncreaseParallelism(4)
                .setUseFsync(false);
    }

    // Per-column-family options: block cache size and block size.
    @Override
    public ColumnFamilyOptions createColumnOptions(
            ColumnFamilyOptions currentOptions, Collection<AutoCloseable> handlesToClose) {
        return currentOptions.setTableFormatConfig(
                new BlockBasedTableConfig()
                        .setBlockCacheSize(blockCacheSize)
                        .setBlockSize(128 * 1024)); // 128 KB
    }

    // Reads the block cache size from the Flink configuration, falling back to the default.
    @Override
    public RocksDBOptionsFactory configure(Configuration configuration) {
        this.blockCacheSize =
                configuration.getLong("my.custom.rocksdb.block.cache.size", DEFAULT_SIZE);
        return this;
    }
}