Flink State

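Flink continuously receives data from external sources, and each incoming record triggers the corresponding computation. When Flink aggregates data, it cannot recompute over every record that has ever arrived; each computation is instead an incremental update on top of the previous result. That previous result is stored as State and persisted to a state backend, so that it can be restored after a system failure.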
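
To make the idea of incremental computation concrete, here is a minimal plain-Java sketch of a per-key running sum. It does not use the Flink API; the class name RunningSumSketch and its method onRecord are hypothetical and exist only for this illustration.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical illustration: a per-key running sum kept as state. */
class RunningSumSketch {
    // The "state": the last computed result for each key.
    private final Map<String, Long> sumByKey = new HashMap<>();

    /** Processes one record by updating the previous result; old records are never re-read. */
    long onRecord(String key, long value) {
        return sumByKey.merge(key, value, Long::sum);
    }

    public static void main(String[] args) {
        RunningSumSketch op = new RunningSumSketch();
        System.out.println(op.onRecord("a", 3));  // 3
        System.out.println(op.onRecord("a", 4));  // 7
        System.out.println(op.onRecord("b", 10)); // 10
    }
}
```

Each call touches only the stored sum for one key, which is why the cost of processing a record does not grow with the amount of data already seen.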

What is State

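State is the intermediate result or metadata that a compute node maintains during stream processing. During an aggregation, for example, the intermediate aggregate is recorded in state; when Apache Kafka is the data source, the offsets of the records already read must be recorded as well. This state data is persisted (inserted or updated) as the computation proceeds. State in Apache Flink is therefore time-dependent: it is a snapshot of the internal data of an Apache Flink job (computed data and metadata).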

Why State is needed

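Compared with batch processing, State is specific to stream processing: a batch job has no failover mechanism and either succeeds or is recomputed from scratch. Stream processing is incremental in most scenarios. Records are usually processed one at a time, and each computation builds on the previous result, so that result has to be stored (and, in production, persisted). In addition, when a program fails because of machine, network, or dirty-data problems, the restarted job must restore its state from the last successful checkpoint. Both incremental computation and failover depend on State.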
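
The restore-from-checkpoint step can be sketched in a few lines of plain Java. This is only a toy model under stated assumptions, not Flink's checkpointing implementation: the class CheckpointSketch is hypothetical, and an in-memory map stands in for durable storage.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical illustration of snapshot and restore; not Flink's checkpointing code. */
class CheckpointSketch {
    private Map<String, Long> counts = new HashMap<>();         // live state
    private Map<String, Long> lastCheckpoint = new HashMap<>(); // stands in for durable storage

    void onRecord(String key) {
        counts.merge(key, 1L, Long::sum);
    }

    /** Takes a copy of the live state. */
    void checkpoint() {
        lastCheckpoint = new HashMap<>(counts);
    }

    /** Simulates a restart: the live state is rebuilt from the last successful checkpoint. */
    void restore() {
        counts = new HashMap<>(lastCheckpoint);
    }

    long count(String key) {
        return counts.getOrDefault(key, 0L);
    }

    public static void main(String[] args) {
        CheckpointSketch op = new CheckpointSketch();
        op.onRecord("a");
        op.onRecord("a");
        op.checkpoint();
        op.onRecord("a");  // processed after the checkpoint, lost on failure
        op.restore();
        System.out.println(op.count("a")); // 2
    }
}
```

After the restore, the record processed after the checkpoint is no longer reflected in the state; in a real job the source would replay it from the offset stored in the same checkpoint.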

State categories

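Internally, Apache Flink divides State into the following two categories, depending on whether it is scoped to an operator or to a data group (key):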

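KeyedState: the key is the field used in GROUP BY / PARTITION BY in a SQL statement, or the keyBy field in the API. The key's value is the byte array of the Row formed by the GROUP BY / PARTITION BY fields. Each key has its own State, and the State of one key is not visible to another. KeyedState can be used only in operations and functions on a KeyedStream. The supported data structures are ValueState, ListState, MapState, AggregatingState, and ReducingState.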
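
The per-key isolation described here can be modeled in plain Java. The sketch below is not Flink's ValueState; KeyedValueStateSketch and setCurrentKey are hypothetical names, and the assumption is simply that the runtime selects a "current key" before user code runs, so reads and writes only ever see that key's entry.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical model of a keyed value state: access is scoped to the current key. */
class KeyedValueStateSketch<K, V> {
    private final Map<K, V> valueByKey = new HashMap<>();
    private K currentKey;

    /** In this model the runtime calls this before invoking user code for a record. */
    void setCurrentKey(K key) {
        this.currentKey = key;
    }

    /** Returns the value stored for the current key, or null if none was stored. */
    V value() {
        return valueByKey.get(currentKey);
    }

    void update(V newValue) {
        valueByKey.put(currentKey, newValue);
    }

    public static void main(String[] args) {
        KeyedValueStateSketch<String, Integer> state = new KeyedValueStateSketch<>();
        state.setCurrentKey("a");
        state.update(1);
        state.setCurrentKey("b");
        System.out.println(state.value()); // null: key "b" cannot see the state of key "a"
        state.setCurrentKey("a");
        System.out.println(state.value()); // 1
    }
}
```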

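OperatorState: bound to a parallel operator instance. The Kafka consumer, for example, keeps the offset of each partition it reads. OperatorState can be implemented through the CheckpointedFunction interface or the ListCheckpointed interface.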

Redistributing State when rescaling

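Apache Flink is a massively parallel distributed system that supports stateful stream processing at scale. For scalability, an Apache Flink job is logically decomposed into an operator graph, and the execution of each operator is physically decomposed into multiple parallel operator instances. Conceptually, each parallel operator instance in Apache Flink is an independent task that can be scheduled on its own machine within a cluster of network-connected machines.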

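In the Apache Flink DAG, only nodes connected by an edge (upstream and downstream operators) communicate over the network. In other words, the DAG has network I/O in the vertical direction but none in the horizontal direction, that is, between the parallel subtasks of the same stateful node. This model also ensures that each operator instance maintains its own state, stored on local disk (and synchronized to remote storage asynchronously). With this design all of a task's state is local, and accessing state requires no network communication between tasks. Avoiding that traffic is essential for the scalability of a massively parallel distributed system like Apache Flink.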

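As described above, State in Apache Flink is either OperatorState or KeyedState. How is State redistributed when a job is scaled out (its parallelism is increased)? As a running example, suppose the external source has 5 partitions, the Source in Apache Flink is scaled from a parallelism of 1 to 2, and the intermediate Stateful Operation node is scaled from a parallelism of 2 to 3.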

Rescaling OperatorState

How OperatorState handles rescaling

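The most direct use of OperatorState is a Source recording the offsets it has read from the external system. This state is small, so it is redistributed simply by taking a modulo.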

Mapping from Kafka partitions to subtasks:

    public static int assign(KafkaTopicPartition partition, int numParallelSubtasks) {
        int startIndex = ((partition.getTopic().hashCode() * 31) & 0x7FFFFFFF) % numParallelSubtasks;

        // here, the assumption is that the id of Kafka partitions are always ascending
        // starting from 0, and therefore can be used directly as the offset clockwise from the start index
        return (startIndex + partition.getPartition()) % numParallelSubtasks;
    }
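
To see what this formula does for the running example (5 partitions, source parallelism 1 and then 2), the sketch below applies the same arithmetic to a plain topic name and partition id. The class PartitionAssignmentSketch, the helper distribute, and the topic name "orders" are hypothetical and used only for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration of the modulo-based partition assignment shown above. */
class PartitionAssignmentSketch {
    /** Same arithmetic as the assign method above, on a plain topic name and partition id. */
    static int assign(String topic, int partition, int numParallelSubtasks) {
        int startIndex = ((topic.hashCode() * 31) & 0x7FFFFFFF) % numParallelSubtasks;
        return (startIndex + partition) % numParallelSubtasks;
    }

    /** Lists the partitions each subtask owns for a topic with numPartitions partitions. */
    static List<List<Integer>> distribute(String topic, int numPartitions, int parallelism) {
        List<List<Integer>> owned = new ArrayList<>();
        for (int i = 0; i < parallelism; i++) {
            owned.add(new ArrayList<>());
        }
        for (int p = 0; p < numPartitions; p++) {
            owned.get(assign(topic, p, parallelism)).add(p);
        }
        return owned;
    }

    public static void main(String[] args) {
        // With one subtask, every partition lands on subtask 0.
        System.out.println(distribute("orders", 5, 1)); // [[0, 1, 2, 3, 4]]
        // With two subtasks, consecutive partitions alternate between the subtasks.
        System.out.println(distribute("orders", 5, 2));
    }
}
```

Because consecutive partition ids map to consecutive subtasks, the partitions are spread as evenly as possible; which subtask receives partition 0 depends on the hash of the topic name.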

Figure: state transfer when the source parallelism is adjusted.

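The parallelism of a Source must not be scaled beyond the number of physical partitions of the external storage. Apache Flink's current approach is to fail early; even without an error, the extra parallelism would be a waste of resources, because subtasks beyond the number of partitions would never be assigned a partition to manage.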

How KeyedState handles rescaling

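For KeyedState, the most obvious approach is to assign state by hash(key) mod parallelism(operator), just as for OperatorState. With this scheme, most of the state an instance restores is not the state it already holds locally, so it must be copied over the network, which is inefficient. OperatorState can afford this simple approach because its state is usually small (only partition IDs and consumed offsets), so fetching it over the network is cheap. KeyedState is often large; with modulo assignment the vast majority of the state would be reassigned to other nodes, and fetching it over the network would be expensive. Apache Flink therefore assigns KeyedState by Key-Groups.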
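
A quick way to see how much state a plain modulo scheme would move is to count the keys whose owner changes. The sketch below assumes, purely for illustration, that key hashes are the integers 0 to numKeys - 1; ModuloRescaleSketch and movedKeys are hypothetical names.

```java
/** Hypothetical illustration: how many keys change owner under hash mod parallelism. */
class ModuloRescaleSketch {
    /** Counts key hashes in [0, numKeys) whose owning subtask changes with the parallelism. */
    static int movedKeys(int numKeys, int oldParallelism, int newParallelism) {
        int moved = 0;
        for (int keyHash = 0; keyHash < numKeys; keyHash++) {
            if (keyHash % oldParallelism != keyHash % newParallelism) {
                moved++;
            }
        }
        return moved;
    }

    public static void main(String[] args) {
        // Scaling from 2 to 3 subtasks moves two thirds of the keys.
        System.out.println(movedKeys(600, 2, 3)); // 400
    }
}
```

Under this assumption only the keys whose hash is 0 or 1 modulo 6 keep their owner, so two thirds of the state would have to be fetched from another node.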

What are Key-Groups

Key-Groups are how Apache Flink groups keyed state by key. Each key-group contains N > 0 keys, and a key-group is the atomic unit of State assignment. The object that represents Key-Groups in Apache Flink is KeyGroupRange:

    // Excerpt: the remaining members, such as getNumberOfKeyGroups(), are omitted.
    public class KeyGroupRange implements KeyGroupsList, Serializable {
        private final int startKeyGroup;  // first key-group in the range (inclusive)
        private final int endKeyGroup;    // last key-group in the range (inclusive)

        public KeyGroupRange(int startKeyGroup, int endKeyGroup) {
            Preconditions.checkArgument(startKeyGroup >= 0);
            Preconditions.checkArgument(startKeyGroup <= endKeyGroup);
            this.startKeyGroup = startKeyGroup;
            this.endKeyGroup = endKeyGroup;
            Preconditions.checkArgument(getNumberOfKeyGroups() >= 0, "Potential overflow detected.");
        }
    }

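The two important fields of KeyGroupRange are startKeyGroup and endKeyGroup. Once they are defined, the number of Key-Groups on the operator instance is determined as well: both bounds are inclusive, so the range holds endKeyGroup - startKeyGroup + 1 Key-Groups.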

What determines the number of Key-Groups

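The number of key-groups must be fixed before the job starts and cannot change while it runs. Because a key-group is the atomic unit of state assignment and each parallel operator instance holds at least one key-group, the parallelism of an operator cannot exceed the configured number of key-groups. In Apache Flink's implementation, the number of key-groups is therefore simply the value of the maximum parallelism.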

How a key is assigned to a Key-Group

Once the KeyGroupRange is fixed, how is each key mapped to a Key-Group? By taking a modulo: the assignToKeyGroup method in KeyGroupRangeAssignment places a key in a key-group as follows:

    /**
     * Assigns the given key to a key-group index.
     *
     * @param key the key to assign
     * @param maxParallelism the maximum supported parallelism, aka the number of key-groups.
     * @return the key-group to which the given key is assigned
     */
    public static int assignToKeyGroup(Object key, int maxParallelism) {
        return computeKeyGroupForKeyHash(key.hashCode(), maxParallelism);
    }

    /**
     * Assigns the given key to a key-group index.
     *
     * @param keyHash the hash of the key to assign
     * @param maxParallelism the maximum supported parallelism, aka the number of key-groups.
     * @return the key-group to which the given key is assigned
     */
    public static int computeKeyGroupForKeyHash(int keyHash, int maxParallelism) {
        return MathUtils.murmurHash(keyHash) % maxParallelism;
    }

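From this implementation we can see that a key is assigned to a key-group by applying murmurHash to the key's hashCode and taking the result modulo maxParallelism. The figure shows how keys on the stream map to key-groups when parallelism=2 and maxParallelism=10.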

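In the figure, the hashCode of key(a) is 97; taken modulo the maximum parallelism of 10 this gives 7, so the key is assigned to KG-7. (This worked example skips the murmurHash step in the code above to keep the arithmetic simple.) Every event on the stream is assigned to exactly one of the Key-Groups KG-0 to KG-9.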
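
The two variants, the simplified rule from the worked example and a hashed version closer to the code above, can be compared in a small sketch. The mix function below is a murmur-style integer mixer written for this illustration; it stands in for MathUtils.murmurHash and is not guaranteed to produce the same values as Flink. All class and method names are hypothetical.

```java
/** Hypothetical illustration of mapping a key to a key-group. */
class KeyGroupAssignmentSketch {
    /** Murmur-style integer mixer; an assumed stand-in for MathUtils.murmurHash. */
    static int mix(int code) {
        code *= 0xcc9e2d51;
        code = Integer.rotateLeft(code, 15);
        code *= 0x1b873593;
        code = Integer.rotateLeft(code, 13);
        code = code * 5 + 0xe6546b64;
        code ^= 4;
        code ^= code >>> 16;
        code *= 0x85ebca6b;
        code ^= code >>> 13;
        code *= 0xc2b2ae35;
        code ^= code >>> 16;
        // Fold into the non-negative range so the modulo below is never negative.
        return code == Integer.MIN_VALUE ? 0 : Math.abs(code);
    }

    /** Hashed variant: scramble the hashCode first, then take the modulo. */
    static int assignToKeyGroup(Object key, int maxParallelism) {
        return mix(key.hashCode()) % maxParallelism;
    }

    /** Simplified rule from the worked example: hashCode modulo maxParallelism. */
    static int simplifiedKeyGroup(Object key, int maxParallelism) {
        return Math.floorMod(key.hashCode(), maxParallelism);
    }

    public static void main(String[] args) {
        System.out.println("a".hashCode());                // 97
        System.out.println(simplifiedKeyGroup("a", 10));   // 7
        System.out.println(assignToKeyGroup("a", 10));     // some value in 0..9
    }
}
```

The extra mixing step matters because hashCode values are often poorly distributed (small integers, for example, hash to themselves), and a plain modulo would then load the key-groups unevenly.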

How each operator instance obtains its Key-Groups

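Now that we understand what Key-Groups are and how each key is assigned to one, let us look at how the Key-Groups handled by each operator instance are computed. The computeKeyGroupRangeForOperatorIndex method in KeyGroupRangeAssignment describes the algorithm: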

    /**
     * Computes the range of key-groups that are assigned to a given operator under the given parallelism and maximum
     * parallelism.
     *
     * IMPORTANT: maxParallelism must be <= Short.MAX_VALUE to avoid rounding problems in this method. If we ever want
     * to go beyond this boundary, this method must perform arithmetic on long values.
     *
     * @param maxParallelism Maximal parallelism that the job was initially created with.
     * @param parallelism    The current parallelism under which the job runs. Must be <= maxParallelism.
     * @param operatorIndex  Index of the parallel operator instance. 0 <= operatorIndex < parallelism.
     * @return the computed key-group range for the operator.
     */
    public static KeyGroupRange computeKeyGroupRangeForOperatorIndex(
        int maxParallelism,
        int parallelism,
        int operatorIndex) {

        checkParallelismPreconditions(parallelism);
        checkParallelismPreconditions(maxParallelism);

        Preconditions.checkArgument(maxParallelism >= parallelism,
            "Maximum parallelism must not be smaller than parallelism.");
        // Key-group range [start, end], both inclusive, assigned to this operator instance
        int start = ((operatorIndex * maxParallelism + parallelism - 1) / parallelism);
        int end = ((operatorIndex + 1) * maxParallelism - 1) / parallelism;
        return new KeyGroupRange(start, end);
    }

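The core logic of the code above is to give each operator instance a contiguous range of Key-Groups, with range sizes that differ by at most one: every instance receives at least maxParallelism / parallelism Key-Groups (integer division), and the Key-Groups left over when the division is not exact are handed out one each to some of the instances. The Key-Groups managed by each operator instance are represented by a KeyGroupRange, which is essentially an interval. Using the running example, here is how the assignment works and how it changes after scaling out.
Suppose the Stateful Operation node has maxParallelism = 10, that is, 10 Key-Groups in total. With a parallelism of 2, Task0 manages KG-0 to KG-4 and Task1 manages KG-5 to KG-9. With a parallelism of 3, Task0 manages KG-0 to KG-3, Task1 manages KG-4 to KG-6, and Task2 manages KG-7 to KG-9.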
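
The distribution for maxParallelism = 10 can be reproduced with a few lines of plain Java that reuse the start/end arithmetic of computeKeyGroupRangeForOperatorIndex. The class KeyGroupRangeSketch and the method rangeFor are hypothetical; only the two formulas are taken from the code above.

```java
/** Hypothetical illustration of the key-group range arithmetic shown above. */
class KeyGroupRangeSketch {
    /** Returns {start, end}, both inclusive, for one operator instance. */
    static int[] rangeFor(int maxParallelism, int parallelism, int operatorIndex) {
        int start = (operatorIndex * maxParallelism + parallelism - 1) / parallelism;
        int end = ((operatorIndex + 1) * maxParallelism - 1) / parallelism;
        return new int[] {start, end};
    }

    public static void main(String[] args) {
        for (int parallelism : new int[] {2, 3}) {
            for (int i = 0; i < parallelism; i++) {
                int[] r = rangeFor(10, parallelism, i);
                System.out.println("parallelism=" + parallelism
                        + " Task" + i + " -> KG-" + r[0] + "..KG-" + r[1]);
            }
        }
        // parallelism=2 Task0 -> KG-0..KG-4
        // parallelism=2 Task1 -> KG-5..KG-9
        // parallelism=3 Task0 -> KG-0..KG-3
        // parallelism=3 Task1 -> KG-4..KG-6
        // parallelism=3 Task2 -> KG-7..KG-9
    }
}
```

Comparing the two outputs, scaling from 2 to 3 moves KG-4 from Task0 to Task1 and KG-7 to KG-9 from Task1 to the new Task2; the other six Key-Groups stay where they were.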

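With this algorithm, most state stays local when scaling out (which is the goal): Task0, for example, gives up only KG-4 and keeps the rest locally. It also follows that changing a job's maxParallelism directly changes the number of Key-Groups and the assignment of keys, and scrambles the entire Key-Group assignment. Apache Flink currently derives the default maxParallelism as follows: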

  Default maximum parallelism:
  128 : for all parallelism <= 128.
  // rounded up to the next power of two
  MIN(nextPowerOfTwo(parallelism + (parallelism / 2)), 2^15) : for all parallelism > 128.
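
The rule can be written out as a short sketch. It follows the two cases exactly as stated above; the actual implementation inside Flink may differ in detail, and the names DefaultMaxParallelismSketch, nextPowerOfTwo, and defaultMaxParallelism are hypothetical.

```java
/** Hypothetical illustration of the default maxParallelism rule as stated above. */
class DefaultMaxParallelismSketch {
    static final int LOWER_BOUND = 128;      // 2^7
    static final int UPPER_BOUND = 1 << 15;  // 2^15 = 32768

    /** Rounds a positive value up to the nearest power of two. */
    static int nextPowerOfTwo(int x) {
        int highest = Integer.highestOneBit(x);
        return highest == x ? x : highest << 1;
    }

    static int defaultMaxParallelism(int parallelism) {
        if (parallelism <= LOWER_BOUND) {
            return LOWER_BOUND;
        }
        return Math.min(nextPowerOfTwo(parallelism + parallelism / 2), UPPER_BOUND);
    }

    public static void main(String[] args) {
        System.out.println(defaultMaxParallelism(4));     // 128
        System.out.println(defaultMaxParallelism(200));   // 512
        System.out.println(defaultMaxParallelism(30000)); // 32768
    }
}
```

Because the derived value depends on the parallelism at first start, setting maxParallelism explicitly is the safer choice for a job that is expected to be rescaled later.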

References:
Apache Flink 漫谈系列(04) - State
Production Readiness Checklist