Flink DataStream API: Partitioning

Random partitioning: distributes elements to partitions uniformly at random.

dataStream.shuffle()

Rebalancing: repartitions the dataset round-robin so the load is rebalanced, eliminating data skew.

dataStream.rebalance()

The key code in the source shows that records are redistributed by cycling through the output channels, so the partitions end up fully balanced:

import org.apache.flink.annotation.Internal;
import org.apache.flink.runtime.plugable.SerializationDelegate;
import org.apache.flink.streaming.runtime.streamrecord.StreamRecord;

import java.util.concurrent.ThreadLocalRandom;

/**
 * Partitioner that distributes the data equally by cycling through the output
 * channels.
 *
 * @param <T> Type of the elements in the Stream being rebalanced
 */
@Internal
public class RebalancePartitioner<T> extends StreamPartitioner<T> {
	private static final long serialVersionUID = 1L;

	private final int[] returnArray = {Integer.MAX_VALUE - 1};

	@Override
	public int[] selectChannels(
			SerializationDelegate<StreamRecord<T>> record,
			int numChannels) {
		int newChannel = ++returnArray[0];
		if (newChannel >= numChannels) {
			returnArray[0] = resetValue(numChannels, newChannel);
		}
		return returnArray;
	}

	private static int resetValue(
			int numChannels,
			int newChannel) {
		if (newChannel == Integer.MAX_VALUE) {
			// Initializes the first partition, this branch is only entered when initializing.
			return ThreadLocalRandom.current().nextInt(numChannels);
		}
		return 0;
	}

	public StreamPartitioner<T> copy() {
		return this;
	}

	@Override
	public String toString() {
		return "REBALANCE";
	}
}
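
A minimal usage sketch of rebalance() (the generated sequence source, the class name RebalanceExample, and the parallelism values are illustrative, not taken from the original demo): records from a single-parallelism source are spread round-robin over all four map subtasks.

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class RebalanceExample {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // A potentially skewed source: all records come from one subtask.
        DataStream<Long> source = env.generateSequence(1, 1000).setParallelism(1);
        // rebalance() redistributes the records round-robin over all 4 map subtasks.
        source.rebalance()
              .map(new MapFunction<Long, Long>() {
                  @Override
                  public Long map(Long value) {
                      return value * 2;
                  }
              })
              .setParallelism(4)
              .print();
        env.execute("RebalanceExample");
    }
}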

Rescaling: if the upstream operator has a parallelism of 2 and the downstream operator has a parallelism of 4, one upstream subtask sends its output to two of the downstream subtasks, and the other upstream subtask sends its output to the other two. Conversely, if the downstream operator has a parallelism of 2 and the upstream operator has a parallelism of 4, the output of two upstream subtasks goes to one downstream subtask, and the output of the other two goes to the other downstream subtask.
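
The corresponding call, following the same pattern as shuffle() and rebalance():

dataStream.rescale()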

The difference between Rescaling and Rebalancing: Rebalancing produces a full repartition (every upstream subtask can send to every downstream subtask), whereas Rescaling does not; it only redistributes within local groups of subtasks.
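
A minimal sketch of rescale() (again with an illustrative generated source and parallelism values): the source runs with parallelism 2 and the map with parallelism 4, so each source subtask forwards only to two of the four map subtasks.

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class RescaleExample {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Upstream: 2 parallel source subtasks.
        DataStream<Long> source = env.generateSequence(1, 1000).setParallelism(2);
        // rescale() distributes round-robin, but only to a local subset of
        // downstream subtasks: each source subtask feeds 2 of the 4 mappers.
        source.rescale()
              .map(new MapFunction<Long, Long>() {
                  @Override
                  public Long map(Long value) {
                      return value * 2;
                  }
              })
              .setParallelism(4)
              .print();
        env.execute("RescaleExample");
    }
}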

Custom partitioning: user-defined partitioning (requires implementing the Partitioner interface).

dataStream.partitionCustom(partitioner, "someKey") or dataStream.partitionCustom(partitioner, 0);

Custom partitioning demo:

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.tuple.Tuple1;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.api.common.functions.Partitioner;
import org.apache.flink.streaming.api.functions.source.SourceFunction;

public class TestCustomPartition {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(2);
        // Non-parallel source that emits an increasing Long once per second.
        DataStream<Long> text = env.addSource(new NoParalleSource());
        // Wrap each Long in a Tuple1 so it can be partitioned by field index 0.
        SingleOutputStreamOperator<Tuple1<Long>> map = text.map(new MapFunction<Long, Tuple1<Long>>() {
            @Override
            public Tuple1<Long> map(Long value) throws Exception {
                return new Tuple1<>(value);
            }
        });
        // Route each record through the custom partitioner, keyed on tuple field 0.
        DataStream<Tuple1<Long>> stream = map.partitionCustom(new MyPartition(), 0);
        SingleOutputStreamOperator<Long> result = stream.map(new MapFunction<Tuple1<Long>, Long>() {
            @Override
            public Long map(Tuple1<Long> value) throws Exception {
                System.out.println("current thread id: " + Thread.currentThread().getId() + ", value: " + value);
                return value.getField(0);
            }
        });
        result.print();
        env.execute("TestCustomPartition");
    }

    // Custom partitioner: even keys go to partition 0, odd keys to partition 1.
    public static class MyPartition implements Partitioner<Long> {
        @Override
        public int partition(Long key, int numPartitions) {
            System.out.println("number of partitions: " + numPartitions);
            if (key % 2 == 0) {
                return 0;
            } else {
                return 1;
            }
        }
    }

    // Single-parallelism source emitting 1, 2, 3, ... once per second.
    public static class NoParalleSource implements SourceFunction<Long> {
        private long count = 1;
        private boolean isRun = true;

        @Override
        public void run(SourceContext<Long> ctx) throws Exception {
            while (isRun) {
                ctx.collect(count++);
                Thread.sleep(1000);
            }
        }

        @Override
        public void cancel() {
            isRun = false;
        }
    }
}
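
partitionCustom also accepts a KeySelector instead of a positional field index. A sketch of equivalent wiring for the demo above (it replaces the map.partitionCustom(new MyPartition(), 0) line and assumes an additional import of org.apache.flink.api.java.functions.KeySelector):

// Equivalent to map.partitionCustom(new MyPartition(), 0), keyed via a KeySelector.
DataStream<Tuple1<Long>> stream = map.partitionCustom(
        new MyPartition(),
        new KeySelector<Tuple1<Long>, Long>() {
            @Override
            public Long getKey(Tuple1<Long> value) throws Exception {
                return value.f0;
            }
        });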
