聊聊flink KeyedStream的reduce操作

本文主要研究一下flink KeyedStream的reduce操作

实例

    @Test
    public void testWordCount() throws Exception {
        // Checking input parameters
//        final ParameterTool params = ParameterTool.fromArgs(args);

        // set up the execution environment
        final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // make parameters available in the web interface
//        env.getConfig().setGlobalJobParameters(params);

        // get input data
        DataStream text = env.fromElements(WORDS);

        DataStream> counts =
                // split up the lines in pairs (2-tuples) containing: (word,1)
                text.flatMap(new Tokenizer())
                        // group by the tuple field "0" and sum up tuple field "1"
                        .keyBy(0)
                        .reduce(new ReduceFunction>() {
                            @Override
                            public Tuple2 reduce(Tuple2 value1, Tuple2 value2) {
                                System.out.println("value1:"+value1.f1+";value2:"+value2.f1);
                                return new Tuple2<>(value1.f0, value1.f1 + value2.f1);
                            }
                        });

        // emit result
        System.out.println("Printing result to stdout. Use --output to specify output path.");
        counts.print();

        // execute program
        env.execute("Streaming WordCount");
    }
  • 这里对KeyedStream进行reduce操作,自定义了ReduceFunction,在reduce方法里头累加word的计数

KeyedStream.reduce

flink-streaming-java_2.11-1.7.0-sources.jar!/org/apache/flink/streaming/api/datastream/KeyedStream.java

@Public
public class KeyedStream extends DataStream {
    //......

    /**
     * Applies a reduce transformation on the grouped data stream grouped on by
     * the given key position. The {@link ReduceFunction} will receive input
     * values based on the key value. Only input values with the same key will
     * go to the same reducer.
     *
     * @param reducer
     *            The {@link ReduceFunction} that will be called for every
     *            element of the input values with the same key.
     * @return The transformed DataStream.
     */
    public SingleOutputStreamOperator reduce(ReduceFunction reducer) {
        return transform("Keyed Reduce", getType(), new StreamGroupedReduce(
                clean(reducer), getType().createSerializer(getExecutionConfig())));
    }

    @Override
    @PublicEvolving
    public  SingleOutputStreamOperator transform(String operatorName,
            TypeInformation outTypeInfo, OneInputStreamOperator operator) {

        SingleOutputStreamOperator returnStream = super.transform(operatorName, outTypeInfo, operator);

        // inject the key selector and key type
        OneInputTransformation transform = (OneInputTransformation) returnStream.getTransformation();
        transform.setStateKeySelector(keySelector);
        transform.setStateKeyType(keyType);

        return returnStream;
    }

    //......
}
  • KeyedStream的reduce方法调用了transform方法,而构造的OneInputStreamOperator为StreamGroupedReduce

ReduceFunction

flink-core-1.7.0-sources.jar!/org/apache/flink/api/common/functions/ReduceFunction.java

@Public
@FunctionalInterface
public interface ReduceFunction extends Function, Serializable {

    /**
     * The core method of ReduceFunction, combining two values into one value of the same type.
     * The reduce function is consecutively applied to all values of a group until only a single value remains.
     *
     * @param value1 The first value to combine.
     * @param value2 The second value to combine.
     * @return The combined value of both input values.
     *
     * @throws Exception This method may throw exceptions. Throwing an exception will cause the operation
     *                   to fail and may trigger recovery.
     */
    T reduce(T value1, T value2) throws Exception;
}
  • ReduceFunction定义了reduce方法,它主要是用来将两个同类型的值操作为一个同类型的值,第一个参数为前面reduce的结果,第二参数为当前的元素

Task.run

flink-runtime_2.11-1.7.0-sources.jar!/org/apache/flink/runtime/taskmanager/Task.java

/**
 * The Task represents one execution of a parallel subtask on a TaskManager.
 * A Task wraps a Flink operator (which may be a user function) and
 * runs it, providing all services necessary for example to consume input data,
 * produce its results (intermediate result partitions) and communicate
 * with the JobManager.
 *
 * 

The Flink operators (implemented as subclasses of * {@link AbstractInvokable} have only data readers, -writers, and certain event callbacks. * The task connects those to the network stack and actor messages, and tracks the state * of the execution and handles exceptions. * *

Tasks have no knowledge about how they relate to other tasks, or whether they * are the first attempt to execute the task, or a repeated attempt. All of that * is only known to the JobManager. All the task knows are its own runnable code, * the task's configuration, and the IDs of the intermediate results to consume and * produce (if any). * *

Each Task is run by one dedicated thread. */ public class Task implements Runnable, TaskActions, CheckpointListener { //...... /** * The core work method that bootstraps the task and executes its code. */ @Override public void run() { // ---------------------------- // Initial State transition // ---------------------------- //...... // all resource acquisitions and registrations from here on // need to be undone in the end Map> distributedCacheEntries = new HashMap<>(); AbstractInvokable invokable = null; try { // now load and instantiate the task's invokable code invokable = loadAndInstantiateInvokable(userCodeClassLoader, nameOfInvokableClass, env); // ---------------------------------------------------------------- // actual task core work // ---------------------------------------------------------------- // we must make strictly sure that the invokable is accessible to the cancel() call // by the time we switched to running. this.invokable = invokable; // switch to the RUNNING state, if that fails, we have been canceled/failed in the meantime if (!transitionState(ExecutionState.DEPLOYING, ExecutionState.RUNNING)) { throw new CancelTaskException(); } // notify everyone that we switched to running taskManagerActions.updateTaskExecutionState(new TaskExecutionState(jobId, executionId, ExecutionState.RUNNING)); // make sure the user code classloader is accessible thread-locally executingThread.setContextClassLoader(userCodeClassLoader); // run the invokable invokable.invoke(); //...... } catch (Throwable t) { //...... } finally { //...... } } }

  • Task的run方法会调用invokable.invoke(),这里的invokable为OneInputStreamTask,而OneInputStreamTask继承了StreamTask,这里实际调用的invoke()方法是StreamTask里头的

StreamTask.invoke

flink-streaming-java_2.11-1.7.0-sources.jar!/org/apache/flink/streaming/runtime/tasks/StreamTask.java

@Internal
public abstract class StreamTask>
        extends AbstractInvokable
        implements AsyncExceptionHandler {

    //......

    protected abstract void run() throws Exception;

    @Override
    public final void invoke() throws Exception {

        boolean disposed = false;
        try {
            // -------- Initialize ---------
            LOG.debug("Initializing {}.", getName());

            asyncOperationsThreadPool = Executors.newCachedThreadPool();

            CheckpointExceptionHandlerFactory cpExceptionHandlerFactory = createCheckpointExceptionHandlerFactory();

            synchronousCheckpointExceptionHandler = cpExceptionHandlerFactory.createCheckpointExceptionHandler(
                getExecutionConfig().isFailTaskOnCheckpointError(),
                getEnvironment());

            asynchronousCheckpointExceptionHandler = new AsyncCheckpointExceptionHandler(this);

            stateBackend = createStateBackend();
            checkpointStorage = stateBackend.createCheckpointStorage(getEnvironment().getJobID());

            // if the clock is not already set, then assign a default TimeServiceProvider
            if (timerService == null) {
                ThreadFactory timerThreadFactory = new DispatcherThreadFactory(TRIGGER_THREAD_GROUP,
                    "Time Trigger for " + getName(), getUserCodeClassLoader());

                timerService = new SystemProcessingTimeService(this, getCheckpointLock(), timerThreadFactory);
            }

            operatorChain = new OperatorChain<>(this, streamRecordWriters);
            headOperator = operatorChain.getHeadOperator();

            // task specific initialization
            init();

            // save the work of reloading state, etc, if the task is already canceled
            if (canceled) {
                throw new CancelTaskException();
            }

            // -------- Invoke --------
            LOG.debug("Invoking {}", getName());

            // we need to make sure that any triggers scheduled in open() cannot be
            // executed before all operators are opened
            synchronized (lock) {

                // both the following operations are protected by the lock
                // so that we avoid race conditions in the case that initializeState()
                // registers a timer, that fires before the open() is called.

                initializeState();
                openAllOperators();
            }

            // final check to exit early before starting to run
            if (canceled) {
                throw new CancelTaskException();
            }

            // let the task do its work
            isRunning = true;
            run();

            // if this left the run() method cleanly despite the fact that this was canceled,
            // make sure the "clean shutdown" is not attempted
            if (canceled) {
                throw new CancelTaskException();
            }

            LOG.debug("Finished task {}", getName());

            //......
        }
        finally {
            //......
        }
    }
}
  • StreamTask的invoke方法会调用run方法,该方法为抽象方法,由子类实现,这里就是OneInputStreamTask的run方法

OneInputStreamTask.run

flink-streaming-java_2.11-1.7.0-sources.jar!/org/apache/flink/streaming/runtime/tasks/OneInputStreamTask.java

@Internal
public class OneInputStreamTask extends StreamTask> {

    private StreamInputProcessor inputProcessor;

    private volatile boolean running = true;

    private final WatermarkGauge inputWatermarkGauge = new WatermarkGauge();

    /**
     * Constructor for initialization, possibly with initial state (recovery / savepoint / etc).
     *
     * @param env The task environment for this task.
     */
    public OneInputStreamTask(Environment env) {
        super(env);
    }

    /**
     * Constructor for initialization, possibly with initial state (recovery / savepoint / etc).
     *
     * 

This constructor accepts a special {@link ProcessingTimeService}. By default (and if * null is passes for the time provider) a {@link SystemProcessingTimeService DefaultTimerService} * will be used. * * @param env The task environment for this task. * @param timeProvider Optionally, a specific time provider to use. */ @VisibleForTesting public OneInputStreamTask( Environment env, @Nullable ProcessingTimeService timeProvider) { super(env, timeProvider); } @Override public void init() throws Exception { StreamConfig configuration = getConfiguration(); TypeSerializer inSerializer = configuration.getTypeSerializerIn1(getUserCodeClassLoader()); int numberOfInputs = configuration.getNumberOfInputs(); if (numberOfInputs > 0) { InputGate[] inputGates = getEnvironment().getAllInputGates(); inputProcessor = new StreamInputProcessor<>( inputGates, inSerializer, this, configuration.getCheckpointMode(), getCheckpointLock(), getEnvironment().getIOManager(), getEnvironment().getTaskManagerInfo().getConfiguration(), getStreamStatusMaintainer(), this.headOperator, getEnvironment().getMetricGroup().getIOMetricGroup(), inputWatermarkGauge); } headOperator.getMetricGroup().gauge(MetricNames.IO_CURRENT_INPUT_WATERMARK, this.inputWatermarkGauge); // wrap watermark gauge since registered metrics must be unique getEnvironment().getMetricGroup().gauge(MetricNames.IO_CURRENT_INPUT_WATERMARK, this.inputWatermarkGauge::getValue); } @Override protected void run() throws Exception { // cache processor reference on the stack, to make the code more JIT friendly final StreamInputProcessor inputProcessor = this.inputProcessor; while (running && inputProcessor.processInput()) { // all the work happens in the "processInput" method } } @Override protected void cleanup() throws Exception { if (inputProcessor != null) { inputProcessor.cleanup(); } } @Override protected void cancelTask() { running = false; } }

  • OneInputStreamTask的run方法会不断循环调用inputProcessor.processInput(),inputProcessor这里为StreamInputProcessor

StreamInputProcessor.processInput

flink-streaming-java_2.11-1.7.0-sources.jar!/org/apache/flink/streaming/runtime/io/StreamInputProcessor.java

@Internal
public class StreamInputProcessor {

    //......

    public boolean processInput() throws Exception {
        if (isFinished) {
            return false;
        }
        if (numRecordsIn == null) {
            try {
                numRecordsIn = ((OperatorMetricGroup) streamOperator.getMetricGroup()).getIOMetricGroup().getNumRecordsInCounter();
            } catch (Exception e) {
                LOG.warn("An exception occurred during the metrics setup.", e);
                numRecordsIn = new SimpleCounter();
            }
        }

        while (true) {
            if (currentRecordDeserializer != null) {
                DeserializationResult result = currentRecordDeserializer.getNextRecord(deserializationDelegate);

                if (result.isBufferConsumed()) {
                    currentRecordDeserializer.getCurrentBuffer().recycleBuffer();
                    currentRecordDeserializer = null;
                }

                if (result.isFullRecord()) {
                    StreamElement recordOrMark = deserializationDelegate.getInstance();

                    if (recordOrMark.isWatermark()) {
                        // handle watermark
                        statusWatermarkValve.inputWatermark(recordOrMark.asWatermark(), currentChannel);
                        continue;
                    } else if (recordOrMark.isStreamStatus()) {
                        // handle stream status
                        statusWatermarkValve.inputStreamStatus(recordOrMark.asStreamStatus(), currentChannel);
                        continue;
                    } else if (recordOrMark.isLatencyMarker()) {
                        // handle latency marker
                        synchronized (lock) {
                            streamOperator.processLatencyMarker(recordOrMark.asLatencyMarker());
                        }
                        continue;
                    } else {
                        // now we can do the actual processing
                        StreamRecord record = recordOrMark.asRecord();
                        synchronized (lock) {
                            numRecordsIn.inc();
                            streamOperator.setKeyContextElement1(record);
                            streamOperator.processElement(record);
                        }
                        return true;
                    }
                }
            }

            //......
        }
    }

    //......
}
  • StreamInputProcessor的processInput方法,会在while true循环里头不断处理nextRecord,这里根据StreamElement的不同类型做不同处理,如果是普通的数据,则调用streamOperator.processElement进行处理,这里的streamOperator为StreamGroupedReduce

StreamGroupedReduce.processElement

flink-streaming-java_2.11-1.7.0-sources.jar!/org/apache/flink/streaming/api/operators/StreamGroupedReduce.java

/**
 * A {@link StreamOperator} for executing a {@link ReduceFunction} on a
 * {@link org.apache.flink.streaming.api.datastream.KeyedStream}.
 */

@Internal
public class StreamGroupedReduce extends AbstractUdfStreamOperator>
        implements OneInputStreamOperator {

    private static final long serialVersionUID = 1L;

    private static final String STATE_NAME = "_op_state";

    private transient ValueState values;

    private TypeSerializer serializer;

    public StreamGroupedReduce(ReduceFunction reducer, TypeSerializer serializer) {
        super(reducer);
        this.serializer = serializer;
    }

    @Override
    public void open() throws Exception {
        super.open();
        ValueStateDescriptor stateId = new ValueStateDescriptor<>(STATE_NAME, serializer);
        values = getPartitionedState(stateId);
    }

    @Override
    public void processElement(StreamRecord element) throws Exception {
        IN value = element.getValue();
        IN currentValue = values.value();

        if (currentValue != null) {
            IN reduced = userFunction.reduce(currentValue, value);
            values.update(reduced);
            output.collect(element.replace(reduced));
        } else {
            values.update(value);
            output.collect(element.replace(value));
        }
    }
}
  • StreamGroupedReduce使用了ValueState存储reduce操作的结果值,在processElement方法里头调用userFunction的reduce操作,userFunction就是用户自定义的ReduceFunction,而reduce的第一个参数就是ValueState的value,即上一次reduce操作的结果值,然后第二个参数就当前element的value;而在执行完userFunction的reduce操作之后,会将该结果update到ValueState

小结

  • KeyedStream的reduce方法,里头调用了transform方法,而构造的OneInputStreamOperator为StreamGroupedReduce;reduce方法接收的是ReduceFunction,它定义了reduce方法,用来将两个同类型的值操作为一个同类型的值
  • Task的run方法会调用invokable.invoke(),这里的invokable为OneInputStreamTask,而OneInputStreamTask继承了StreamTask,这里实际调用的invoke()方法是StreamTask里头的;StreamTask的invoke方法会调用run方法,该方法为抽象方法,由子类实现,这里就是OneInputStreamTask的run方法;OneInputStreamTask的run方法,会不断循环调用inputProcessor.processInput(),inputProcessor这里为StreamInputProcessor;StreamInputProcessor的processInput方法,会在while true循环里头不断处理nextRecord,这里根据StreamElement的不同类型做不同处理,如果是普通的数据,则调用streamOperator.processElement进行处理,这里的streamOperator为StreamGroupedReduce
  • StreamGroupedReduce的processElement方法会调用userFunction的reduce操作,第一个参数就是ValueState的value,即上一次reduce操作的结果值,然后第二个参数就当前element的value;而在执行完userFunction的reduce操作之后,会将该结果update到ValueState

doc

  • datastream-transformations

你可能感兴趣的:(聊聊flink KeyedStream的reduce操作)