Flink series: connector source code analysis

1. Concept

Simply put, a connector is an external data source (or sink) that a Flink job connects to.

2. Supported data source types

[Image: connector list from the official Flink documentation]

The screenshot above comes from the official documentation. You can see that Flink does not ship connectors for every common data source, and even for the supported ones, source and sink support do not always both exist. Why is that?

  1. If you understand how TaskManagers exchange data in the Flink architecture, this is not hard to see. A TaskManager generally has two kinds of gates (input and output); each TaskManager's input gate talks to the upstream TaskManager's output gate via Netty, and as soon as a buffer of data is available the upstream can send it downstream (there are of course several emission strategies, e.g. accumulating a batch before sending).
  2. The way a source node ingests data follows the same idea as how other operator nodes exchange data: it is an append model. Put simply, records are appended one at a time, rather than being processed in one shot and handed to the downstream as in batch processing.
  3. Now think about Kafka's messaging model: a queue is clearly an append model, with data appended at the tail and consumed from the head. So Kafka is a very natural fit as both source and sink: a source node can consume records one by one, and after processing, records can be appended one by one.
  4. Look at which sources appear in the screenshot (Kafka, RabbitMQ, NiFi, Twitter Streaming API, PubSub, ActiveMQ, Netty). Seeing the three message queues Kafka, RabbitMQ and ActiveMQ, you can tell that Flink can support virtually any message queue; other queueing systems could certainly be supported too, they just have not been implemented. Then there is Netty: Netty is a network communication framework aimed at high-performance data transfer, not a data source itself, so it presumably acts as the communication medium between some data source and Flink; I will not discuss it here (I have not analyzed the Netty source yet). The remaining ones are all oriented toward streaming data and therefore integrate naturally with a stream processor like Flink.
  5. Now look at the sinks in the screenshot (Kafka, Cassandra, Kinesis, Elasticsearch, HDFS, RabbitMQ, NiFi, PubSub, JDBC, ActiveMQ, Flume, Akka, Redis, Netty). Notice that the range of supported sinks is much wider. Why? Because Flink is a stream processor, its input naturally arrives as dynamically appended records; but once a record has been processed, it can simply be written into a static store, and there is no requirement to append it record by record into an append-only system.

3. Analysis

With this background in place, let's analyze the sources and sinks. Of course we cannot cover all of them, that would be exhausting; I will analyze one source (Kafka) and one sink (Elasticsearch). Once you understand one source and one sink, the principles behind the others are quick to pick up. Enough talk, let's get to the good stuff.

First, let's look at the structure of a source (this part matters).

A source node carries the heavy responsibility of connecting to an external data source, so it naturally needs more functionality. In general, a source node involves two interfaces: RichFunction and SourceFunction. Flink provides AbstractRichFunction, which implements RichFunction, so to implement a source node you only need to extend AbstractRichFunction and implement SourceFunction. For Kafka, Flink provides the class FlinkKafkaConsumerBase, and that class is our source node:

public abstract class FlinkKafkaConsumerBase<T> extends RichParallelSourceFunction<T> implements
		CheckpointListener,
		ResultTypeQueryable<T>,
		CheckpointedFunction {}


public abstract class RichParallelSourceFunction<OUT> extends AbstractRichFunction
		implements ParallelSourceFunction<OUT> {

	private static final long serialVersionUID = 1L;
}

public interface ParallelSourceFunction<OUT> extends SourceFunction<OUT> {
}

From the code you can see that FlinkKafkaConsumerBase extends RichParallelSourceFunction, which extends AbstractRichFunction and implements ParallelSourceFunction, and ParallelSourceFunction in turn extends SourceFunction. So FlinkKafkaConsumerBase fits the description above: it does extend AbstractRichFunction and implement SourceFunction. The most important thing AbstractRichFunction provides is access to the runtime context, which lets the source node retrieve all kinds of runtime information; it also exposes an important lifecycle method, open(), which is executed before the task starts running. SourceFunction has one crucial method, run(), so it is a safe guess that run() is what actually fetches data from the data source. In short: AbstractRichFunction provides the runtime context and the open() lifecycle hook used when starting each task, while SourceFunction holds the data-fetching logic. FlinkKafkaConsumerBase also implements CheckpointListener, CheckpointedFunction and ResultTypeQueryable; as the names suggest, the first two are checkpoint-related and we will not analyze them for now, and ResultTypeQueryable is not important here. With that, let's start analyzing the Kafka source.
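Before diving into FlinkKafkaConsumerBase, here is a minimal sketch of what a hand-written source following this pattern could look like (my own illustrative example with made-up names, not Flink code): open() uses the runtime context provided by AbstractRichFunction, run() appends records one by one, and cancel() stops the loop.

import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.source.RichParallelSourceFunction;

// Hypothetical example source: each subtask emits an increasing counter until cancelled.
public class CounterSource extends RichParallelSourceFunction<Long> {

	private volatile boolean running = true;
	private int subtaskIndex;

	@Override
	public void open(Configuration parameters) throws Exception {
		// AbstractRichFunction gives us access to the runtime context here.
		subtaskIndex = getRuntimeContext().getIndexOfThisSubtask();
	}

	@Override
	public void run(SourceContext<Long> ctx) throws Exception {
		long counter = 0;
		while (running) {
			// Records are appended one at a time, just like the Kafka source does.
			synchronized (ctx.getCheckpointLock()) {
				ctx.collect(subtaskIndex * 1_000_000L + counter++);
			}
			Thread.sleep(100);
		}
	}

	@Override
	public void cancel() {
		running = false;
	}
}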

Let's first look at this demo:

public class Kafka011Example {

	public static void main(String[] args) throws Exception {
		// parse input arguments
		final ParameterTool parameterTool = ParameterTool.fromArgs(args);
		StreamExecutionEnvironment env = KafkaExampleUtil.prepareExecutionEnv(parameterTool);

		DataStream<KafkaEvent> input = env
				.addSource(
					new FlinkKafkaConsumer011<>(
						parameterTool.getRequired("input-topic"),
						new KafkaEventSchema(),
						parameterTool.getProperties())
					.assignTimestampsAndWatermarks(new CustomWatermarkExtractor()))
				.keyBy("word")
				.map(new RollingAdditionMapper());

		input.addSink(
				new FlinkKafkaProducer011<>(
						parameterTool.getRequired("output-topic"),
						new KafkaEventSchema(),
						parameterTool.getProperties()));

		env.execute("Kafka 0.11 Example");
	}

}

We can see that a FlinkKafkaConsumer011 instance is added to the env object. I will only outline the logic here rather than go into detail: after env adds the FlinkKafkaConsumer011, it is wrapped into an operator, which becomes a transformation and finally a node of the JobGraph. FlinkKafkaConsumer011 essentially extends FlinkKafkaConsumerBase. So how does it actually start running? First, the node is packaged into the JobGraph and sent to the cluster, where it is turned into an ExecutionGraph (which adds parallelism). The JobManager then ships the source node as a subtask (assume parallelism 1) to a TaskManager, where it becomes a Task object:

public class Task implements Runnable, TaskSlotPayload, TaskActions, PartitionProducerStateProvider, CheckpointListener, BackPressureSampleableTask {

	/** The class logger. */
	private static final Logger LOG = LoggerFactory.getLogger(Task.class);

	/** The thread group that contains all task threads. */
	private static final ThreadGroup TASK_THREADS_GROUP = new ThreadGroup("Flink Task Threads");

	/** For atomic state updates. */
	private static final AtomicReferenceFieldUpdater<Task, ExecutionState> STATE_UPDATER =
			AtomicReferenceFieldUpdater.newUpdater(Task.class, ExecutionState.class, "executionState");

	// ------------------------------------------------------------------------
	//  Constant fields that are part of the initial Task construction
	// ------------------------------------------------------------------------

	/** The job that the task belongs to. */
	private final JobID jobId;

	/** The vertex in the JobGraph whose code the task executes. */
	private final JobVertexID vertexId;

	/** The execution attempt of the parallel subtask. */
	private final ExecutionAttemptID executionId;

	/** ID which identifies the slot in which the task is supposed to run. */
	private final AllocationID allocationId;

	/** TaskInfo object for this task. */
	private final TaskInfo taskInfo;

	/** The name of the task, including subtask indexes. */
	private final String taskNameWithSubtask;

	/** The job-wide configuration object. */
	private final Configuration jobConfiguration;

	/** The task-specific configuration. */
	private final Configuration taskConfiguration;

	/** The jar files used by this task. */
	private final Collection<PermanentBlobKey> requiredJarFiles;

	/** The classpaths used by this task. */
	private final Collection<URL> requiredClasspaths;

	/** The name of the class that holds the invokable code. */
	private final String nameOfInvokableClass;

	/** Access to task manager configuration and host names. */
	private final TaskManagerRuntimeInfo taskManagerConfig;

	/** The memory manager to be used by this task. */
	private final MemoryManager memoryManager;

	/** The I/O manager to be used by this task. */
	private final IOManager ioManager;

	/** The BroadcastVariableManager to be used by this task. */
	private final BroadcastVariableManager broadcastVariableManager;

	private final TaskEventDispatcher taskEventDispatcher;

	/** The manager for state of operators running in this task/slot. */
	private final TaskStateManager taskStateManager;

	/** Serialized version of the job specific execution configuration (see {@link ExecutionConfig}). */
	private final SerializedValue<ExecutionConfig> serializedExecutionConfig;

	private final ResultPartitionWriter[] consumableNotifyingPartitionWriters;

	private final InputGate[] inputGates;

	/** Connection to the task manager. */
	private final TaskManagerActions taskManagerActions;

	/** Input split provider for the task. */
	private final InputSplitProvider inputSplitProvider;

	/** Checkpoint notifier used to communicate with the CheckpointCoordinator. */
	private final CheckpointResponder checkpointResponder;

	/** GlobalAggregateManager used to update aggregates on the JobMaster. */
	private final GlobalAggregateManager aggregateManager;

	/** The BLOB cache, from which the task can request BLOB files. */
	private final BlobCacheService blobService;

	/** The library cache, from which the task can request its class loader. */
	private final LibraryCacheManager libraryCache;

	/** The cache for user-defined files that the invokable requires. */
	private final FileCache fileCache;

	/** The service for kvState registration of this task. */
	private final KvStateService kvStateService;

	/** The registry of this task which enables live reporting of accumulators. */
	private final AccumulatorRegistry accumulatorRegistry;

	/** The thread that executes the task. */
	private final Thread executingThread;

	/** Parent group for all metrics of this task. */
	private final TaskMetricGroup metrics;

	/** Partition producer state checker to request partition states from. */
	private final PartitionProducerStateChecker partitionProducerStateChecker;

	/** Executor to run future callbacks. */
	private final Executor executor;

	/** Future that is completed once {@link #run()} exits. */
	private final CompletableFuture<ExecutionState> terminationFuture = new CompletableFuture<>();

	// ------------------------------------------------------------------------
	//  Fields that control the task execution. All these fields are volatile
	//  (which means that they introduce memory barriers), to establish
	//  proper happens-before semantics on parallel modification
	// ------------------------------------------------------------------------

	/** atomic flag that makes sure the invokable is canceled exactly once upon error. */
	private final AtomicBoolean invokableHasBeenCanceled;

	/** The invokable of this task, if initialized. All accesses must copy the reference and
	 * check for null, as this field is cleared as part of the disposal logic. */
	@Nullable
	private volatile AbstractInvokable invokable;

	/** The current execution state of the task. */
	private volatile ExecutionState executionState = ExecutionState.CREATED;

	/** The observed exception, in case the task execution failed. */
	private volatile Throwable failureCause;

	/** Initialized from the Flink configuration. May also be set at the ExecutionConfig */
	private long taskCancellationInterval;

	/** Initialized from the Flink configuration. May also be set at the ExecutionConfig */
	private long taskCancellationTimeout;

	/** This class loader should be set as the context class loader for threads that may dynamically load user code. */
	private ClassLoader userCodeClassLoader;

This class is huge; it represents an independently running task. One key field is invokable, the class that does the real work: it assembles the operators, and one of those operators contains our FlinkKafkaConsumerBase. So the key question is how the invokable ends up calling our FlinkKafkaConsumerBase.

Starting from Task.run():

public void run() {
		try {
			doRun();
		} finally {
			terminationFuture.complete(executionState);
		}
	}

private void doRun() {
            ......
			invokable = loadAndInstantiateInvokable(userCodeClassLoader, nameOfInvokableClass, env);
            ......
			invokable.invoke();
            ......
	}

doRun() essentially does one thing: it calls invokable.invoke(), and the answer we are looking for is inside invoke():

public final void invoke() throws Exception {
		try {
			beforeInvoke();
            ......
			runMailboxLoop();
            ......
			afterInvoke();
		}
		finally {
			cleanUpInvoke();
		}
	}

runMailboxLoop() actually runs the task, beforeInvoke() does the initialization before it starts, and afterInvoke() holds the logic that runs after the task has finished. Clearly FlinkKafkaConsumerBase needs to initialize various things before the task starts; the demo above only constructed the FlinkKafkaConsumerBase, it did not initialize it. So let's look at beforeInvoke().

private void beforeInvoke() throws Exception {
		......
		operatorChain = new OperatorChain<>(this, recordWriter);
		headOperator = operatorChain.getHeadOperator();

		// task specific initialization
		// each concrete task type has its own init() logic
		init();
		......
		actionExecutor.runThrowing(() -> {
			
			initializeStateAndOpen();
		});
	}

private void initializeStateAndOpen() throws Exception {

		StreamOperator<?>[] allOperators = operatorChain.getAllOperators();

		for (StreamOperator<?> operator : allOperators) {
			if (null != operator) {
				// initialize the operatorStateBackend and keyedStateBackend
				operator.initializeState();
				// call the operator's open() lifecycle method (which in turn opens the user function)
				operator.open();
			}
		}
	}

public void open() throws Exception {
		super.open();
		FunctionUtils.openFunction(userFunction, new Configuration());
	}

public static void openFunction(Function function, Configuration parameters) throws Exception {
		if (function instanceof RichFunction) {
			RichFunction richFunction = (RichFunction) function;
			richFunction.open(parameters);
		}
	}

You can see this method boils down to init() and initializeStateAndOpen(). init() initializes some Task-level things, which we can ignore; what we care about is the initialization of FlinkKafkaConsumerBase, and the answer lies in initializeStateAndOpen(). That function initializes the source node's state and calls the lifecycle method open() we mentioned earlier (operator.open() calls the function's open()), and in openFunction() we finally see RichFunction's open() being invoked. Let's take a look:

public void open(Configuration configuration) throws Exception {
		......
		this.partitionDiscoverer = createPartitionDiscoverer(
				topicsDescriptor,
				getRuntimeContext().getIndexOfThisSubtask(),
				getRuntimeContext().getNumberOfParallelSubtasks());
		this.partitionDiscoverer.open();
        ......
		if (restoredState != null) {
			for (KafkaTopicPartition partition : allPartitions) {
				if (!restoredState.containsKey(partition)) {
					restoredState.put(partition, KafkaTopicPartitionStateSentinel.EARLIEST_OFFSET);
				}
			}
        ......
	}

In open(), the main work is to create a Kafka partition discoverer; once the Kafka partitions are discovered, they are put into restoredState for checkpointing. The next step, obviously, is to execute the run() method to fetch data. So how does execution reach run()? It is quite simple: as mentioned, beforeInvoke() does some initialization before the task runs, and then runMailboxLoop() is executed (this method eventually calls processInput(), which is where data is really processed):

protected void processInput(MailboxDefaultAction.Controller controller) throws Exception {
        ......
		sourceThread.setTaskDescription(getName());
		sourceThread.start();
		sourceThread.getCompletionFuture().whenComplete((Void ignore, Throwable sourceThreadThrowable) -> {
			if (isCanceled() && ExceptionUtils.findThrowable(sourceThreadThrowable, InterruptedException.class).isPresent()) {
				mailboxProcessor.reportThrowable(new CancelTaskException(sourceThreadThrowable));
			} else if (!isFinished && sourceThreadThrowable != null) {
				mailboxProcessor.reportThrowable(sourceThreadThrowable);
			} else {
				mailboxProcessor.allActionsCompleted();
			}
		});
	}

You can see that processInput() simply starts a thread; let's see what that thread does:

public void run() {
			try {
				headOperator.run(getCheckpointLock(), getStreamStatusMaintainer(), operatorChain);
				completionFuture.complete(null);
			} catch (Throwable t) {
				// Note, t can be also an InterruptedException
				completionFuture.completeExceptionally(t);
			}
		}

See? It just calls the operator's run() method, and unsurprisingly the operator's run() method calls FlinkKafkaConsumerBase's run():

public void run(final Object lockingObject,
			final StreamStatusMaintainer streamStatusMaintainer,
			final Output<StreamRecord<OUT>> collector,
			final OperatorChain<?, ?> operatorChain) throws Exception {

		......
		try {
			userFunction.run(ctx);
        ......
		} finally {
			if (latencyEmitter != null) {
				latencyEmitter.close();
			}
		}
	}

The userFunction here is our FlinkKafkaConsumerBase; let's look at its implementation:

public void run(SourceContext<T> sourceContext) throws Exception {
		......
		this.kafkaFetcher = createFetcher(
				sourceContext,
				subscribedPartitionsToStartOffsets,
				periodicWatermarkAssigner,
				punctuatedWatermarkAssigner,
				(StreamingRuntimeContext) getRuntimeContext(),
				offsetCommitMode,
				getRuntimeContext().getMetricGroup().addGroup(KAFKA_CONSUMER_METRICS_GROUP),
				useMetrics);
        ......
		if (discoveryIntervalMillis == PARTITION_DISCOVERY_DISABLED) {
			kafkaFetcher.runFetchLoop();
		} else {
			runWithPartitionDiscovery();
		}
	}

You can see that run() first creates a fetcher and then calls its runFetchLoop() method; from the name alone it is clear that this class is what fetches the data. Let's look at its implementation:

public void runFetchLoop() throws Exception {
		try {
			final Handover handover = this.handover;
			consumerThread.start();

			while (running) {
				final ConsumerRecords<byte[], byte[]> records = handover.pollNext();

				
				for (KafkaTopicPartitionState<TopicPartition> partition : subscribedPartitionStates()) {

					List<ConsumerRecord<byte[], byte[]>> partitionRecords =
						records.records(partition.getKafkaPartitionHandle());

					for (ConsumerRecord<byte[], byte[]> record : partitionRecords) {
						final T value = deserializer.deserialize(record);

						if (deserializer.isEndOfStream(value)) {
							// end of stream signaled
							running = false;
							break;
						}
						emitRecord(value, partition, record.offset(), record);
					}
				}
			}
		}
		finally {
			consumerThread.shutdown();
		}
		try {
			consumerThread.join();
		}
		catch (InterruptedException e) {
			Thread.currentThread().interrupt();
		}
	}

This logic is very important. Two classes do the work here: the consumer thread and the Handover. The flow is: first the handover is created (you can think of it as a blocking queue), then the consumerThread is started; that thread uses a KafkaConsumer to fetch data from the Kafka brokers and puts what it fetches into the handover. The fetch loop then blocks in a while loop pulling data out of the handover, and for each record it calls emitRecord. Note that emitRecord does not send the record directly to the downstream task: as we said earlier, tasks on TaskManagers exchange data through Netty, so emitRecord hands the record to the output gate, and the output gate deals with the downstream input gate.
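To make the hand-over idea concrete, here is a simplified single-slot blocking exchange (my own illustration with made-up names; Flink's actual Handover additionally propagates errors and supports wake-ups):

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Simplified stand-in for the Handover: the consumer thread produces one batch at a time,
// and the fetch loop blocks until that batch is available.
public class SimpleHandover<T> {

	private final BlockingQueue<T> slot = new ArrayBlockingQueue<>(1);

	// Called by the consumer thread: blocks while the previous batch has not been taken yet.
	public void produce(T batch) throws InterruptedException {
		slot.put(batch);
	}

	// Called by the fetch loop: blocks until the next batch arrives.
	public T pollNext() throws InterruptedException {
		return slot.take();
	}
}

Now for the key question: how does the consumerThread actually get the data from Kafka?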

public void run() {
		......
		final Handover handover = this.handover;
        ......
		try {
			this.consumer = getConsumer(kafkaProperties);
		}
		catch (Throwable t) {
			handover.reportError(t);
			return;
		}
		try {
			......
			ConsumerRecords<byte[], byte[]> records = null;
			List<KafkaTopicPartitionState<TopicPartition>> newPartitions;

			// main fetch loop
			while (running) {
                ......
				if (records == null) {
					try {
						records = consumer.poll(pollTimeout);
					}
					catch (WakeupException we) {
						continue;
					}
				}
				try {
					handover.produce(records);
					records = null;
				}
				catch (Handover.WakeupException e) {
					// fall through the loop
				}
			}
			// end main fetch loop
		}
		......
	}

The logic here is very clear: the consumer pulls data directly and puts it into the handover, and note that each poll returns a whole batch of records, not a single record. As for what this consumer is: as said before, it is a KafkaConsumer, the Kafka client used to talk to the Kafka brokers, and it is created from the Properties supplied by our main program. Seen this way, open() does not seem to play a critical role in fetching. We said that open()'s partitionDiscoverer requests all Kafka partitions and puts them into restoredState, and indeed restoredState is not used while fetching data; it is, however, used for checkpointing. The comment on that field explains that it is first initialized in initializeState() and then, in open(), records the Kafka partitions, offsets and so on. So open() and run() are effectively independent of each other.
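Stripped of the Flink wrapping, the consumer thread's core loop is just standard Kafka client usage. Here is a standalone sketch of that usage (the broker address, topic and group id are made up for illustration; in the Flink job, the equivalent Properties are what the main program hands to the Kafka source):

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class PlainKafkaPollLoop {

	public static void main(String[] args) {
		// The same kind of Properties that the user's main program passes to the Flink Kafka source.
		Properties props = new Properties();
		props.put("bootstrap.servers", "localhost:9092");   // assumption: a local broker
		props.put("group.id", "demo-group");
		props.put("key.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");
		props.put("value.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");

		try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
			consumer.subscribe(Collections.singletonList("input-topic"));
			while (true) {
				// poll() returns a whole batch of records, not a single record.
				ConsumerRecords<byte[], byte[]> records = consumer.poll(Duration.ofMillis(100));
				for (ConsumerRecord<byte[], byte[]> record : records) {
					System.out.println(record.partition() + "@" + record.offset());
				}
			}
		}
	}
}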

To recap the execution flow of the Kafka source:

  1. FlinkKafkaConsumerBase is created in the user's main program, handed to env, wrapped into an operator, and finally packaged into a JobGraph that the client sends to the JobManager.
  2. The JobManager applies parallelism, operator chaining and other optimizations, turning the JobGraph into an ExecutionGraph.
  3. The ExecutionGraph dispatches subtasks to TaskManagers, where each subtask is abstracted as a Task object.
  4. The Task calls the invoke() method of the real execution class, an AbstractInvokable.
  5. invoke() has three phases: beforeInvoke, the run loop, and afterInvoke.
  6. beforeInvoke calls FlinkKafkaConsumerBase's open() method, which records the information needed for checkpoints (e.g. the Kafka partitions and the offsets consumed so far).
  7. The run loop starts a sourceThread that does the actual work of the source node.
  8. The sourceThread calls the operator's run() method, and the operator calls FlinkKafkaConsumerBase's run() method.
  9. FlinkKafkaConsumerBase's run() creates a fetcher that actually fetches the data.
  10. The fetcher relies on two important classes: a consumerThread that pulls the data and a Handover that holds it.
  11. The consumerThread uses a KafkaConsumer (a Kafka client class) directly to fetch data from the brokers.
  12. Fetched records are put into the handover; the fetcher loops, blocking on the handover, and deserializes each record it takes out.
  13. Finally the record is emitted via emitRecord.
  14. emitRecord does not send the record directly to the downstream task; it hands it to the task's output gate.

Next comes the analysis of the Elasticsearch sink. After the analysis above, the ES sink is actually quite simple.

First, let's look at what characterizes a sink. It is much like a source: it also extends AbstractRichFunction. The difference is that it implements SinkFunction, whose key method is invoke(), playing the same role that run() plays for a source.
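As a quick illustration, a minimal hand-written sink in this style could look like the sketch below (my own made-up example, not Flink code): open() is where connections would be set up, and invoke() is called once per incoming record.

import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;

// Hypothetical example sink that just prints every record it receives.
public class LoggingSink extends RichSinkFunction<String> {

	@Override
	public void open(Configuration parameters) throws Exception {
		// Set up clients / connections here, analogous to ElasticsearchSinkBase.open().
	}

	@Override
	public void invoke(String value, Context context) throws Exception {
		// Called once for every record that reaches the sink.
		System.out.println("sink received: " + value);
	}
}

With that pattern in mind, what does the ES sink class in Flink look like?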

public abstract class ElasticsearchSinkBase<T, C extends AutoCloseable> extends RichSinkFunction<T> implements CheckpointedFunction

You can see the class is written much like the Kafka source, and it also implements CheckpointedFunction. Let's look at its demo first:

public class Elasticsearch7SinkExample {

	public static void main(String[] args) throws Exception {

		final ParameterTool parameterTool = ParameterTool.fromArgs(args);

		if (parameterTool.getNumberOfParameters() < 2) {
			System.out.println("Missing parameters!\n" +
				"Usage: --numRecords  --index ");
			return;
		}

		final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
				env.enableCheckpointing(5000);

		DataStream<Tuple2<String, String>> source = env.generateSequence(0, parameterTool.getInt("numRecords") - 1)
			.flatMap(new FlatMapFunction<Long, Tuple2<String, String>>() {
				@Override
				public void flatMap(Long value, Collector<Tuple2<String, String>> out) {
					final String key = String.valueOf(value);
					final String message = "message #" + value;
					out.collect(Tuple2.of(key, message + "update #1"));
					out.collect(Tuple2.of(key, message + "update #2"));
				}
			});

		List<HttpHost> httpHosts = new ArrayList<>();
		httpHosts.add(new HttpHost("127.0.0.1", 9200, "http"));

		ElasticsearchSink.Builder<Tuple2<String, String>> esSinkBuilder = new ElasticsearchSink.Builder<>(
			httpHosts,
			(Tuple2<String, String> element, RuntimeContext ctx, RequestIndexer indexer) -> {
				indexer.add(createIndexRequest(element.f1, parameterTool));
				indexer.add(createUpdateRequest(element, parameterTool));
			});

		esSinkBuilder.setFailureHandler(
			new CustomFailureHandler(parameterTool.getRequired("index")));

		// this instructs the sink to emit after every element, otherwise they would be buffered
		esSinkBuilder.setBulkFlushMaxActions(1);

		source.addSink(esSinkBuilder.build());

		env.execute("Elasticsearch 7.x end to end sink test example");
	}

	private static class CustomFailureHandler implements ActionRequestFailureHandler {

		private static final long serialVersionUID = 942269087742453482L;

		private final String index;

		CustomFailureHandler(String index) {
			this.index = index;
		}

		@Override
		public void onFailure(ActionRequest action, Throwable failure, int restStatusCode, RequestIndexer indexer) throws Throwable {
			if (action instanceof IndexRequest) {
				Map<String, Object> json = new HashMap<>();
				json.put("data", ((IndexRequest) action).source());

				indexer.add(
					Requests.indexRequest()
						.index(index)
						.id(((IndexRequest) action).id())
						.source(json));
			} else {
				throw new IllegalStateException("unexpected");
			}
		}
	}

	private static IndexRequest createIndexRequest(String element, ParameterTool parameterTool) {
		Map<String, Object> json = new HashMap<>();
		json.put("data", element);

		String index;
		String type;

		if (element.startsWith("message #15")) {
			index = ":intentional invalid index:";
			type = ":intentional invalid type:";
		} else {
			index = parameterTool.getRequired("index");
		}

		return Requests.indexRequest()
			.index(index)
			.id(element)
			.source(json);
	}

	private static UpdateRequest createUpdateRequest(Tuple2<String, String> element, ParameterTool parameterTool) {
		Map<String, Object> json = new HashMap<>();
		json.put("data", element.f1);

		return new UpdateRequest(
				parameterTool.getRequired("index"),
				parameterTool.getRequired("type"),
				element.f0)
			.doc(json)
			.upsert(json);
	}
}

Notice that the ES sink is not constructed directly and thrown into env; instead a builder is created first, and the sink is built from it before being added. Why? This is the classic builder pattern. When the builder is created you pass an HttpHost list and an ElasticsearchSinkFunction; that function is where we turn the incoming data into the format ES expects and hand it to the ES client. The builder then sets a few more properties, and all of these are eventually passed to the ES sink, so the builder pattern lets you flexibly plug in whatever properties you want to inject into the sink. With the builder out of the way, what really matters is how and when invoke() and open() are called.

Following the same recipe as before, a sink is also a task, so eventually beforeInvoke() is called, and beforeInvoke() in turn calls open(). So we already know when open() is called; let's look at its implementation:

public void open(Configuration parameters) throws Exception {
		client = callBridge.createClient(userConfig);
		bulkProcessor = buildBulkProcessor(new BulkProcessorListener());
		requestIndexer = callBridge.createBulkProcessorIndexer(bulkProcessor, flushOnCheckpoint, numPendingRequests);
		failureRequestIndexer = new BufferingNoOpRequestIndexer();
	}

So open() is this simple: it creates the client, the bulkProcessor and the requestIndexer. These few lines show that all three are created by the callBridge, which is itself built from the arguments we passed in the user's main program. First, the client is our ES client, analogous to the KafkaConsumer: it is the client that talks to the ES server. Second, the bulkProcessor exists for the sake of batching. Flink is a stream processor, so why write in batches? Because pushing every single processed record to the server wastes a lot of time, just like writing rows to MySQL one by one over JDBC performs poorly, so the bulkProcessor is used to improve throughput. Finally, the requestIndexer is what accepts the user's individual requests: in the user's main program you can see that the user ultimately calls indexer.add(request), and that indexer is the requestIndexer. The overall structure is: the requestIndexer wraps the bulkProcessor, and the bulkProcessor wraps the client. First, the RequestIndexer code:

public void add(IndexRequest... indexRequests) {
		for (IndexRequest indexRequest : indexRequests) {
			if (flushOnCheckpoint) {
				numPendingRequestsRef.getAndIncrement();
			}
			this.bulkProcessor.add(indexRequest);
		}
	}

You can see that the RequestIndexer is only an intermediary: when add() is called, the request is actually added to the bulkProcessor, so let's look at the bulkProcessor:

private void internalAdd(DocWriteRequest<?> request) {
        Tuple<BulkRequest, Long> bulkRequestToExecute = null;
        this.lock.lock();

        try {
            this.ensureOpen();
            this.bulkRequest.add(request);
            bulkRequestToExecute = this.newBulkRequestIfNeeded();
        } finally {
            this.lock.unlock();
        }

        if (bulkRequestToExecute != null) {
            this.execute((BulkRequest)bulkRequestToExecute.v1(), (Long)bulkRequestToExecute.v2());
        }

    }

So the bulkProcessor's add() ultimately adds the request to a bulkRequest. This is easy to understand: each request carries one record, and these records accumulate into the bulkRequest (literally a bulk, i.e. batch, request), and add() takes a lock while doing so. Then how does this bulkRequest get pushed to the ES server? To answer that we need to look at the BulkProcessor class itself, starting with its constructor:

BulkProcessor(BiConsumer<BulkRequest, ActionListener<BulkResponse>> consumer, BackoffPolicy backoffPolicy,
            BulkProcessor.Listener listener, int concurrentRequests, int bulkActions, ByteSizeValue bulkSize,
            @Nullable TimeValue flushInterval, Scheduler scheduler, Runnable onClose,
            Supplier<BulkRequest> bulkRequestSupplier) {
        this.bulkActions = bulkActions;
        this.bulkSize = bulkSize.getBytes();
        this.bulkRequest = (BulkRequest)bulkRequestSupplier.get();
        this.bulkRequestSupplier = bulkRequestSupplier;
        this.bulkRequestHandler = new BulkRequestHandler(consumer, backoffPolicy, listener, scheduler, concurrentRequests);
        this.cancellableFlushTask = this.startFlushTask(flushInterval, scheduler);
        this.onClose = onClose;
    }

Did you spot the key thing? The constructor already calls startFlushTask. The name says it all: a flush task, and what else would it flush other than pushing the data to ES? But how do we know there is data to flush? A timer (flushInterval) is set, say 10 seconds; during those 10 seconds requests can be added to the bulkRequest, and once the interval elapses, client.bulkAsync is called to push everything in the bulkRequest to the ES server in one go. startFlushTask eventually ends up calling the following code:

private void execute() {
        BulkRequest bulkRequest = this.bulkRequest;
        long executionId = this.executionIdGen.incrementAndGet();
        this.bulkRequest = (BulkRequest)this.bulkRequestSupplier.get();
        this.execute(bulkRequest, executionId);
    }

It first takes the current bulkRequest, immediately creates a new one, and then sends the old one out. In case these lines are not obvious: the bulkRequest that has accumulated some data is saved into a temporary variable, a fresh bulkRequest is created to receive the user's subsequent requests, and then the data held in the temporary variable is pushed to the ES server. The whole thing keeps running on the timer, even when there is no data. So, looking back, what did open() do? It initialized all the components needed to talk to the ES server, and then simply waits for invoke() to produce data into the bulkRequest, which is pushed to ES whenever the timer fires.
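As an aside, here is a minimal, self-contained sketch of this "buffer plus timed flush" pattern (my own illustration, not Elasticsearch's BulkProcessor; the class name and the flush callback are made up). The important detail, as in execute() above, is that the filled buffer is swapped out under the lock and sent outside of it:

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

// Illustrative batcher: add() buffers records under a lock; a scheduled task periodically
// swaps the buffer for a fresh one and hands the old batch to a flush callback.
public class TimedBatcher<T> {

	private final Object lock = new Object();
	private final Consumer<List<T>> flusher;   // e.g. "send this batch to the server"
	private List<T> buffer = new ArrayList<>();

	public TimedBatcher(Consumer<List<T>> flusher, long flushIntervalMillis) {
		this.flusher = flusher;
		ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
		scheduler.scheduleAtFixedRate(this::flush, flushIntervalMillis, flushIntervalMillis, TimeUnit.MILLISECONDS);
	}

	public void add(T element) {
		synchronized (lock) {
			buffer.add(element);
		}
	}

	private void flush() {
		List<T> toSend;
		synchronized (lock) {
			if (buffer.isEmpty()) {
				return;                        // the timer fired but there is nothing to send
			}
			toSend = buffer;                   // keep the filled buffer in a local variable
			buffer = new ArrayList<>();        // start a fresh buffer for subsequent requests
		}
		flusher.accept(toSend);                // push the old batch outside the lock
	}
}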

Next, let's see how invoke() gets called. Clearly it is also driven by processInput(); here is the sink task's processInput():

protected void processInput(MailboxDefaultAction.Controller controller) throws Exception {
		InputStatus status = inputProcessor.processInput();
		if (status == InputStatus.MORE_AVAILABLE && recordWriter.isAvailable()) {
			return;
		}
		if (status == InputStatus.END_OF_INPUT) {
			controller.allActionsCompleted();
			return;
		}
		CompletableFuture<?> jointFuture = getInputOutputJointFuture(status);
		MailboxDefaultAction.Suspension suspendedDefaultAction = controller.suspendDefaultAction();
		jointFuture.thenRun(suspendedDefaultAction::resume);
	}

Following the call chain down, it eventually reaches the operator's processElement() method, which finally calls the ES sink's invoke():

public void processElement(StreamRecord<IN> element) throws Exception {
		sinkContext.element = element;
		userFunction.invoke(element.getValue(), sinkContext);
	}

invoke() itself is very simple: it calls the ElasticsearchSinkFunction that the user's main program passed to the builder, and that function just issues requests to the RequestIndexer, which end up in the bulkRequest and are pushed to the ES server when the timer fires.

To recap the execution flow of the ES sink:

  1. The user's main program creates an ElasticsearchSink.Builder; most importantly it passes in an ElasticsearchSinkFunction, which will later be called by the sink's invoke().
  2. After setting properties, build() is called and the resulting sink is added to env.
  3. As before, everything is assembled into a JobGraph for the JobManager and finally an ExecutionGraph.
  4. The ExecutionGraph hands the sink task to a TaskManager, where it is abstracted as a Task, and eventually StreamTask's invoke() is executed.
  5. Just like the Kafka source, beforeInvoke runs first and calls open(), initializing all the components that talk to the ES server: the client (the core class that sends data), the bulkProcessor (which batches data before handing it to the client), and the requestIndexer (the intermediary between the user's requests and the bulkProcessor's bulkRequest). At this point communication with ES has been set up.
  6. The run loop calls processInput(), which eventually calls the operator's processElement(), and processElement() calls the ES sink's invoke().
  7. The sink's invoke() calls the function passed in from the user's main program; inside that function the data is turned into requests and added to the bulkRequest via the requestIndexer.
  8. The bulkRequest is periodically sent to the ES server.

 

Summary: at its core, a source revolves around two methods, open() and run(), and a sink around open() and invoke(). Their open() methods serve clearly different purposes: the Kafka source's open() gathers Kafka metadata for checkpointing and has nothing to do with fetching data, while the sink's open() initializes all the components that talk to the external server and then just waits for invoke() to feed data into those components, which push it to the server on a timer.
