Flink in Depth (11): The Call Flow of flatMap in the DataStream Class

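The flatMap operator resembles map in that both receive each element of the data stream as input. The difference is that flatMap can emit zero, one, or more elements per input element, and the emitted elements are handed to a Collector object. Let's walk through the code, starting from the following usage example: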

// Compute the word counts
DataStream<WordWithCount> windowCount = text.flatMap(new FlatMapFunction<String, WordWithCount>() {
			public void flatMap(String value, Collector<WordWithCount> out) throws Exception {
				String[] splits = value.split(" ");
				for (String word : splits) {
					out.collect(new WordWithCount(word, 1L));
				}
			}
		});
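To make the "zero, one, or more outputs" behavior concrete, here is a minimal plain-Java sketch that does not depend on Flink. SketchCollector, SketchFlatMap, and FlatMapSemanticsSketch are hypothetical stand-ins invented for this illustration; they only mimic the shape of Collector and FlatMapFunction.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for Flink's Collector: receives emitted records.
interface SketchCollector<T> {
    void collect(T record);
}

// Hypothetical stand-in for Flink's FlatMapFunction.
interface SketchFlatMap<IN, OUT> {
    void flatMap(IN value, SketchCollector<OUT> out);
}

class FlatMapSemanticsSketch {

    // Applies fn to every input element and gathers whatever it emits.
    static <IN, OUT> List<OUT> apply(List<IN> input, SketchFlatMap<IN, OUT> fn) {
        List<OUT> result = new ArrayList<>();
        for (IN element : input) {
            fn.flatMap(element, result::add);
        }
        return result;
    }

    // Splits each line on single spaces and drops empty tokens.
    static List<String> splitWords(List<String> lines) {
        return apply(lines, (line, out) -> {
            for (String word : line.split(" ")) {
                if (!word.isEmpty()) {
                    out.collect(word);
                }
            }
        });
    }

    public static void main(String[] args) {
        // "" emits nothing, "flink" emits one word, "hello big world" emits three.
        System.out.println(splitWords(Arrays.asList("", "flink", "hello big world")));
        // prints [flink, hello, big, world]
    }
}
```

Three input elements produce four output elements here, which a one-to-one map could not express.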

The code of the flatMap function is as follows:

/**
	 * Applies a FlatMap transformation on a {@link DataStream}. The
	 * transformation calls a {@link FlatMapFunction} for each element of the
	 * DataStream. Each FlatMapFunction call can return any number of elements
	 * including none. The user can also extend {@link RichFlatMapFunction} to
	 * gain access to other features provided by the
	 * {@link org.apache.flink.api.common.functions.RichFunction} interface.
	 *
	 * @param flatMapper
	 *            The FlatMapFunction that is called for each element of the
	 *            DataStream
	 *
	 * @param <R>
	 *            output type
	 * @return The transformed {@link DataStream}.
	 */
	public <R> SingleOutputStreamOperator<R> flatMap(FlatMapFunction<T, R> flatMapper) {
		// Determine the output type of flatMapper
		TypeInformation<R> outType = TypeExtractor.getFlatMapReturnTypes(clean(flatMapper),
				getType(), Utils.getCallLocationName(), true);
        
		return transform("Flat Map", outType, new StreamFlatMap<>(clean(flatMapper)));

	}

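We covered the clean function earlier. Its main job is to remove the FlatMapFunction object's reference to its enclosing class (held in the synthetic field this$0) so that the function object can be serialized.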
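The following plain-Java sketch illustrates why that outer reference matters. It is not Flink's closure cleaner and makes no claim about how clean is implemented; SketchFunction and OuterReferenceSketch are hypothetical names. It only shows that an anonymous class created inside an instance method can hold a hidden reference to a non-serializable enclosing object, and that this breaks Java serialization.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.lang.reflect.Field;

// Hypothetical serializable function type, standing in for FlatMapFunction.
interface SketchFunction extends Serializable {
    String apply(String value);
}

// The enclosing class is deliberately NOT Serializable.
class OuterReferenceSketch {
    private final String prefix;

    OuterReferenceSketch(String prefix) {
        this.prefix = prefix;
    }

    // Anonymous class that reads a field of the outer instance, so it must
    // keep a synthetic reference to that instance.
    SketchFunction capturing() {
        return new SketchFunction() {
            @Override
            public String apply(String value) {
                return prefix + value;
            }
        };
    }

    // Created in a static context: there is no outer instance to capture.
    static SketchFunction nonCapturing() {
        return new SketchFunction() {
            @Override
            public String apply(String value) {
                return "word:" + value;
            }
        };
    }

    // True if obj's class declares a field whose type is the enclosing class.
    static boolean holdsOuterReference(Object obj) {
        for (Field field : obj.getClass().getDeclaredFields()) {
            if (field.getType() == OuterReferenceSketch.class) {
                return true;
            }
        }
        return false;
    }

    // Attempts Java serialization and reports whether it succeeded.
    static boolean isSerializable(Object obj) {
        try (ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(obj);
            return true;
        } catch (NotSerializableException e) {
            return false;
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        SketchFunction inner = new OuterReferenceSketch("word:").capturing();
        System.out.println(holdsOuterReference(inner) + " " + isSerializable(inner)); // true false
        SketchFunction safe = nonCapturing();
        System.out.println(holdsOuterReference(safe) + " " + isSerializable(safe));   // false true
    }
}
```

In this sketch the capturing function genuinely needs the outer instance, so the only fix would be to restructure the code. The case a cleaner can repair is an outer reference that exists but is never used.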

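getType returns the element type of the current stream, i.e. the output type of the transformation behind it. Stepping into this function, the code is as follows: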

/**
	 * Gets the type of the stream.
	 *
	 * @return The type of the datastream.
	 */
	public TypeInformation<T> getType() {
		return transformation.getOutputType();
	}

This function calls getOutputType on the Transformation object to obtain its output type. Here the Transformation belongs to the source operator, i.e. the DataStreamSource object text discussed earlier.

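Next, let's look at Utils.getCallLocationName(). This function reads the stack frames of the current thread and builds a name from the frame of the calling user code (method name, file name, and line number). That name is passed on as the functionName argument.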
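The idea of deriving a call location from stack frames can be sketched in plain Java as follows. This is an illustration under assumptions, not Flink's code: CallLocationSketch and its methods are hypothetical, and the sketch uses new Throwable().getStackTrace() so that frame 0 is reliably the method that captured the trace.

```java
class CallLocationSketch {

    // Describes the frame `depth` levels above this method:
    // depth 0 is callLocation itself, depth 1 is its direct caller.
    static String callLocation(int depth) {
        StackTraceElement[] stack = new Throwable().getStackTrace();
        if (stack.length <= depth) {
            return "<unknown>";
        }
        StackTraceElement frame = stack[depth];
        return String.format("%s(%s:%d)",
                frame.getMethodName(), frame.getFileName(), frame.getLineNumber());
    }

    // Plays the role of user code calling an operator method.
    static String userCode() {
        return callLocation(1);
    }

    public static void main(String[] args) {
        // Prints the method name followed by the file name and line number.
        System.out.println(userCode());
    }
}
```

The depth parameter decides which frame is reported, so a library that wraps the call in several internal methods has to skip that many frames to reach the user's code.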

We continue with the getFlatMapReturnTypes function, whose code is as follows:

@PublicEvolving
	public static <IN, OUT> TypeInformation<OUT> getFlatMapReturnTypes(FlatMapFunction<IN, OUT> flatMapInterface, TypeInformation<IN> inType,
			String functionName, boolean allowMissing)
	{
		return getUnaryOperatorReturnType(
			(Function) flatMapInterface,
			FlatMapFunction.class,
			0,
			1,
			new int[]{1, 0},
			inType,
			functionName,
			allowMissing);
	}
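The index array new int[]{1, 0} passed above can be read as a path: pick parameter 1 of the method, then pick type argument 0 of that parameter's type. The sketch below walks such a path with plain Java reflection. LambdaIndexSketch and flatMapLike are hypothetical, List<Long> merely stands in for Collector<OUT>, and the sketch ignores details a real extractor has to handle, such as extra parameters added for captured variables.

```java
import java.lang.reflect.Method;
import java.lang.reflect.ParameterizedType;
import java.lang.reflect.Type;
import java.util.List;

class LambdaIndexSketch {

    // Hypothetical method shaped like flatMap(IN value, Collector<OUT> out);
    // List<Long> stands in for Collector<OUT>.
    static void flatMapLike(String value, List<Long> out) {
        out.add((long) value.length());
    }

    // Walks the index path: indices[0] picks a parameter, every further index
    // picks a type argument of the parameterized type reached so far.
    static Type resolve(Method method, int[] indices) {
        Type current = method.getGenericParameterTypes()[indices[0]];
        for (int i = 1; i < indices.length; i++) {
            current = ((ParameterizedType) current).getActualTypeArguments()[indices[i]];
        }
        return current;
    }

    // Resolves the path {1, 0} on flatMapLike.
    static Type outputTypeOfFlatMapLike() {
        try {
            Method m = LambdaIndexSketch.class.getDeclaredMethod(
                    "flatMapLike", String.class, List.class);
            return resolve(m, new int[]{1, 0});
        } catch (NoSuchMethodException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(outputTypeOfFlatMapLike()); // class java.lang.Long
    }
}
```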

getUnaryOperatorReturnType returns the output data type of flatMapInterface. Its code is as follows:

/**
	 * Returns the unary operator's return type.
	 *
	 * <p>This method can extract a type in 4 different ways:
	 *
	 * <p>1. By using the generics of the base class like MyFunction<X, Y, Z, IN, OUT>.
	 *    This is what outputTypeArgumentIndex (in this example "4") is good for.
	 *
	 * <p>2. By using input type inference SubMyFunction<T, String, String, String, T>.
	 *    This is what inputTypeArgumentIndex (in this example "0") and inType is good for.
	 *
	 * <p>3. By using the static method that a compiler generates for Java lambdas.
	 *    This is what lambdaOutputTypeArgumentIndices is good for. Given that MyFunction has
	 *    the following single abstract method:
	 *
	 * <pre>
	 * <code>
	 * void apply(IN value, Collector<OUT> value)
	 * </code>
	 * </pre>
	 *
	 * <p>Lambda type indices allow the extraction of a type from lambdas. To extract the
	 *    output type <b>OUT</b> from the function one should pass {@code new int[] {1, 0}}.
	 *    "1" for selecting the parameter and 0 for the first generic in this type.
	 *    Use {@code TypeExtractor.NO_INDEX} for selecting the return type of the lambda for
	 *    extraction or if the class cannot be a lambda because it is not a single abstract
	 *    method interface.
	 *
	 * <p>4. By using interfaces such as {@link TypeInfoFactory} or {@link ResultTypeQueryable}.
	 *
	 * <p>See also comments in the header of this class.
	 *
	 * @param function Function to extract the return type from
	 * @param baseClass Base class of the function
	 * @param inputTypeArgumentIndex Index of input generic type in the base class specification (ignored if inType is null)
	 * @param outputTypeArgumentIndex Index of output generic type in the base class specification
	 * @param lambdaOutputTypeArgumentIndices Table of indices of the type argument specifying the input type. See example.
	 * @param inType Type of the input elements (In case of an iterable, it is the element type) or null
	 * @param functionName Function name
	 * @param allowMissing Can the type information be missing (this generates a MissingTypeInfo for postponing an exception)
	 * @param <IN> Input type
	 * @param <OUT> Output type
	 * @return TypeInformation of the return type of the function
	 */
	@SuppressWarnings("unchecked")
	@PublicEvolving
	public static <IN, OUT> TypeInformation<OUT> getUnaryOperatorReturnType(
		Function function,
		Class<?> baseClass,
		int inputTypeArgumentIndex,
		int outputTypeArgumentIndex,
		int[] lambdaOutputTypeArgumentIndices,
		TypeInformation<IN> inType,
		String functionName,
		boolean allowMissing) {

		Preconditions.checkArgument(inType == null || inputTypeArgumentIndex >= 0, "Input type argument index was not provided");
		Preconditions.checkArgument(outputTypeArgumentIndex >= 0, "Output type argument index was not provided");
		Preconditions.checkArgument(
			lambdaOutputTypeArgumentIndices != null,
			"Indices for output type arguments within lambda not provided");

		// explicit result type has highest precedence
		// If the function implements ResultTypeQueryable, return the output type
		// directly from its getProducedType method
		if (function instanceof ResultTypeQueryable) {
			return ((ResultTypeQueryable<OUT>) function).getProducedType();
		}

		// perform extraction
		try {
			final LambdaExecutable exec;
			try {
				// Check whether function is a lambda expression
				exec = checkAndExtractLambda(function);
			} catch (TypeExtractionException e) {
				throw new InvalidTypesException("Internal error occurred.", e);
			}

			// If function is implemented as a lambda, derive the return type from the lambda's own signature
			if (exec != null) {

				// parameters must be accessed from behind, since JVM can add additional parameters e.g. when using local variables inside lambda function
				// paramLen is the total number of parameters of the provided lambda, it includes parameters added through closure
				final int paramLen = exec.getParameterTypes().length;

				final Method sam = TypeExtractionUtils.getSingleAbstractMethod(baseClass);

				// number of parameters the SAM of implemented interface has; the parameter indexing applies to this range
				final int baseParametersLen = sam.getParameterTypes().length;

				final Type output;
				if (lambdaOutputTypeArgumentIndices.length > 0) {
					output = TypeExtractionUtils.extractTypeFromLambda(
						baseClass,
						exec,
						lambdaOutputTypeArgumentIndices,
						paramLen,
						baseParametersLen);
				} else {
					output = exec.getReturnType();
					TypeExtractionUtils.validateLambdaType(baseClass, output);
				}

				return new TypeExtractor().privateCreateTypeInfo(output, inType, null);
			} else {
				// Otherwise obtain the output type via reflection
				if (inType != null) {
					validateInputType(baseClass, function.getClass(), inputTypeArgumentIndex, inType);
				}
				return new TypeExtractor().privateCreateTypeInfo(baseClass, function.getClass(), outputTypeArgumentIndex, inType, null);
			}
		} catch (InvalidTypesException e) {
			if (allowMissing) {
				return (TypeInformation<OUT>) new MissingTypeInfo(functionName != null ? functionName : function.toString(), e);
			} else {
				throw e;
			}
		}
	}
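The split between the lambda branch and the reflection branch exists because the two kinds of function objects expose different information. The plain-Java sketch below shows the difference under simplified assumptions; SketchMapper and GenericExtractionSketch are hypothetical names, and the lookup only inspects directly implemented interfaces.

```java
import java.lang.reflect.ParameterizedType;
import java.lang.reflect.Type;

// Hypothetical two-parameter function interface, standing in for FlatMapFunction.
interface SketchMapper<IN, OUT> {
    OUT map(IN value);
}

class GenericExtractionSketch {

    // Returns the type argument at `index` of SketchMapper as declared by the
    // function's class, or null if the class carries no generic signature.
    static Type typeArgument(SketchMapper<?, ?> function, int index) {
        for (Type iface : function.getClass().getGenericInterfaces()) {
            if (iface instanceof ParameterizedType
                    && ((ParameterizedType) iface).getRawType() == SketchMapper.class) {
                return ((ParameterizedType) iface).getActualTypeArguments()[index];
            }
        }
        return null;
    }

    // An anonymous class records SketchMapper<String, Integer> in its class file.
    static SketchMapper<String, Integer> anonymousImpl() {
        return new SketchMapper<String, Integer>() {
            @Override
            public Integer map(String value) {
                return value.length();
            }
        };
    }

    // The class generated for a lambda only implements the raw interface.
    static SketchMapper<String, Integer> lambdaImpl() {
        return value -> value.length();
    }

    public static void main(String[] args) {
        System.out.println(typeArgument(anonymousImpl(), 1)); // class java.lang.Integer
        System.out.println(typeArgument(lambdaImpl(), 1));    // null
    }
}
```

Index 1 plays the role of outputTypeArgumentIndex here. Since the lambda yields nothing through this route, its types have to be recovered from the synthetic method the compiler generated for it.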

The code of checkAndExtractLambda is as follows:

/**
	 * Checks if the given function has been implemented using a Java 8 lambda. If yes, a LambdaExecutable
	 * is returned describing the method/constructor. Otherwise null.
	 *
	 * @throws TypeExtractionException lambda extraction is pretty hacky, it might fail for unknown JVM issues.
	 */
	public static LambdaExecutable checkAndExtractLambda(Function function) throws TypeExtractionException {
		try {
			// get serialized lambda
			SerializedLambda serializedLambda = null;
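			for (Class<?> clazz = function.getClass(); clazz != null; clazz = clazz.getSuperclass()) {
				try {
					// 1. When the functional interface extends Serializable, the class generated for the
					//    lambda expression contains a writeReplace method. It returns a SerializedLambda
					//    and can be invoked via reflection.
					// 2. SerializedLambda is an object that describes the lambda expression. It is only
					//    available when the lambda is serializable (the functional interface extends Serializable).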
					Method replaceMethod = clazz.getDeclaredMethod("writeReplace");
					replaceMethod.setAccessible(true);
					Object serialVersion = replaceMethod.invoke(function);

					// check if class is a lambda function
					if (serialVersion != null && serialVersion.getClass() == SerializedLambda.class) {
						serializedLambda = (SerializedLambda) serialVersion;
						break;
					}
				}
				catch (NoSuchMethodException e) {
					// thrown if the method is not there. fall through the loop
				}
			}

			// not a lambda method -> return null
			if (serializedLambda == null) {
				return null;
			}

			// find lambda method
			String className = serializedLambda.getImplClass();
			String methodName = serializedLambda.getImplMethodName();
			String methodSig = serializedLambda.getImplMethodSignature();

			Class<?> implClass = Class.forName(className.replace('/', '.'), true, Thread.currentThread().getContextClassLoader());

			// find constructor
			if (methodName.equals("<init>")) {
				Constructor<?>[] constructors = implClass.getDeclaredConstructors();
				for (Constructor<?> constructor : constructors) {
					if (getConstructorDescriptor(constructor).equals(methodSig)) {
						return new LambdaExecutable(constructor);
					}
				}
			}
			// find method
			else {
				List<Method> methods = getAllDeclaredMethods(implClass);
				for (Method method : methods) {
					if (method.getName().equals(methodName) && getMethodDescriptor(method).equals(methodSig)) {
						return new LambdaExecutable(method);
					}
				}
			}
			throw new TypeExtractionException("No lambda method found.");
		}
		catch (Exception e) {
			throw new TypeExtractionException("Could not extract lambda method out of function: " +
				e.getClass().getSimpleName() + " - " + e.getMessage(), e);
		}
	}
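The writeReplace technique used above can be tried in isolation with only the JDK. The following is a minimal sketch, not Flink's implementation: SerializableFn and SerializedLambdaSketch are hypothetical names, and the sketch stops at reading the SerializedLambda instead of resolving the implementation method.

```java
import java.io.Serializable;
import java.lang.invoke.SerializedLambda;
import java.lang.reflect.Method;

// Hypothetical serializable functional interface.
interface SerializableFn<T, R> extends Serializable {
    R apply(T value);
}

class SerializedLambdaSketch {

    // Returns the SerializedLambda describing fn, or null if fn is not a lambda.
    static SerializedLambda extract(Object fn) {
        for (Class<?> clazz = fn.getClass(); clazz != null; clazz = clazz.getSuperclass()) {
            try {
                Method writeReplace = clazz.getDeclaredMethod("writeReplace");
                writeReplace.setAccessible(true);
                Object replacement = writeReplace.invoke(fn);
                if (replacement instanceof SerializedLambda) {
                    return (SerializedLambda) replacement;
                }
            } catch (NoSuchMethodException e) {
                // No writeReplace on this class; try the superclass.
            } catch (ReflectiveOperationException e) {
                throw new IllegalStateException(e);
            }
        }
        return null;
    }

    static SerializableFn<String, Integer> lambdaFn() {
        return value -> value.length();
    }

    static SerializableFn<String, Integer> anonymousFn() {
        return new SerializableFn<String, Integer>() {
            @Override
            public Integer apply(String value) {
                return value.length();
            }
        };
    }

    public static void main(String[] args) {
        SerializedLambda lambda = extract(lambdaFn());
        System.out.println(lambda.getImplClass());           // class that contains the lambda body
        System.out.println(lambda.getImplMethodName());      // synthetic method generated by the compiler
        System.out.println(lambda.getImplMethodSignature()); // JVM descriptor of that method
        System.out.println(extract(anonymousFn()));          // null, because it is not a lambda
    }
}
```

The three strings read from SerializedLambda are the inputs that the second half of checkAndExtractLambda uses to locate the matching Method or Constructor.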

For more details on lambda expressions, see https://blog.csdn.net/u012503481/article/details/100896507 and https://blog.csdn.net/lirx_tech/article/details/51303966

Once the output type is known, the next call is transform("Flat Map", outType, new StreamFlatMap<>(clean(flatMapper)));

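First the reference to the enclosing class is cleaned from flatMapper, then a StreamFlatMap object is created. StreamFlatMap is an operator, and its inheritance hierarchy is shown below: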

[Figure 1: operator inheritance hierarchy]

Now we step into the transform function, whose code is as follows:

/**
	 * Method for passing user defined operators along with the type
	 * information that will transform the DataStream.
	 *
	 * @param operatorName
	 *            name of the operator, for logging purposes
	 * @param outTypeInfo
	 *            the output type of the operator
	 * @param operator
	 *            the object containing the transformation logic
	 * @param <R>
	 *            type of the return stream
	 * @return the data stream constructed
	 */
	@PublicEvolving
	public <R> SingleOutputStreamOperator<R> transform(String operatorName, TypeInformation<R> outTypeInfo, OneInputStreamOperator<T, R> operator) {

		// read the output type of the input Transform to coax out errors about MissingTypeInfo
		transformation.getOutputType();
		OneInputTransformation<T, R> resultTransform = new OneInputTransformation<>(
				this.transformation,
				operatorName,
				operator,
				outTypeInfo,
				environment.getParallelism());

		@SuppressWarnings({ "unchecked", "rawtypes" })
		SingleOutputStreamOperator<R> returnStream = new SingleOutputStreamOperator(environment, resultTransform);
		// Add resultTransform to the transformations list of the execution environment (env)
		getExecutionEnvironment().addOperator(resultTransform);
        
		return returnStream;
	}

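To summarize the flow when flatMap is called on an upstream stream:

1. Obtain the output type of the Function.

2. In transform, create a Transformation object that holds the input Transformation, the operator, the output type, the name, and the parallelism. This links the previous Transformation to the new one. A stream object is then returned, and the new Transformation is added to the transformations list in env. The whole Flink pipeline is in fact assembled from Transformation objects: each one contains its upstream Transformation along with its own operator, output type, parallelism, and name, so together they are strung into a chain.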
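The linking described in the summary can be modeled in a few lines of plain Java. Everything in this sketch (SketchTransformation, SketchEnvironment, SketchStream) is a hypothetical, heavily simplified stand-in for the Flink classes discussed above; it leaves out operators and type information and keeps only the back-reference that forms the chain.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of a transformation node: it remembers its input node,
// a name, and a parallelism.
class SketchTransformation {
    final SketchTransformation input; // null for a source
    final String name;
    final int parallelism;

    SketchTransformation(SketchTransformation input, String name, int parallelism) {
        this.input = input;
        this.name = name;
        this.parallelism = parallelism;
    }
}

// Hypothetical environment holding the list of registered transformations.
class SketchEnvironment {
    final List<SketchTransformation> transformations = new ArrayList<>();
}

// Hypothetical stream wrapper: each operator call creates a node that links
// back to the current node and registers it with the environment.
class SketchStream {
    final SketchEnvironment env;
    final SketchTransformation transformation;

    SketchStream(SketchEnvironment env, SketchTransformation transformation) {
        this.env = env;
        this.transformation = transformation;
    }

    SketchStream transform(String operatorName, int parallelism) {
        SketchTransformation result =
                new SketchTransformation(transformation, operatorName, parallelism);
        env.transformations.add(result);
        return new SketchStream(env, result);
    }

    // Walks the input links back to the source and renders the chain.
    static String describe(SketchTransformation node) {
        StringBuilder sb = new StringBuilder(node.name);
        for (SketchTransformation t = node.input; t != null; t = t.input) {
            sb.insert(0, t.name + " -> ");
        }
        return sb.toString();
    }

    // In this sketch the source node is created directly and not registered.
    static SketchStream demoPipeline() {
        SketchEnvironment env = new SketchEnvironment();
        SketchStream source = new SketchStream(env, new SketchTransformation(null, "Source", 1));
        return source.transform("Flat Map", 4).transform("Map", 4);
    }

    public static void main(String[] args) {
        SketchStream stream = demoPipeline();
        System.out.println(describe(stream.transformation));   // Source -> Flat Map -> Map
        System.out.println(stream.env.transformations.size()); // 2
    }
}
```

Because every node keeps a reference to its input, the last node alone is enough to reconstruct the whole upstream chain.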

 
