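The flatMap operator is similar to map: both receive every element of the data stream as input. The difference is that flatMap can emit zero, one, or several elements for each input, and it emits them through a Collector object. Let's go into the code and analyze it. Our usage example is as follows: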
// Compute the data
DataStream<WordWithCount> windowCount = text.flatMap(new FlatMapFunction<String, WordWithCount>() {
    @Override
    public void flatMap(String value, Collector<WordWithCount> out) throws Exception {
        String[] splits = value.split(" ");
        for (String word : splits) {
            out.collect(new WordWithCount(word, 1L));
        }
    }
});
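To make the "zero, one, or several outputs" contract concrete, here is a minimal plain-Java sketch that does not depend on Flink. `SketchCollector`, `SketchFlatMapFunction`, `applyFlatMap`, and `splitWords` are hypothetical names invented for this illustration; they only mimic the shape of Flink's Collector and FlatMapFunction and are not Flink APIs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class FlatMapSketch {
    /** Hypothetical stand-in for Flink's Collector: receives emitted records. */
    interface SketchCollector<T> {
        void collect(T record);
    }

    /** Hypothetical stand-in for FlatMapFunction: may emit any number of records per input. */
    interface SketchFlatMapFunction<IN, OUT> {
        void flatMap(IN value, SketchCollector<OUT> out);
    }

    /** Calls fn once per input element and gathers everything it emits, in order. */
    static <IN, OUT> List<OUT> applyFlatMap(List<IN> input, SketchFlatMapFunction<IN, OUT> fn) {
        List<OUT> result = new ArrayList<>();
        for (IN value : input) {
            fn.flatMap(value, result::add);
        }
        return result;
    }

    /** Splits each line into words; an empty line emits nothing at all. */
    static List<String> splitWords(List<String> lines) {
        return applyFlatMap(lines, (String line, SketchCollector<String> out) -> {
            for (String word : line.split(" ")) {
                if (!word.isEmpty()) {
                    out.collect(word);
                }
            }
        });
    }

    public static void main(String[] args) {
        System.out.println(splitWords(Arrays.asList("hello flink", "", "stream")));
    }
}
```

The first line yields two records, the empty line yields none, and the last line yields one, which is exactly the freedom that map does not have.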
The code of the flatMap function is as follows:
/**
 * Applies a FlatMap transformation on a {@link DataStream}. The
 * transformation calls a {@link FlatMapFunction} for each element of the
 * DataStream. Each FlatMapFunction call can return any number of elements
 * including none. The user can also extend {@link RichFlatMapFunction} to
 * gain access to other features provided by the
 * {@link org.apache.flink.api.common.functions.RichFunction} interface.
 *
 * @param flatMapper
 *            The FlatMapFunction that is called for each element of the
 *            DataStream
 *
 * @param <R>
 *            output type
 * @return The transformed {@link DataStream}.
 */
public <R> SingleOutputStreamOperator<R> flatMap(FlatMapFunction<T, R> flatMapper) {
    // Determine the output type of flatMapper
    TypeInformation<R> outType = TypeExtractor.getFlatMapReturnTypes(clean(flatMapper),
            getType(), Utils.getCallLocationName(), true);
    return transform("Flat Map", outType, new StreamFlatMap<>(clean(flatMapper)));
}
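We covered the clean function earlier. Its main job is to strip the reference to the enclosing (outer) class instance from the FlatMapFunction object, which the compiler stores in a synthetic field such as this$0, so that the function object can be serialized.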
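The following plain-Java sketch shows why a lingering outer reference is a problem for serialization. It does not reproduce Flink's cleaner; `OuterCaptureSketch`, `SerFn`, `hasOuterReference`, and `canSerialize` are hypothetical names made up for this illustration. The anonymous class reads a field of the outer instance on purpose, so that the compiler is guaranteed to keep the synthetic `this$...` field.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.lang.reflect.Field;

final class OuterCaptureSketch { // deliberately NOT Serializable
    /** Hypothetical serializable function type, used only in this sketch. */
    interface SerFn extends Serializable {
        String apply(String s);
    }

    private final String prefix;

    OuterCaptureSketch(String prefix) {
        this.prefix = prefix;
    }

    /** Anonymous inner class: it reads the outer field, so it holds the outer instance. */
    SerFn capturing() {
        return new SerFn() {
            @Override
            public String apply(String s) {
                return prefix + s;
            }
        };
    }

    /** Static nested class: carries no hidden reference to an outer instance. */
    static final class Standalone implements SerFn {
        @Override
        public String apply(String s) {
            return s;
        }
    }

    /** True if the object's class declares a synthetic outer-instance field (this$0, this$1, ...). */
    static boolean hasOuterReference(Object o) {
        for (Field f : o.getClass().getDeclaredFields()) {
            if (f.getName().startsWith("this$")) {
                return true;
            }
        }
        return false;
    }

    /** Tries Java serialization; returns false if some reachable object is not serializable. */
    static boolean canSerialize(Object o) {
        try (ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(o);
            return true;
        } catch (NotSerializableException e) {
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        SerFn inner = new OuterCaptureSketch(">").capturing();
        System.out.println(hasOuterReference(inner) + " " + canSerialize(inner));
        SerFn nested = new Standalone();
        System.out.println(hasOuterReference(nested) + " " + canSerialize(nested));
    }
}
```

The anonymous function itself is Serializable, but serialization still fails because the outer instance it drags along is not. Removing that hidden reference when the function does not actually need it is the problem clean addresses.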
The getType function returns the element type of the current stream. Stepping into it, the code is as follows:
/**
 * Gets the type of the stream.
 *
 * @return The type of the datastream.
 */
public TypeInformation<T> getType() {
    return transformation.getOutputType();
}
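This function calls getOutputType on the stream's Transformation object to obtain its output type. In this example the Transformation belongs to the source operator, that is, the DataStreamSource we discussed earlier.
Next, let's look at Utils.getCallLocationName(). This function reads the stack frames of the current thread to build a string identifying the place in user code where the API was called. That string is passed on as the functionName argument in the calls below.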
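The general technique of deriving a call location from stack frames can be sketched in plain Java as follows. This is not Flink's implementation: `CallLocationSketch`, `callLocationName`, and `userCode` are hypothetical names, and the `method(File:line)` output format is an assumption chosen for the illustration.

```java
final class CallLocationSketch {
    /**
     * Returns a description of the method that called this helper.
     * It searches for its own frame instead of relying on a fixed stack depth.
     */
    static String callLocationName() {
        StackTraceElement[] stack = Thread.currentThread().getStackTrace();
        for (int i = 0; i + 1 < stack.length; i++) {
            if ("callLocationName".equals(stack[i].getMethodName())) {
                StackTraceElement caller = stack[i + 1];
                return String.format("%s(%s:%d)",
                        caller.getMethodName(), caller.getFileName(), caller.getLineNumber());
            }
        }
        return "<unknown>";
    }

    /** Plays the role of user code that calls into the API. */
    static String userCode() {
        return callLocationName();
    }

    public static void main(String[] args) {
        System.out.println(userCode());
    }
}
```

A name of this kind is mainly useful in error messages, because it points the user to the line of their own program that triggered the problem.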
Let's continue with the getFlatMapReturnTypes function. The code is as follows:
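@PublicEvolving
public static <IN, OUT> TypeInformation<OUT> getFlatMapReturnTypes(
        FlatMapFunction<IN, OUT> flatMapInterface,
        TypeInformation<IN> inType,
        String functionName,
        boolean allowMissing) {
    return getUnaryOperatorReturnType(
            (Function) flatMapInterface,
            FlatMapFunction.class,
            0,                  // index of IN in FlatMapFunction<IN, OUT>
            1,                  // index of OUT in FlatMapFunction<IN, OUT>
            new int[]{1, 0},    // for lambdas: parameter 1 (the Collector), its generic 0 (OUT)
            inType,
            functionName,
            allowMissing);
}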
The getUnaryOperatorReturnType function returns the output data type of flatMapInterface. The code is as follows:
/**
 * Returns the unary operator's return type.
 *
 * This method can extract a type in 4 different ways:
 *
 * 1. By using the generics of the base class like MyFunction<X, Y, Z, IN, OUT>.
 *    This is what outputTypeArgumentIndex (in this example "4") is good for.
 *
 * 2. By using input type inference SubMyFunction<T, String, String, String, T>.
 *    This is what inputTypeArgumentIndex (in this example "0") and inType is good for.
 *
 * 3. By using the static method that a compiler generates for Java lambdas.
 *    This is what lambdaOutputTypeArgumentIndices is good for. Given that MyFunction has
 *    the following single abstract method:
 *
 *        void apply(IN value, Collector<OUT> value)
 *
 *    Lambda type indices allow the extraction of a type from lambdas. To extract the
 *    output type OUT from the function one should pass {@code new int[] {1, 0}}.
 *    "1" for selecting the parameter and 0 for the first generic in this type.
 *    Use {@code TypeExtractor.NO_INDEX} for selecting the return type of the lambda for
 *    extraction or if the class cannot be a lambda because it is not a single abstract
 *    method interface.
 *
 * 4. By using interfaces such as {@link TypeInfoFactory} or {@link ResultTypeQueryable}.
 *
 * See also comments in the header of this class.
 *
 * @param function Function to extract the return type from
 * @param baseClass Base class of the function
 * @param inputTypeArgumentIndex Index of input generic type in the base class specification (ignored if inType is null)
 * @param outputTypeArgumentIndex Index of output generic type in the base class specification
 * @param lambdaOutputTypeArgumentIndices Table of indices of the type argument specifying the input type. See example.
 * @param inType Type of the input elements (In case of an iterable, it is the element type) or null
 * @param functionName Function name
 * @param allowMissing Can the type information be missing (this generates a MissingTypeInfo for postponing an exception)
 * @param <IN> Input type
 * @param <OUT> Output type
 * @return TypeInformation of the return type of the function
 */
@SuppressWarnings("unchecked")
@PublicEvolving
public static <IN, OUT> TypeInformation<OUT> getUnaryOperatorReturnType(
        Function function,
        Class<?> baseClass,
        int inputTypeArgumentIndex,
        int outputTypeArgumentIndex,
        int[] lambdaOutputTypeArgumentIndices,
        TypeInformation<IN> inType,
        String functionName,
        boolean allowMissing) {
    Preconditions.checkArgument(inType == null || inputTypeArgumentIndex >= 0, "Input type argument index was not provided");
    Preconditions.checkArgument(outputTypeArgumentIndex >= 0, "Output type argument index was not provided");
    Preconditions.checkArgument(
            lambdaOutputTypeArgumentIndices != null,
            "Indices for output type arguments within lambda not provided");
    // explicit result type has highest precedence
    // If the function implements ResultTypeQueryable, return the output type
    // directly from its getProducedType method
    if (function instanceof ResultTypeQueryable) {
        return ((ResultTypeQueryable<OUT>) function).getProducedType();
    }
    // perform extraction
    try {
        final LambdaExecutable exec;
        try {
            // Check whether the function is a lambda expression
            exec = checkAndExtractLambda(function);
        } catch (TypeExtractionException e) {
            throw new InvalidTypesException("Internal error occurred.", e);
        }
        // If the function is implemented as a lambda, derive the return type
        // from the characteristics of the lambda expression
        if (exec != null) {
            // parameters must be accessed from behind, since JVM can add additional parameters e.g. when using local variables inside lambda function
            // paramLen is the total number of parameters of the provided lambda, it includes parameters added through closure
            final int paramLen = exec.getParameterTypes().length;
            final Method sam = TypeExtractionUtils.getSingleAbstractMethod(baseClass);
            // number of parameters the SAM of implemented interface has; the parameter indexing applies to this range
            final int baseParametersLen = sam.getParameterTypes().length;
            final Type output;
            if (lambdaOutputTypeArgumentIndices.length > 0) {
                output = TypeExtractionUtils.extractTypeFromLambda(
                        baseClass,
                        exec,
                        lambdaOutputTypeArgumentIndices,
                        paramLen,
                        baseParametersLen);
            } else {
                output = exec.getReturnType();
                TypeExtractionUtils.validateLambdaType(baseClass, output);
            }
            return new TypeExtractor().privateCreateTypeInfo(output, inType, null);
        } else { // Otherwise obtain the output type via reflection
            if (inType != null) {
                validateInputType(baseClass, function.getClass(), inputTypeArgumentIndex, inType);
            }
            return new TypeExtractor().privateCreateTypeInfo(baseClass, function.getClass(), outputTypeArgumentIndex, inType, null);
        }
    }
    catch (InvalidTypesException e) {
        if (allowMissing) {
            return (TypeInformation<OUT>) new MissingTypeInfo(functionName != null ? functionName : function.toString(), e);
        } else {
            throw e;
        }
    }
}
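The reflection branch relies on the fact that a class implementing a generic interface records the concrete type arguments in its class file. A minimal plain-Java sketch of that idea follows. It is far simpler than what TypeExtractor does: it inspects only directly implemented interfaces, and `GenericArgSketch`, `SketchFlatMap`, and `typeArgument` are hypothetical names invented for this illustration.

```java
import java.lang.reflect.ParameterizedType;
import java.lang.reflect.Type;
import java.util.List;

final class GenericArgSketch {
    /** Hypothetical two-parameter function interface: IN at index 0, OUT at index 1. */
    interface SketchFlatMap<IN, OUT> {
        void flatMap(IN value, List<OUT> out);
    }

    /**
     * Returns the type argument at the given index that the function's class
     * declares for baseInterface, or null if no generic information is recorded.
     */
    static Type typeArgument(Object function, Class<?> baseInterface, int index) {
        for (Type t : function.getClass().getGenericInterfaces()) {
            if (t instanceof ParameterizedType) {
                ParameterizedType p = (ParameterizedType) t;
                if (p.getRawType() == baseInterface) {
                    return p.getActualTypeArguments()[index];
                }
            }
        }
        return null;
    }

    /** Anonymous class: its class file records SketchFlatMap<String, Long>. */
    static SketchFlatMap<String, Long> anonymousImpl() {
        return new SketchFlatMap<String, Long>() {
            @Override
            public void flatMap(String value, List<Long> out) {
                out.add((long) value.length());
            }
        };
    }

    /** Lambda: the generated class implements only the raw interface. */
    static SketchFlatMap<String, Long> lambdaImpl() {
        return (value, out) -> out.add((long) value.length());
    }

    public static void main(String[] args) {
        System.out.println(typeArgument(anonymousImpl(), SketchFlatMap.class, 1));
        System.out.println(typeArgument(lambdaImpl(), SketchFlatMap.class, 1));
    }
}
```

For the anonymous class, index 1 resolves to Long, which mirrors the role of outputTypeArgumentIndex. For the lambda, nothing is found this way, which is why the method above takes a separate path for lambdas.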
The code of the checkAndExtractLambda function is as follows:
/**
 * Checks if the given function has been implemented using a Java 8 lambda. If yes, a LambdaExecutable
 * is returned describing the method/constructor. Otherwise null.
 *
 * @throws TypeExtractionException lambda extraction is pretty hacky, it might fail for unknown JVM issues.
 */
public static LambdaExecutable checkAndExtractLambda(Function function) throws TypeExtractionException {
    try {
        // get serialized lambda
        SerializedLambda serializedLambda = null;
        for (Class<?> clazz = function.getClass(); clazz != null; clazz = clazz.getSuperclass()) {
            try {
                // 1. When the functional interface extends Serializable, the class generated for the
                //    lambda expression gets a writeReplace method. This method returns a
                //    SerializedLambda and can be invoked via reflection.
                // 2. SerializedLambda is an object that describes the lambda expression. It is only
                //    available when the lambda is serializable (the functional interface extends Serializable).
                Method replaceMethod = clazz.getDeclaredMethod("writeReplace");
                replaceMethod.setAccessible(true);
                Object serialVersion = replaceMethod.invoke(function);
                // check if class is a lambda function
                if (serialVersion != null && serialVersion.getClass() == SerializedLambda.class) {
                    serializedLambda = (SerializedLambda) serialVersion;
                    break;
                }
            }
            catch (NoSuchMethodException e) {
                // thrown if the method is not there. fall through the loop
            }
        }
        // not a lambda method -> return null
        if (serializedLambda == null) {
            return null;
        }
        // find lambda method
        String className = serializedLambda.getImplClass();
        String methodName = serializedLambda.getImplMethodName();
        String methodSig = serializedLambda.getImplMethodSignature();
        Class<?> implClass = Class.forName(className.replace('/', '.'), true, Thread.currentThread().getContextClassLoader());
        // find constructor
        if (methodName.equals("<init>")) {
            Constructor<?>[] constructors = implClass.getDeclaredConstructors();
            for (Constructor<?> constructor : constructors) {
                if (getConstructorDescriptor(constructor).equals(methodSig)) {
                    return new LambdaExecutable(constructor);
                }
            }
        }
        // find method
        else {
            List<Method> methods = getAllDeclaredMethods(implClass);
            for (Method method : methods) {
                if (method.getName().equals(methodName) && getMethodDescriptor(method).equals(methodSig)) {
                    return new LambdaExecutable(method);
                }
            }
        }
        throw new TypeExtractionException("No lambda method found.");
    }
    catch (Exception e) {
        throw new TypeExtractionException("Could not extract lambda method out of function: " +
                e.getClass().getSimpleName() + " - " + e.getMessage(), e);
    }
}
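The writeReplace trick used above can be tried out in isolation with the JDK alone. The sketch below is a reduced version of the first half of the method, assuming the code runs on the classpath (no module restrictions). `LambdaProbeSketch`, `SerFunction`, and `extract` are hypothetical names for this illustration.

```java
import java.io.Serializable;
import java.lang.invoke.SerializedLambda;
import java.lang.reflect.Method;
import java.util.function.Function;

final class LambdaProbeSketch {
    /** Hypothetical functional interface that is also Serializable. */
    interface SerFunction<A, B> extends Function<A, B>, Serializable {
    }

    /** Returns the SerializedLambda describing the function, or null if it is not a serializable lambda. */
    static SerializedLambda extract(Object function) {
        for (Class<?> c = function.getClass(); c != null; c = c.getSuperclass()) {
            try {
                Method writeReplace = c.getDeclaredMethod("writeReplace");
                writeReplace.setAccessible(true);
                Object replacement = writeReplace.invoke(function);
                if (replacement instanceof SerializedLambda) {
                    return (SerializedLambda) replacement;
                }
            } catch (NoSuchMethodException e) {
                // this class has no writeReplace; continue with the superclass
            } catch (ReflectiveOperationException e) {
                throw new IllegalStateException("Could not invoke writeReplace", e);
            }
        }
        return null;
    }

    static SerFunction<String, Integer> lambdaImpl() {
        return s -> s.length();
    }

    static SerFunction<String, Integer> anonymousImpl() {
        return new SerFunction<String, Integer>() {
            @Override
            public Integer apply(String s) {
                return s.length();
            }
        };
    }

    public static void main(String[] args) {
        SerializedLambda lambda = extract(lambdaImpl());
        System.out.println(lambda.getImplClass() + " " + lambda.getImplMethodName()
                + " " + lambda.getImplMethodSignature());
        System.out.println(extract(anonymousImpl()));
    }
}
```

For the lambda, the three printed values are the implementing class, the name of the synthetic method holding the lambda body, and its descriptor. These are the same three values checkAndExtractLambda uses to locate the Method or Constructor. The anonymous class has no writeReplace method, so the probe returns null for it.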
For more details on lambda expressions, see https://blog.csdn.net/u012503481/article/details/100896507 and https://blog.csdn.net/lirx_tech/article/details/51303966
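Once the output type is available, the following call runs: transform("Flat Map", outType, new StreamFlatMap<>(clean(flatMapper)));
It first cleans the outer-class reference out of flatMapper and then creates a StreamFlatMap object. StreamFlatMap is an operator; its inheritance hierarchy is shown below:
[Figure: operator inheritance hierarchy]
Let's step into the transform function. The code is as follows: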
/**
 * Method for passing user defined operators along with the type
 * information that will transform the DataStream.
 *
 * @param operatorName
 *            name of the operator, for logging purposes
 * @param outTypeInfo
 *            the output type of the operator
 * @param operator
 *            the object containing the transformation logic
 * @param <R>
 *            type of the return stream
 * @return the data stream constructed
 */
@PublicEvolving
public <R> SingleOutputStreamOperator<R> transform(String operatorName, TypeInformation<R> outTypeInfo, OneInputStreamOperator<T, R> operator) {
    // read the output type of the input Transform to coax out errors about MissingTypeInfo
    transformation.getOutputType();
    OneInputTransformation<T, R> resultTransform = new OneInputTransformation<>(
            this.transformation,
            operatorName,
            operator,
            outTypeInfo,
            environment.getParallelism());
    @SuppressWarnings({ "unchecked", "rawtypes" })
    SingleOutputStreamOperator<R> returnStream = new SingleOutputStreamOperator(environment, resultTransform);
    // Add resultTransform to the transformations list in the execution environment (env)
    getExecutionEnvironment().addOperator(resultTransform);
    return returnStream;
}
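To summarize, this is what happens when an upstream stream calls the flatMap function:
1. The output type of the Function is obtained.
2. In the transform function, a Transformation object is created. It holds the input Transformation object, the operator, the output type, the name, and the parallelism, so the previous Transformation and the new one are now linked. A stream object wrapping the new Transformation is created, the Transformation is added to the list of transformations in env, and the stream object is returned. In fact, the whole Flink pipeline is assembled from Transformation objects: each one contains its predecessor Transformation, its own operator, the output type, the parallelism, the name, and so on. Linked together in this way, they form one stream chain.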
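The linking described in this summary can be sketched with a few lines of plain Java. The classes below are hypothetical and heavily simplified: `Node` stands in for a Transformation, `Env` for the execution environment, and the operator and output type are left out. The source name "Socket Stream" and the parallelism values are likewise made up for the example.

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

final class TransformationChainSketch {
    /** Simplified stand-in for a Transformation: it remembers its input. */
    static final class Node {
        final Node input; // null for a source
        final String name;
        final int parallelism;

        Node(Node input, String name, int parallelism) {
            this.input = input;
            this.name = name;
            this.parallelism = parallelism;
        }
    }

    /** Simplified stand-in for the environment's list of transformations. */
    static final class Env {
        final List<Node> transformations = new ArrayList<>();

        void addOperator(Node node) {
            transformations.add(node);
        }
    }

    /** Mirrors the shape of transform(): create the linked node, register it, return it. */
    static Node transform(Env env, Node input, String name, int parallelism) {
        Node result = new Node(input, name, parallelism);
        env.addOperator(result);
        return result;
    }

    /** Walks the input links backwards and returns the names from source to last node. */
    static List<String> lineage(Node last) {
        LinkedList<String> names = new LinkedList<>();
        for (Node n = last; n != null; n = n.input) {
            names.addFirst(n.name);
        }
        return names;
    }

    public static void main(String[] args) {
        Env env = new Env();
        Node source = new Node(null, "Socket Stream", 1);
        Node flatMap = transform(env, source, "Flat Map", 4);
        System.out.println(lineage(flatMap));
    }
}
```

Because every node points to its input, the full chain can be recovered from the last node alone by following the input links back to the source.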