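One point has to be made first: operators in Flink can hold state, and operator state is the basis on which Flink recovers from failures. The component that stores this state is called the state backend. State can be accessed by the program, and we can write our own code to access it. Broadcast relies on this: the stream is broadcast first, and the broadcast data is then read through a state handle.
Understanding operator state is central to learning Flink. How state is stored is covered in my other articles. One more remark: the data that really matters while a Flink job runs is the state, which is the intermediate result computed by each operator. This chapter covers only the common operators and does not explain state; it is mentioned here to remind readers to learn what state means in Flink.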
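map processes each element of type T in the stream and returns a new element of type R; the R elements form a new stream that continues downstream.
Below is the definition of map in the Scala API:
Creates a new DataStream by applying the given function to every element of this DataStream.
def map[R: TypeInformation](fun: T => R): DataStream[R] = {... implementation omitted ...}
About this signature: the parameter fun: T => R means that map takes a user-supplied function whose argument is a stream element of type T and whose result is an element of type R.
Example:
ds.map(x => {
x + 1
})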
Below is the definition of map in the Java API:
public <R> SingleOutputStreamOperator<R> map(MapFunction<T, R> mapper) {......}
The parameter is an implementation of the MapFunction<T, R> interface, whose source is:
@Public
@FunctionalInterface
public interface MapFunction<T, O> extends Function, Serializable {
/**
* The mapping method. Takes an element from the input data set and transforms it into exactly
* one element.
*
* @param value The input value.
* @return The transformed value
* @throws Exception This method may throw exceptions. Throwing an exception will cause the
* operation to fail and may trigger recovery.
*/
O map(T value) throws Exception;
}
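In Java, an interface with exactly one abstract method is a functional interface (the @FunctionalInterface annotation documents and enforces this), so an implementation can be written as a lambda expression:
Example: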
final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.fromElements(Tuple2.of(1L, 3L), Tuple2.of(1L, 5L), Tuple2.of(1L, 7L), Tuple2.of(1L, 4L), Tuple2.of(1L, 2L))
.keyBy(value -> value.f0)
.map(value -> value.f0)
.print();
flatMap takes one element and, based on that single element, may emit zero, one, or more elements.
Java version:
dataStream.flatMap(new FlatMapFunction<String, String>() {
@Override
public void flatMap(String value, Collector<String> out)
throws Exception {
for(String word: value.split(" ")){
out.collect(word);
}
}
});
Scala version:
dataStream.flatMap { str => str.split(" ") }
filter tests each element with custom logic: return true to let the element continue downstream, or false to discard it.
Java version:
dataStream.filter(new FilterFunction<Integer>() {
@Override
public boolean filter(Integer value) throws Exception {
return value != 0;
}
});
Scala version:
dataStream.filter { _ != 0 }
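keyBy logically splits a stream into disjoint partitions; all elements with the same key are assigned to the same partition. Different partitions can be handled by different tasks. Internally this is hash partitioning.
Because partitioning is hash-based, an object that does not override hashCode() (and so relies on Object.hashCode()) cannot be used as a key. Arrays of any type cannot be used as keys either.
One more remark: keyBy repartitions the data between parallel subtasks and breaks operator chaining, so it affects how many tasks the job has.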
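To make the hashing point concrete, here is a minimal plain-Java sketch with no Flink dependency. The class name KeyPartitionSketch and the method partitionFor are hypothetical, and the modulo rule is a simplification: Flink first maps the key's hash to a key group and then maps the key group to a subtask. What carries over is that routing depends only on hashCode(), which is why arrays, whose hashCode() is identity-based, are unsuitable as keys.

```java
import java.util.Arrays;

public class KeyPartitionSketch {
    // Simplified stand-in for key-to-subtask assignment (assumption: plain modulo,
    // whereas Flink goes through key groups). Only hashCode() decides the target.
    public static int partitionFor(Object key, int parallelism) {
        return Math.floorMod(key.hashCode(), parallelism);
    }

    public static void main(String[] args) {
        int parallelism = 4;
        // Equal keys have equal hash codes, so they always land in the same partition.
        System.out.println(partitionFor("a", parallelism) == partitionFor(new String("a"), parallelism));

        // Two arrays with equal contents are still different keys:
        // equals() and hashCode() on arrays are identity-based.
        int[] k1 = {1, 2};
        int[] k2 = {1, 2};
        System.out.println(k1.equals(k2));         // identity comparison
        System.out.println(Arrays.equals(k1, k2)); // content comparison
    }
}
```

If a composite key is needed, a tuple or a class with content-based equals() and hashCode() behaves like the String case above.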
Source definition:
public <K> KeyedStream<T, K> keyBy(KeySelector<T, K> key) {}
So keyBy takes an implementation of KeySelector. The KeySelector interface is defined as:
@Public
@FunctionalInterface
public interface KeySelector<IN, KEY> extends Function, Serializable {
KEY getKey(IN value) throws Exception;
}
Example:
Java version:
dataStream.keyBy(new KeySelector<Tuple2<Long, Long>, Long>() {
@Override
public Long getKey(Tuple2<Long, Long> value) throws Exception {
return value.f0;
}
});
Scala version:
1. ds.keyBy(new KeySelector[(Long,Long),Long] {
override def getKey(value: (Long, Long)): Long = value._1
})
The simplest Scala form is:
2. ds.keyBy(x=>x._1)
This is equivalent to form 1. What is deprecated is selecting the key by field position, as the Scala source states:
@deprecated("use [[DataStream.keyBy(KeySelector)]] instead")
def keyBy(fields: Int*): KeyedStream[T, JavaTuple] = asScalaStream(stream.keyBy(fields: _*))
So select keys with a KeySelector or a key function rather than with field indexes.
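reduce performs a rolling aggregation on a keyed stream. It combines the current element with the last reduced value and emits the new value; that value is then combined with the next element in the same way, and so on, so the resulting stream consists of the emitted values. Note that reduce can only be applied to a stream produced by keyBy. All elements with the same key are processed by the same parallel subtask, so when the stream has several keys the reductions run independently and the results are kept separately per key. Here is a Scala example: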
object Test{
def main(args: Array[String]): Unit = {
val env = StreamExecutionEnvironment.getExecutionEnvironment
env.fromCollection(List(
(1L, 3L),
(1L, 5L),
(1L,1L),
(1L,1L),
(4L, 7L),
(4L, 3L),
(1L, 2L)
)).keyBy(fun = new KeySelector[(Long, Long), Long] {
override def getKey(value: (Long, Long)): Long = value._1
}).reduce(new ReduceFunction[(Long, Long)] {
override def reduce(value1: (Long, Long), value2: (Long, Long)): (Long, Long) = (value1._1,value1._2+value2._2)
}).setParallelism(1).writeAsText("D:\\flink\\a.txt")
// emits one running sum per input element, kept separately for each key
env.execute("ExampleKeyedState")
}
}
Output:
(1,3)
(1,8)
(1,9)
(1,10)
(1,12)
(4,7)
(4,10)
The result shows that key 1 and key 4 each keep their own independent running result.
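The per-key rolling behaviour can be reproduced without Flink. The following is a minimal sketch in plain Java; the class RollingReduceSketch and the method rollingSum are hypothetical names, and a HashMap stands in for Flink's keyed state. It emits one value per input element, in arrival order, so the per-key sequences match the listing above while the interleaving between keys may differ from a parallel run.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class RollingReduceSketch {
    // Each input row is {key, value}. Returns the emitted "(key,runningSum)" strings.
    public static List<String> rollingSum(long[][] input) {
        Map<Long, Long> state = new HashMap<>(); // last reduced value per key
        List<String> emitted = new ArrayList<>();
        for (long[] element : input) {
            long key = element[0];
            // Combine the stored value for this key with the new element.
            long sum = state.merge(key, element[1], Long::sum);
            emitted.add("(" + key + "," + sum + ")");
        }
        return emitted;
    }

    public static void main(String[] args) {
        long[][] input = {{1, 3}, {1, 5}, {1, 1}, {1, 1}, {4, 7}, {4, 3}, {1, 2}};
        for (String s : rollingSum(input)) {
            System.out.println(s);
        }
    }
}
```

The key 1 sequence is 3, 8, 9, 10, 12 and the key 4 sequence is 7, 10: neither key ever sees the other's values.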
Below is the source of reduce; note the doc comment:
/**
* Creates a new [[DataStream]] by reducing the elements of this DataStream
* using an associative reduce function. An independent aggregate is kept per key. (Note the words "kept per key".)
*/
def reduce(reducer: ReduceFunction[T]): DataStream[T] = {...}
Now the source of the ReduceFunction interface:
@Public
@FunctionalInterface
public interface ReduceFunction<T> extends Function, Serializable {
/**
* The core method of ReduceFunction, combining two values into one value of the same type. The
* reduce function is consecutively applied to all values of a group until only a single value
* remains.
*
* @param value1 The first value to combine.
* @param value2 The second value to combine.
* @return The combined value of both input values.
* @throws Exception This method may throw exceptions. Throwing an exception will cause the
* operation to fail and may trigger recovery.
*/
T reduce(T value1, T value2) throws Exception;
}
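window: this operator applies only to keyed streams, that is, after keyBy. It groups elements into windows and each window is evaluated when it fires. Window results are independent: the function that follows window is computed separately for each key.
aggregate: aggregate processes the data of the current window through an AggregateFunction, which has four methods: createAccumulator, add, getResult, and merge. First, a simple windowed word count in Scala: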
object StreamingJob {
def main(args: Array[String]) {
val env = StreamExecutionEnvironment.getExecutionEnvironment
// val env = StreamExecutionEnvironment.createRemoteEnvironment("LOCALHOST",8081,"D:\\IT\\Project\\FlinkDemo\\target\\FlinkDemo-1.0-SNAPSHOT.jar")
val text = env.socketTextStream("localhost", 9999)
val counts = text.flatMap { _.toLowerCase.split("\\W+") filter { _.nonEmpty } }
.map { (_, 1) }
.keyBy(_._1)
.window(TumblingProcessingTimeWindows.of(Time.seconds(5)))
.sum(1)
counts.print()
env.execute("Window 333 WordCount")
}
}
To see how windows are computed independently per key, here is a Java example:
import org.apache.flink.api.common.functions.AggregateFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.sink.PrintSinkFunction;
import org.apache.flink.streaming.api.windowing.assigners.GlobalWindows;
import org.apache.flink.streaming.api.windowing.triggers.CountTrigger;
import org.apache.flink.streaming.api.windowing.triggers.PurgingTrigger;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
/**
* Tests an AggregateFunction by computing the average English score of each class.
* The window is count-based: CountTrigger.of(2) fires the window of the current key
* every time two elements have arrived for that key.
*/
public class TestAggFunctionOnWindow {
private static final Logger logger = LoggerFactory.getLogger(TestAggFunctionOnWindow.class);
public static void main(String[] args) throws Exception {
// Get the execution environment
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// Read the data
DataStream<Tuple3<String, String, Long>> input = env.fromElements(ENGLISH);
// Compute the average English score of each class
SingleOutputStreamOperator<Tuple2<String, Double>> ds = input.keyBy(0).window(GlobalWindows.create()).trigger(PurgingTrigger.of(CountTrigger.of(2))).aggregate(new MyAgg());
// ds.print();
ds.addSink(new PrintSinkFunction<>("这是我的自定义输出:", false));
env.execute("TestAggFunctionOnWindow");
}
public static final Tuple3[] ENGLISH = new Tuple3[] {
Tuple3.of("一班", "张三", 1L),
Tuple3.of("一班", "李四", 2L),
Tuple3.of("一班", "王五", 3L),
Tuple3.of("二班", "赵六", 4L),
Tuple3.of("二班", "小七", 5L),
Tuple3.of("二班", "小八", 6L),
};
}
class MyAgg implements AggregateFunction<Tuple3<String, String, Long>, Tuple3<String,Long, Long>, Tuple2<String,Double>> {
/**
* Creates the accumulator that holds the intermediate state.
* Tuple3<class name, total score, number of students>
*/
@Override
public Tuple3<String,Long, Long> createAccumulator() {
return new Tuple3<>("",0L, 0L);
}
/**
* Adds an element to the accumulator and returns the new accumulator.
*
* @param value the input element
* @param acc the accumulator
*
* @return the new accumulator
*/
@Override
public Tuple3<String,Long, Long> add(Tuple3<String, String, Long> value, Tuple3<String,Long, Long> acc) {
// acc.f0: class name
// acc.f1: total score
// acc.f2: number of students
// value.f0: class, value.f1: student name, value.f2: score
return new Tuple3<String, Long, Long>(value.f0,acc.f1 + value.f2, acc.f2 + 1L);
}
@Override
public Tuple2<String,Double> getResult(Tuple3<String,Long, Long> acc) {
return new Tuple2<>(acc.f0,((double) acc.f1) / acc.f2);
}
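@Override
public Tuple3<String,Long, Long> merge(Tuple3<String,Long, Long> acc1, Tuple3<String,Long, Long> acc2) {
// Not called in this example: merge is only needed for merging windows such as session windows.
// It still combines both accumulators so the function stays correct if it is ever called.
return new Tuple3<>(acc1.f0.isEmpty() ? acc2.f0 : acc1.f0, acc1.f1 + acc2.f1, acc1.f2 + acc2.f2);
}
}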
Result:
这是我的自定义输出::4> (一班,1.5)
这是我的自定义输出::2> (二班,4.5)
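Analysis of the result:
keyBy splits the stream into two partitions by class, and the logic after window is computed independently for each partition. The steps are:
The first partition receives: Tuple3.of("一班", "张三", 1L),
Tuple3.of("一班", "李四", 2L),
The element count reaches 2, so the window fires and yields: (一班,1.5)
The second partition receives: Tuple3.of("二班", "赵六", 4L),
Tuple3.of("二班", "小七", 5L),
Likewise the window fires and yields: (二班,4.5)
The third element of each class never completes a second pair, so it produces no output.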
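The firing rule walked through above can be sketched in plain Java without Flink. The class CountWindowAverageSketch and the method averages are hypothetical names; the ASCII keys class1 and class2 stand in for the two classes in the demo data. The map plays the role of the per-key accumulator, and removing the entry after emitting mimics the purge performed by PurgingTrigger.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class CountWindowAverageSketch {
    // Emits "(key,average)" each time a key has collected `count` scores,
    // then clears that key's accumulator (the purge step).
    public static List<String> averages(String[] keys, long[] scores, int count) {
        Map<String, long[]> acc = new LinkedHashMap<>(); // key -> {sum, elements seen}
        List<String> out = new ArrayList<>();
        for (int i = 0; i < keys.length; i++) {
            long[] a = acc.computeIfAbsent(keys[i], k -> new long[2]);
            a[0] += scores[i];
            a[1] += 1;
            if (a[1] == count) {
                out.add("(" + keys[i] + "," + ((double) a[0] / a[1]) + ")");
                acc.remove(keys[i]);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String[] keys = {"class1", "class1", "class1", "class2", "class2", "class2"};
        long[] scores = {1, 2, 3, 4, 5, 6};
        for (String s : averages(keys, scores, 2)) {
            System.out.println(s);
        }
    }
}
```

Scores 3 and 6 stay in accumulators that never reach the count, which matches the two-line output of the Flink job.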
You may have noticed that the result is not printed with: ds.print()
but with: ds.addSink(new PrintSinkFunction<>("这是我的自定义输出:", false));
Opening the source of print() shows:
@PublicEvolving
public DataStreamSink<T> print() {
PrintSinkFunction<T> printFunction = new PrintSinkFunction<>();
return addSink(printFunction).name("Print to Std. Out");
}
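So print() calls addSink underneath, with a new PrintSinkFunction<>(). Reading the source also shows that the printed output can carry a custom prefix, which is convenient for debugging.
After keyBy, the data is split by the specified key. Elements with the same key are assigned to the same window task (think of it as an independent thread), and the processing logic after window runs separately in each of them.
windowAll, by contrast, does not require keyBy before it. It gathers the elements of all keys into one window, so its parallelism can only be 1, whereas window can run with higher parallelism.
This point is very important; if it is not clear yet, stop here and reread it.
First the source of apply: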
public <R> SingleOutputStreamOperator<R> apply(WindowFunction<T, R, K, W> function) {
TypeInformation<R> resultType = getWindowFunctionReturnType(function, getInputType());
return apply(function, resultType);
}
Below is the WindowFunction interface:
/**
* Base interface for functions that are evaluated over keyed (grouped) windows.
*
* @param <IN> The type of the input value. // element type of the stream
* @param <OUT> The type of the output value. // element type emitted after processing
* @param <KEY> The type of the key. // type of the key
* @param <W> The type of {@code Window} that this window function can be applied on. // window type; Window has several implementations
*/
@Public
public interface WindowFunction<IN, OUT, KEY, W extends Window> extends Function, Serializable {
/**
* Evaluates the window and outputs none or several elements.
*
* @param key The key for which this window is evaluated.
* @param window The window that is being evaluated.
* @param input The elements in the window being evaluated.
* @param out A collector for emitting elements.
* @throws Exception The function may throw exceptions to fail the program and trigger recovery.
*/
void apply(KEY key, W window, Iterable<IN> input, Collector<OUT> out) throws Exception;
}
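apply is used after keyBy and window. For each key, it processes the elements collected in that key's window.
Below is a demo that transforms the elements of each window.
When does apply run?
It runs when the window fires.
Code: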
package com.pg.flink;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.datastream.WindowedStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.sink.PrintSinkFunction;
import org.apache.flink.streaming.api.functions.windowing.WindowFunction;
import org.apache.flink.streaming.api.windowing.windows.GlobalWindow;
import org.apache.flink.util.Collector;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
/**
* window(GlobalWindows.create()).trigger(PurgingTrigger.of(CountTrigger.of(2))) is equivalent to countWindow(2); see the source of countWindow.
* input.keyBy(new MyKeySelector());
* is the same as input.keyBy((KeySelector<Tuple3<String, String, Long>, String>) value -> value.f0);
* countWindow(2) builds a count window:
* the window fires when it has received two elements.
*/
public class WindowApply {
private static final Logger logger = LoggerFactory.getLogger(WindowApply.class);
public static void main(String[] args) throws Exception {
logger.info("Program started....");
// Get the execution environment
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// Read the data
DataStream<Tuple3<String, String, Long>> input = env.fromElements(ENGLISH);
// Key by class name, then transform the elements of each count window with apply
//KeyedStream<Tuple3<String, String, Long>, String> keyedStreams= input.keyBy((KeySelector<Tuple3<String, String, Long>, String>) value -> value.f0);
KeyedStream<Tuple3<String, String, Long>, String> keyedStreams= input.keyBy(new MyKeySelector());
WindowedStream<Tuple3<String, String, Long>, String, GlobalWindow> ws = keyedStreams.countWindow(2);
SingleOutputStreamOperator<Tuple3<String, String, Long>> ds = ws.apply(new MyWindowFunction());
ds.addSink(new PrintSinkFunction<>("这是我的自定义输出:", false));
env.execute("TestAggFunctionOnWindow");
}
public static final Tuple3[] ENGLISH = new Tuple3[] {
Tuple3.of("一班", "张三", 1L),
Tuple3.of("一班", "李四", 2L),
Tuple3.of("一班", "王五", 3L),
Tuple3.of("二班", "赵六", 4L),
Tuple3.of("二班", "小七", 5L),
Tuple3.of("二班", "小八", 6L),
};
public static class MyWindowFunction implements WindowFunction<Tuple3<String, String, Long>, Tuple3<String, String, Long>, String, GlobalWindow>{
@Override
public void apply(String s, GlobalWindow window, Iterable<Tuple3<String, String, Long>> input, Collector<Tuple3<String, String, Long>> out) throws Exception {
for (Tuple3<String, String, Long> e: input) {
out.collect(new Tuple3<>(e.f0+s,e.f1,e.f2*10));
}
}
}
public static class MyKeySelector implements KeySelector<Tuple3<String, String, Long>, String>{
@Override
public String getKey(Tuple3<String, String, Long> value) throws Exception {
return value.f0;
}
}
}
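The code above builds a count window on a stream keyed by class name. The data contains two classes, so after keyBy there are two independent window computations. A window fires once it holds two elements.
When a window fires, apply processes the elements of that window. Each class has three records here, but a window fires as soon as it has received two.
keyBy splits the stream by class name, so for the data above there are two independent window computations:
Window computation 1 receives: Tuple3.of("一班", "张三", 1L), Tuple3.of("一班", "李四", 2L), Tuple3.of("一班", "王五", 3L).
When the window fires (two elements received), apply is called on Tuple3.of("一班", "张三", 1L), Tuple3.of("一班", "李四", 2L), which become Tuple3.of("一班一班", "张三", 10L), Tuple3.of("一班一班", "李四", 20L).
Tuple3.of("一班", "王五", 3L) stays in a window that never receives a second element, so it is never emitted.
Window computation 2 receives: Tuple3.of("二班", "赵六", 4L), Tuple3.of("二班", "小七", 5L), Tuple3.of("二班", "小八", 6L).
Likewise, when the window fires the result is Tuple3.of("二班二班", "赵六", 40L), Tuple3.of("二班二班", "小七", 50L), and Tuple3.of("二班", "小八", 6L) is never emitted.
So the final output of the program, produced by the two independent window computations, is: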
这是我的自定义输出::2> (二班二班,赵六,40)
这是我的自定义输出::2> (二班二班,小七,50)
这是我的自定义输出::4> (一班一班,张三,10)
这是我的自定义输出::4> (一班一班,李四,20)
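Note: windowAll means the data is not partitioned by key, so all elements go to a single window. With windowAll, apply takes an AllWindowFunction instead of a WindowFunction. The two are very similar, so apply on windowAll is not covered further here.
As the name suggests, reduce on a window works on the window's data (split by key): the current element is combined with the next one and the result is returned; that result is combined with the following element, and so on, until the window fires. Flink applies the reduce function incrementally as elements arrive, and when the window fires the elements of the window have been merged into a single element that is emitted downstream.
Demo: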
package com.pg.flink;
import org.apache.flink.api.common.functions.ReduceFunction;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.datastream.WindowedStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.sink.PrintSinkFunction;
import org.apache.flink.streaming.api.functions.windowing.WindowFunction;
import org.apache.flink.streaming.api.windowing.windows.GlobalWindow;
import org.apache.flink.util.Collector;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
public class WindowReduceDemo {
private static final Logger logger = LoggerFactory.getLogger(WindowReduceDemo.class);
public static void main(String[] args) throws Exception {
logger.info("Program started....");
// Get the execution environment
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// Read the data
DataStream<Tuple3<String, String, Long>> input = env.fromElements(ENGLISH);
// Key by class name, then reduce each count window of two elements into one
//KeyedStream<Tuple3<String, String, Long>, String> keyedStreams= input.keyBy((KeySelector<Tuple3<String, String, Long>, String>) value -> value.f0);
KeyedStream<Tuple3<String, String, Long>, String> keyedStreams= input.keyBy(new WindowReduceDemo.MyKeySelector());
WindowedStream<Tuple3<String, String, Long>, String, GlobalWindow> ws = keyedStreams.countWindow(2);
SingleOutputStreamOperator<Tuple3<String, String, Long>> ds = ws.reduce(new MyReduce());
ds.addSink(new PrintSinkFunction<>("这是我的自定义输出:", false));
env.execute("TestAggFunctionOnWindow");
}
public static final Tuple3[] ENGLISH = new Tuple3[] {
Tuple3.of("一班", "张三", 1L),
Tuple3.of("一班", "李四", 2L),
Tuple3.of("一班", "王五", 3L),
Tuple3.of("二班", "赵六", 4L),
Tuple3.of("二班", "小七", 5L),
Tuple3.of("二班", "小八", 6L),
};
public static class MyReduce implements ReduceFunction<Tuple3<String, String, Long>>{
@Override
public Tuple3<String, String, Long> reduce(Tuple3<String, String, Long> value1, Tuple3<String, String, Long> value2) throws Exception {
return Tuple3.of(value1.f0+value2.f0,value1.f1+value2.f1, value1.f2+value2.f2); }
}
public static class MyKeySelector implements KeySelector<Tuple3<String, String, Long>, String> {
@Override
public String getKey(Tuple3<String, String, Long> value) throws Exception {
return value.f0;
}
}
}
// countWindow(2) means the window fires once it has received two elements
Result:
这是我的自定义输出::4> (一班一班,张三李四,3)
这是我的自定义输出::2> (二班二班,赵六小七,9)
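union concatenates two streams of the same data type into one stream. The element types must be identical, otherwise the streams cannot be merged.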
package com.pg.flink;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.streaming.api.datastream.*;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.sink.PrintSinkFunction;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
public class UnionDemo {
private static final Logger logger = LoggerFactory.getLogger(UnionDemo.class);
public static void main(String[] args) throws Exception {
logger.info("Program started....");
// Get the execution environment
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// Read the data
DataStream<String> input01 = env.fromElements(ONE);
DataStream<String> input02 = env.fromElements(TWO);
DataStream<String> ds=input01.union(input02);
ds.addSink(new PrintSinkFunction<>("这是我的自定义输出:", false));
env.execute("TestAggFunctionOnWindow");
}
public static final String[] ONE = new String[] {
Tuple3.of("一班", "张三", 1L).toString(),
Tuple3.of("一班", "李四", 2L).toString(),
Tuple3.of("一班", "王五", 3L).toString()
};
public static final String[] TWO = new String[]{
Tuple3.of("二班", "赵六", 4L).toString(),
Tuple3.of("二班", "小七", 5L).toString(),
Tuple3.of("二班", "小八", 6L).toString()
};
}
Result:
这是我的自定义输出::7> (二班,小七,5)
这是我的自定义输出::6> (二班,赵六,4)
这是我的自定义输出::8> (二班,小八,6)
这是我的自定义输出::1> (一班,李四,2)
这是我的自定义输出::2> (一班,王五,3)
这是我的自定义输出::8> (一班,张三,1)
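Note: window join has fairly strict requirements:
1. For event-time windows, both streams must have timestamps and a watermark strategy assigned before the join.
2. The watermarks of the two streams must not drift far apart. Watermarks are derived from event time, so the event times in the two streams should not differ greatly.
These two conditions mean that the joined streams should be related at the business level, ideally produced at about the same time. If their times differ too much, state grows rapidly and the windows do not fire.
The code structure of a window join: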
stream.join(otherStream)
.where(<KeySelector>)
.equalTo(<KeySelector>)
.window(<WindowAssigner>)
.apply(<JoinFunction>);
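join can only be used with windows.
In code, a window join first calls .join() on a DataStream to combine the two streams, which yields a JoinedStreams; .where() and .equalTo() then specify the join key of each stream; finally .window() defines the window and .apply() takes the join function that processes each matching pair.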
import org.apache.flink.api.common.eventtime.SerializableTimestampAssigner;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.functions.JoinFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
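// Internally each stream record is wrapped in an object: its value holds the actual data, and it also carries a Timestamp field.
// The Timestamp can be derived from a time field in the value or produced some other way; it is used to judge how current the event is.
// 1. Timestamp and Watermark are both generated from the event's time field.
// 2. Timestamp and Watermark are two different things, and once generated they no longer depend on the event data (so it does not matter if the event later drops the field they were generated from).
// 3. Event data and Timestamp correspond one to one (an event travels through the stream as a StreamRecord, whose members include value and timestamp).
// 4. Once generated, a Watermark has no direct relation to the event data; it travels through the stream as a message just like events do (Watermark and StreamRecord share the parent class StreamElement).
// 5. Downstream window operators compare Timestamps with the Watermark to decide whether an event is late.
// 6. It is mainly time-based operators such as windows that use the Watermark to judge whether an event is late.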
public class windowJoin {
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(1);
// Define the two streams
DataStream<Tuple2<String, Long>> stream1 = env.fromElements(
Tuple2.of("a", 1L),
Tuple2.of("b", 2L),
Tuple2.of("a", 2000L),
Tuple2.of("b", 2000L),
Tuple2.of("c", 8000L)
).assignTimestampsAndWatermarks(WatermarkStrategy.<Tuple2<String, Long>>forMonotonousTimestamps() // watermark generator for monotonically increasing timestamps
.withTimestampAssigner(new SerializableTimestampAssigner<Tuple2<String, Long>>() {
@Override
public long extractTimestamp(Tuple2<String, Long> stringLongTuple2, long l) {
// System.out.println("pggggg#######################"+ stringLongTuple2.f1);
return stringLongTuple2.f1;
}
}
)
);
DataStream<Tuple2<String, Long>> stream2 = env.fromElements(
Tuple2.of("a",4L),
Tuple2.of("b", 6L),
Tuple2.of("a", 6L),
Tuple2.of("b", 4000L),
Tuple2.of("d", 9000L)
).assignTimestampsAndWatermarks(WatermarkStrategy.<Tuple2<String, Long>>forMonotonousTimestamps()
.withTimestampAssigner(new SerializableTimestampAssigner<Tuple2<String, Long>>() {
@Override
public long extractTimestamp(Tuple2<String, Long> stringLongTuple2, long l) {
return stringLongTuple2.f1;
}
}
)
);
stream1.join(stream2)
.where(data -> data.f0)
.equalTo(data -> data.f0)
.window(TumblingEventTimeWindows.of(Time.seconds(5)))
.apply(new JoinFunction<Tuple2<String, Long>, Tuple2<String, Long>, String>() {
@Override
public String join(Tuple2<String, Long> left, Tuple2<String, Long> right) throws Exception {
return left + "=>" + right;
}
})
.print();
env.execute();
}
}
Result:
// (a,1)=>(a,4)
// (a,1)=>(a,6)
// (a,2000)=>(a,4)
// (a,2000)=>(a,6)
// (b,2)=>(b,6)
// (b,2)=>(b,4000)
// (b,2000)=>(b,6)
// (b,2000)=>(b,4000)
Note: the records with keys c and d are both dropped, so this is an inner join,
similar to SELECT * FROM tab1 INNER JOIN tab2 ON tab1.id1 = tab2.id2
An inner join in MySQL drops rows that find no match, and a Flink window join does the same.
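The inner-join behaviour of the tumbling-window example can be checked with a plain-Java sketch that needs no Flink. WindowJoinSketch, Event, join, and demo are hypothetical names. Two records are paired only when they have the same key and fall into the same 5-second window, where the window index is computed as timestamp / windowSize (an assumption that holds for non-negative timestamps and windows aligned at zero).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class WindowJoinSketch {
    public static final class Event {
        final String key;
        final long ts;
        public Event(String key, long ts) { this.key = key; this.ts = ts; }
        @Override public String toString() { return "(" + key + "," + ts + ")"; }
    }

    // Pairs every left/right record that shares both the key and the tumbling window.
    public static List<String> join(List<Event> left, List<Event> right, long windowSize) {
        List<String> out = new ArrayList<>();
        for (Event l : left) {
            for (Event r : right) {
                boolean sameKey = l.key.equals(r.key);
                boolean sameWindow = l.ts / windowSize == r.ts / windowSize;
                if (sameKey && sameWindow) {
                    out.add(l + "=>" + r);
                }
            }
        }
        return out;
    }

    // The data of the tumbling-window example, joined over 5000 ms windows.
    public static List<String> demo() {
        List<Event> s1 = Arrays.asList(new Event("a", 1), new Event("b", 2),
                new Event("a", 2000), new Event("b", 2000), new Event("c", 8000));
        List<Event> s2 = Arrays.asList(new Event("a", 4), new Event("b", 6),
                new Event("a", 6), new Event("b", 4000), new Event("d", 9000));
        return join(s1, s2, 5000);
    }

    public static void main(String[] args) {
        demo().forEach(System.out::println);
    }
}
```

The sketch yields the same eight pairs as the job, in a different order; the c and d records share a window but not a key, so neither appears.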
public class windowJoin {
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// Define the two streams
DataStream<Tuple2<String, Long>> stream1 = env.fromElements(
Tuple2.of("a", 1L),
Tuple2.of("b", 2L),
Tuple2.of("a", 2000L),
Tuple2.of("b", 2000L),
Tuple2.of("c", 2001L)
);
DataStream<Tuple2<String, Long>> stream2 = env.fromElements(
Tuple2.of("a",4L),
Tuple2.of("b", 6L),
Tuple2.of("a", 6L),
Tuple2.of("b", 4000L),
Tuple2.of("d", 20021L)
);
stream1.join(stream2)
.where(data -> data.f0)
.equalTo(data -> data.f0)
.window(GlobalWindows.create()).trigger(PurgingTrigger.of(CountTrigger.of(4)))
// The window fires once four records that satisfy the join condition have arrived; the two streams hold exactly four `a` records, so the join is computed.
// If the count were reached with stream1's records alone (stream1 has exactly two `a` records),
// the window would fire before any stream2 data was present, and that firing would produce no output.
.apply(new JoinFunction<Tuple2<String, Long>, Tuple2<String, Long>, String>() {
@Override
public String join(Tuple2<String, Long> left, Tuple2<String, Long> right) throws Exception {
return left + "=>" + right;
}
})
.print();
env.execute();
}
}
Result:
(a,1)=>(a,4)
(a,1)=>(a,6)
(a,2000)=>(a,4)
(a,2000)=>(a,6)
(b,2)=>(b,6)
(b,2)=>(b,4000)
(b,2000)=>(b,6)
(b,2000)=>(b,4000)
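Thanks to the author of the post linked below for this explanation.
join can only join two streams within the same window. In practice data is often out of order or delayed, so the two streams progress at different rates and matching records can fall into different windows, where they can no longer be joined. For this case Flink provides interval join on KeyedStream: intervalJoin connects two KeyedStreams and joins elements with the same key when their event times lie within a relative time interval.
Consider a hypothetical case: while buying a product, a user fills in a delivery address and places an order. This produces two streams: an order stream with user id, product id, order time, order amount, delivery id, and so on, and a delivery-information stream with delivery id, recipient, recipient contact, recipient address, and so on. The system sends the order record first and the delivery record 1 to 5 seconds later. The task is to compute, in real time, the top 100 regions by order amount. Here orderStream comes first and addressStream follows, and the two must be joined on delivery id before the top 100 regions can be computed. Since orderStream is 1 to 5 seconds ahead of addressStream, the following relation holds, with lefttime and righttime as the lower and upper bounds: orderStream.time+lefttime<=addressStream.time<=orderStream.time+righttime
Source: Flink intervalJoin usage and principle analysis
https://blog.51cto.com/u_9928699/3702825
Note that intervalJoin differs from join: join can be used on almost any pair of streams, whereas intervalJoin works only on KeyedStreams and only with event time; it cannot be used with count windows such as countWindow.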
package com.test.demo.stream;
import org.apache.flink.api.common.eventtime.SerializableTimestampAssigner;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.co.ProcessJoinFunction;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.Collector;
public class windowIntervalJoin {
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// Define the two streams
DataStream<Tuple2<String, Long>> stream1 = env.fromElements(
Tuple2.of("a", 1664367613000L),
Tuple2.of("b", 1664367613000L),
Tuple2.of("a", 1664367613000L),
Tuple2.of("b", 1664367613000L),
Tuple2.of("c", 1664367613000L)
).assignTimestampsAndWatermarks(WatermarkStrategy.<Tuple2<String, Long>>forMonotonousTimestamps().withTimestampAssigner(new SerializableTimestampAssigner<Tuple2<String, Long>>() {
@Override
public long extractTimestamp(Tuple2<String, Long> element, long recordTimestamp) {
return element.f1;
}
}));
DataStream<Tuple2<String, Long>> stream2 = env.fromElements(
Tuple2.of("a", 1664367613150L),
Tuple2.of("b", 1664367613150L),
Tuple2.of("a", 1664367613150L),
Tuple2.of("b", 1664367613150L),
Tuple2.of("d", 1664367613150L)
).assignTimestampsAndWatermarks(WatermarkStrategy.<Tuple2<String, Long>>forMonotonousTimestamps().withTimestampAssigner(new SerializableTimestampAssigner<Tuple2<String, Long>>() {
@Override
public long extractTimestamp(Tuple2<String, Long> element, long recordTimestamp) {
return element.f1;
}
}));
// stream1.ts + lowerBound <= stream2.ts <= stream1.ts + upperBound (this inequality is the key to interval join)
stream1.keyBy(x->x.f0)
.intervalJoin(stream2.keyBy(x->x.f0))
.between(Time.milliseconds(-100), Time.milliseconds(200)) // change 200 to 100 and there is no output, because the timestamps differ by 150 ms
.process(new ProcessJoinFunction<Tuple2<String, Long>, Tuple2<String, Long>, String>() {
@Override
public void processElement(Tuple2<String, Long> left, Tuple2<String, Long> right, Context ctx, Collector<String> out) throws Exception {
out.collect(left.toString()+">>>"+right.toString());
}
}).print("output >>>>>>>>");
env.execute();
}
}
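The bounds check that between(...) expresses in the interval join demo can be written as a one-line predicate. The sketch below is plain Java without Flink; IntervalJoinSketch and inInterval are hypothetical names, and both bounds are treated as inclusive. It uses the two timestamps from the demo, which are 150 ms apart.

```java
public class IntervalJoinSketch {
    // True when leftTs + lowerBound <= rightTs <= leftTs + upperBound.
    public static boolean inInterval(long leftTs, long rightTs, long lowerBound, long upperBound) {
        return leftTs + lowerBound <= rightTs && rightTs <= leftTs + upperBound;
    }

    public static void main(String[] args) {
        long left = 1664367613000L;  // timestamp used in stream1
        long right = 1664367613150L; // timestamp used in stream2, 150 ms later

        // between(-100 ms, 200 ms): 150 lies inside [-100, 200], so the pair is joined.
        System.out.println(inInterval(left, right, -100, 200));
        // between(-100 ms, 100 ms): 150 lies outside [-100, 100], so nothing is joined.
        System.out.println(inInterval(left, right, -100, 100));
    }
}
```

This is why narrowing the upper bound from 200 to 100 in the demo leaves the job without output.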
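connect and union both combine two streams into one. The difference is that union requires both streams to have the same element type and fails otherwise, while connect has no such requirement. connect does require that the processing applied to the connected streams returns a single output type. In other words, connect suits two streams of different types that are processed into one stream of a uniform type.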
package com.pg.flink;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.streaming.api.datastream.ConnectedStreams;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.co.CoProcessFunction;
import org.apache.flink.streaming.api.functions.sink.PrintSinkFunction;
import org.apache.flink.util.Collector;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
public class connectDemo {
private static final Logger logger = LoggerFactory.getLogger(connectDemo.class);
public static void main(String[] args) throws Exception {
logger.info("Program started....");
// Get the execution environment
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// Read the data
DataStreamSource<Tuple3<String, String, Long>> input01 = env.fromElements(ONE);
DataStreamSource<Tuple2<String, String>> input02 = env.fromElements(TWO);
ConnectedStreams<Tuple3<String, String, Long>, Tuple2<String, String>> connectedStreams = input01.connect(input02);
SingleOutputStreamOperator<Tuple3<String, String, Long>> ds = connectedStreams.process(new CoProcessFunction<Tuple3<String, String, Long>, Tuple2<String, String>, Tuple3<String, String, Long>>() {
@Override
public void processElement1(Tuple3<String, String, Long> value, Context ctx, Collector<Tuple3<String, String, Long>> out) throws Exception {
// ctx.output(new OutputTag<>("哈哈"),value);
out.collect(value);
}
@Override
public void processElement2(Tuple2<String, String> value, Context ctx, Collector<Tuple3<String, String, Long>> out) throws Exception {
out.collect(Tuple3.of(value.f0,value.f1,1000L));
}
});
ds.addSink(new PrintSinkFunction<>("这是我的自定义输出:", false));
env.execute("TestAggFunctionOnWindow");
}
public static final Tuple3[] ONE = new Tuple3[] {
Tuple3.of("一班", "张三", 1L),
Tuple3.of("一班", "李四", 2L),
Tuple3.of("一班", "王五", 3L)
};
public static final Tuple2[] TWO = new Tuple2[]{
Tuple2.of("二班", "赵六"),
Tuple2.of("二班", "小七"),
Tuple2.of("二班", "小八")
};
}
Result:
这是我的自定义输出::3> (二班,小八,1000)
这是我的自定义输出::2> (二班,小七,1000)
这是我的自定义输出::1> (二班,赵六,1000)
这是我的自定义输出::4> (一班,王五,3)
这是我的自定义输出::2> (一班,张三,1)
这是我的自定义输出::3> (一班,李四,2)
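coGroup can only be used with windows.
coGroup groups two streams by key; elements with the same key are assigned to the same window, and the apply function then processes the data of that window.
dataStream.coGroup(otherStream)
.where(new KeySelector(){…}).equalTo(new KeySelector(){…}) // records with equal keys go to the same group
.window(TumblingEventTimeWindows.of(Time.seconds(3)))
.apply (new CoGroupFunction () {…});
Suppose we have two streams: A.coGroup(B)
The coGroup method of the CoGroupFunction interface takes three parameters:
(Iterable first, Iterable second, Collector out)
first: the elements from stream A,
second: the elements from stream B.
first and second belong to the same window, and within that window the elements of A and B share the same key.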
import org.apache.flink.api.common.functions.CoGroupFunction;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.GlobalWindows;
import org.apache.flink.streaming.api.windowing.triggers.CountTrigger;
import org.apache.flink.streaming.api.windowing.triggers.PurgingTrigger;
import org.apache.flink.util.Collector;
import java.util.Iterator;
public class windowcoGroup {
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// Define the two streams
DataStream<Tuple2<String, Long>> stream1 = env.fromElements(
Tuple2.of("a", 1L),
Tuple2.of("b", 2L),
Tuple2.of("a", 2000L),
Tuple2.of("a", 3000L),
Tuple2.of("b", 2000L),
Tuple2.of("c", 2001L)
);
DataStream<Tuple2<String, Long>> stream2 = env.fromElements(
Tuple2.of("a",4L),
Tuple2.of("b", 6L),
Tuple2.of("a", 6L),
Tuple2.of("b", 4000L),
Tuple2.of("d", 20021L)
);
stream1.coGroup(stream2).where(new KeySelector<Tuple2<String, Long>, String>() {
@Override
public String getKey(Tuple2<String, Long> value) throws Exception {
return value.f0;
}
}).equalTo(new KeySelector<Tuple2<String, Long>, String>() {
@Override
public String getKey(Tuple2<String, Long> value) throws Exception {
return value.f0;
}
}) .window(GlobalWindows.create()).trigger(PurgingTrigger.of(CountTrigger.of(2)))
// The window of a key fires each time two elements with that key have arrived, counted across both streams
.apply(new CoGroupFunction<Tuple2<String, Long>, Tuple2<String, Long>, String>() {
@Override
public void coGroup(Iterable<Tuple2<String, Long>> first, Iterable<Tuple2<String, Long>> second, Collector<String> out) throws Exception {
// out.collect(first.toString());
String s1="";
String s2="";
for (Object next : first) {
Tuple2 record = (Tuple2)next;
s1 += "first=>"+record.toString();
}
for (Object next : second) {
Tuple2 record = (Tuple2)next;
s2 += "second=>"+record.toString();
}
out.collect(s1 + ">>>>>>>>>>>>>>>>" + s2);
// for (Object next : second) {
// Tuple2 record = (Tuple2)next;
// out.collect("second=>"+record.toString());
// }
}
})
.print();
env.execute();
}
}
Result:
窗口被触发: first=>(a,1)first=>(a,2000)>>>>>>>>>>>>>>>>
窗口被触发: first=>(a,3000)>>>>>>>>>>>>>>>>second=>(a,4)
窗口被触发: first=>(b,2)first=>(b,2000)>>>>>>>>>>>>>>>>
窗口被触发: >>>>>>>>>>>>>>>>second=>(b,6)second=>(b,4000)
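As shown, whenever a window fires, the elements in the two iterables all share the same key. coGroup can therefore implement join; in fact, join is implemented on top of coGroup.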
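To illustrate how a join can be built from what coGroup delivers, here is a small plain-Java sketch without Flink. CoGroupJoinSketch, innerJoin, and leftOuterJoin are hypothetical names, and the two lists stand in for the first and second iterables of one fired window. An inner join is the cross product of the two groups; because coGroup is also invoked when one side is empty, an outer join becomes possible as well.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class CoGroupJoinSketch {
    // Inner join: pair every element of `first` with every element of `second`.
    // If either side is empty, nothing is emitted.
    public static List<String> innerJoin(List<String> first, List<String> second) {
        List<String> out = new ArrayList<>();
        for (String l : first) {
            for (String r : second) {
                out.add(l + "=>" + r);
            }
        }
        return out;
    }

    // Left outer join: keep unmatched left elements, paired with "null".
    public static List<String> leftOuterJoin(List<String> first, List<String> second) {
        if (second.isEmpty()) {
            List<String> out = new ArrayList<>();
            for (String l : first) {
                out.add(l + "=>null");
            }
            return out;
        }
        return innerJoin(first, second);
    }

    public static void main(String[] args) {
        List<String> first = Arrays.asList("(a,1)", "(a,2000)");
        List<String> second = Arrays.asList("(a,4)");
        System.out.println(innerJoin(first, second));
        System.out.println(leftOuterJoin(Arrays.asList("(c,2001)"), Collections.<String>emptyList()));
    }
}
```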
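The "co" in coMap is the first two letters of connect. As the name suggests, coMap applies a map operation to connected streams and returns a new stream.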
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.co.CoMapFunction;
public class coMapDemo {
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// Define the two streams
DataStream<Tuple2<String, Long>> stream1 = env.fromElements(
Tuple2.of("a", 1L),
Tuple2.of("b", 2L),
Tuple2.of("a", 2000L),
Tuple2.of("a", 3000L),
Tuple2.of("b", 2000L),
Tuple2.of("c", 2001L)
);
DataStream<Tuple2<String, Long>> stream2 = env.fromElements(
Tuple2.of("a",4L),
Tuple2.of("b", 6L),
Tuple2.of("a", 6L),
Tuple2.of("b", 4000L),
Tuple2.of("d", 20021L)
);
stream1.connect(stream2).map(new CoMapFunction<Tuple2<String, Long>, Tuple2<String, Long>, String>(){
@Override
public String map1(Tuple2<String, Long> value) throws Exception {
return value.toString();
}
@Override
public String map2(Tuple2<String, Long> value) throws Exception {
return value.f0;
}
}).print();
env.execute();
}
}
Result:
1> b
3> b
2> a
4> d
8> a
4> (a,2000)
5> (a,3000)
3> (b,2)
6> (b,2000)
7> (c,2001)
2> (a,1)
The same can be done with connectStream.process(CoProcessFunction<>); see the connect code above.
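The prefix co is the first two letters of connect. As the name suggests, coFlatMap applies a flatMap operation to a connected stream and returns a new stream.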
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.streaming.api.datastream.ConnectedStreams;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.co.CoProcessFunction;
import org.apache.flink.streaming.api.functions.sink.PrintSinkFunction;
import org.apache.flink.util.Collector;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
public class connectDemo {
private static final Logger logger = LoggerFactory.getLogger(connectDemo.class);
public static void main(String[] args) throws Exception {
logger.info("程序开始运行....");
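// Get the execution environment
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// Read the data
DataStreamSource<Tuple3<String, String, Long>> input01 = env.fromElements(ONE);
DataStreamSource<Tuple2<String, String>> input02 = env.fromElements(TWO);
ConnectedStreams<Tuple3<String, String, Long>, Tuple2<String, String>> connectedStreams = input01.connect(input02);
SingleOutputStreamOperator<Tuple3<String, String, Long>> ds = connectedStreams.process(new CoProcessFunction<Tuple3<String, String, Long>, Tuple2<String, String>, Tuple3<String, String, Long>>() {
@Override
public void processElement1(Tuple3<String, String, Long> value, Context ctx, Collector<Tuple3<String, String, Long>> out) throws Exception {
// Elements of the first stream already have the output type: forward them unchanged
out.collect(value);
}
@Override
public void processElement2(Tuple2<String, String> value, Context ctx, Collector<Tuple3<String, String, Long>> out) throws Exception {
// Elements of the second stream are padded to a Tuple3 with a constant third field
out.collect(Tuple3.of(value.f0,value.f1,1000L));
}
});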
ds.addSink(new PrintSinkFunction<>("这是我的自定义输出:", false));
env.execute("TestAggFunctionOnWindow");
}
public static final Tuple3[] ONE = new Tuple3[] {
Tuple3.of("一班", "张三", 1L),
Tuple3.of("一班", "李四", 2L),
Tuple3.of("一班", "王五", 3L)
};
public static final Tuple2[] TWO = new Tuple2[]{
Tuple2.of("二班", "赵六"),
Tuple2.of("二班", "小七"),
Tuple2.of("二班", "小八")
};
}
Result:
3> 你
4> a
4> b
4> c
4> d
3> 吃
3> 西
3> 瓜
6> 我
6> 爱
6> 小
6> 花
7> 北
7> 京
7> 大
7> 小
7> 学
The same can be done with connectStream.process(CoProcessFunction<>); see the connect code above.
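The listing in this part goes through CoProcessFunction, so the flatMap idea itself is not shown. The following plain-Java sketch models it under stated assumptions: `TwoInputFlatMap` is a hypothetical, simplified stand-in for Flink's `CoFlatMapFunction<IN1, IN2, OUT>` (a `Consumer` plays the role of the `Collector`), and the two inputs are processed one after the other, whereas in a real job the interleaving of the two streams is not deterministic.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

final class CoFlatMapSketch {

    // Hypothetical stand-in for CoFlatMapFunction: one method per input,
    // both writing zero or more elements of the same output type.
    interface TwoInputFlatMap<A, B, R> {
        void flatMap1(A value, Consumer<R> out);
        void flatMap2(B value, Consumer<R> out);
    }

    static final TwoInputFlatMap<String, String, String> SPLIT =
            new TwoInputFlatMap<String, String, String>() {
                @Override
                public void flatMap1(String value, Consumer<String> out) {
                    // First input: split on spaces, one output per word
                    for (String word : value.split(" ")) {
                        out.accept(word);
                    }
                }

                @Override
                public void flatMap2(String value, Consumer<String> out) {
                    // Second input: one output per character
                    for (char c : value.toCharArray()) {
                        out.accept(String.valueOf(c));
                    }
                }
            };

    // Feeds both inputs through the function and gathers everything it emits.
    static List<String> run(List<String> first, List<String> second) {
        List<String> out = new ArrayList<>();
        for (String s : first) {
            SPLIT.flatMap1(s, out::add);
        }
        for (String s : second) {
            SPLIT.flatMap2(s, out::add);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(run(List.of("a b"), List.of("cd")));
    }
}
```

The point of the two-method shape is that each input can have its own type and its own splitting logic while both feed a single output stream.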
Processes the stream iteratively: elements that satisfy the exit condition are emitted, and the rest are fed back for another round.
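import org.apache.flink.api.common.functions.FilterFunction;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.IterativeStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
public class interateDemo {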
private static final Logger logger = LoggerFactory.getLogger(interateDemo.class);
public static void main(String[] args) throws Exception {
logger.info("程序开始运行....");
// 获取执行环境
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
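env.setParallelism(1); // The feedback (loop) stream must have the same parallelism as the input stream, otherwise the job fails.
// A fromElements source has parallelism 1 and this cannot be changed,
// so the parallelism is set to 1 globally here
// Read the data
DataStreamSource<Tuple2<String, Long>> initialStream = env.fromElements(ONE);
IterativeStream<Tuple2<String, Long>> iteration = initialStream.iterate();
DataStream<Tuple2<String, Long>> iterationBody = iteration.map(new MapFunction<Tuple2<String, Long>, Tuple2<String, Long>>() {
@Override
public Tuple2<String, Long> map(Tuple2<String, Long> value) throws Exception {
// Each pass through the loop decrements the second field by 1
Long age = value.f1 - 1;
return Tuple2.of(value.f0, age);
}
});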
// Keep elements with f1 > 0: they are fed back into the loop. Only elements with f1 <= 0 leave it
DataStream<Tuple2<String, Long>> loop = iterationBody.filter(new FilterFunction<Tuple2<String, Long>>() {
@Override
public boolean filter(Tuple2<String, Long> value) throws Exception {
return value.f1 > 0;
}
});
// Close the loop: everything in the loop stream is sent back to the iteration head
iteration.closeWith(loop);
// Output: elements that have reached f1 <= 0
iterationBody.filter(new FilterFunction<Tuple2<String, Long>>() {
@Override
public boolean filter(Tuple2<String, Long> value) throws Exception {
return value.f1 <= 0;
}
}).print();
iteration.print();
env.execute("TestAggFunctionOnWindow");
}
private static final Tuple2[] ONE = new Tuple2[] {
Tuple2.of("张三", 1L),
Tuple2.of("李四", 2L),
Tuple2.of("王二", 3L),
};
}
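Note: the intermediate result of each map pass is not lost. Every pass emits its result into iterationBody, so both the feedback filter and the output filter see it.
Result: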
(张三,0)
(李四,0)
(王二,0)
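The control flow of the iteration can be modelled without Flink. The sketch below is an assumption-laden simplification: `IterateSketch` is a hypothetical name, a queue stands in for the iteration head (initial input plus fed-back elements), and only the numeric field is tracked. It shows why every input eventually leaves the loop with the value 0.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

final class IterateSketch {

    // Model of the loop: the body subtracts 1; values still > 0 are fed back,
    // values <= 0 leave the loop and become output.
    static List<Long> iterate(List<Long> input) {
        Deque<Long> head = new ArrayDeque<>(input); // iteration head: input plus feedback
        List<Long> output = new ArrayList<>();
        while (!head.isEmpty()) {
            long body = head.poll() - 1;  // corresponds to iteration.map(...)
            if (body > 0) {
                head.add(body);           // corresponds to closeWith(loop)
            } else {
                output.add(body);         // corresponds to the f1 <= 0 output filter
            }
        }
        return output;
    }

    public static void main(String[] args) {
        System.out.println(iterate(List.of(1L, 2L, 3L))); // prints [0, 0, 0]
    }
}
```

Unlike this finite model, a streaming iteration has no natural end: the job keeps running as long as elements can still arrive at the head.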
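A partition here corresponds to the parallelism, and can be thought of as a thread that processes data. When an operator processes data, the data is split according to the configured number of partitions; each partition activates one subTask thread, and each thread handles one partition.
Partitions the data by a selected field of each record.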
import org.apache.flink.api.common.functions.Partitioner;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
public class CustomPartion {
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// env.setParallelism(3);
DataStream<Tuple2<String, Long>> dataStream = env.fromElements(
new Tuple2<>("北京", 1L),
new Tuple2<>("上海", 3L),
new Tuple2<>("天津", 5L),
new Tuple2<>("河南", 99L));
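DataStream<Tuple2<String, Long>> tuple2DataStream = dataStream.partitionCustom(
(Partitioner<String>) (key, numPartitions) -> Math.floorMod(key.hashCode(), numPartitions), // partitioning logic based on the key; floorMod keeps the index in [0, numPartitions)
(KeySelector<Tuple2<String, Long>, String>) value -> value.f0); // use the first field of the Tuple2 as the partition key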
tuple2DataStream.print();
env.execute();
}
}
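A custom partitioner has to return an index between 0 and numPartitions - 1. Since `String.hashCode()` can be negative, a plain `%` may yield a negative index, and a hard-coded divisor may exceed the real downstream parallelism. The sketch below, with the hypothetical helper `partitionFor`, shows the difference using a fixed negative hash value.

```java
final class PartitionIndexSketch {

    // Maps any hash value, including negative ones, to a valid partition index.
    static int partitionFor(int hash, int numPartitions) {
        return Math.floorMod(hash, numPartitions);
    }

    public static void main(String[] args) {
        int hash = -7;
        System.out.println(hash % 4);              // prints -3: not a valid index
        System.out.println(partitionFor(hash, 4)); // prints 1: always in [0, 4)
    }
}
```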
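For example, suppose upstream operator A has 4 subtasks (parallelism 4), downstream operator B has 2 subtasks (parallelism 2), and the data exchange mode is rebalance. Each upstream subtask first draws a random number i that decides where its first record goes. With n downstream subtasks, the first record from A's SubTask1 is sent to B's subtask (i+1)%n and i is updated to (i+1)%n; the second record is sent to subtask (i+1)%n computed from the new i, and so on, so records are distributed round-robin. A's SubTask2 does the same with its own random starting point. (The +1 is simply the round-robin step to the next subtask; the resulting index always lies between 0 and n-1.)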
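The round-robin rule described above fits in a few lines. This is a sketch of the selection logic only, not Flink's partitioner class: `RoundRobinSketch` and its members are hypothetical names, and the starting index is passed in so the behaviour is reproducible (in the description above it is drawn at random).

```java
import java.util.Random;

final class RoundRobinSketch {
    private final int numChannels; // n: number of downstream subtasks
    private int next;              // i: last channel used, starts at a random value

    RoundRobinSketch(int numChannels, int start) {
        this.numChannels = numChannels;
        this.next = start;
    }

    // i = (i + 1) % n, then send the record to channel i.
    int selectChannel() {
        next = (next + 1) % numChannels;
        return next;
    }

    public static void main(String[] args) {
        int n = 2;
        RoundRobinSketch subtask1 = new RoundRobinSketch(n, new Random().nextInt(n));
        for (int record = 0; record < 4; record++) {
            System.out.println("record " + record + " -> channel " + subtask1.selectChannel());
        }
    }
}
```

Because every upstream subtask keeps its own counter and its own random start, the first records of different subtasks do not all land on the same downstream subtask.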
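When the upstream and downstream operators have different parallelism, the default data exchange mode is rebalance; when they have the same parallelism, the default is forward.
forward is also an operator in Flink. It only passes data from an upstream subtask to the downstream subtask in the same partition, with no shuffle, so it does not belong to the shuffle-type operators.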
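rescale: DataStream -> DataStream. Subtasks are regrouped and rebalance (round-robin) is applied within each group, so data travels over a smaller range.
As shown in the figure below, suppose the upstream has 2 partitions (two subtasks) and the downstream has 4. With rebalance, every upstream subtask sends records round-robin to all downstream subtasks; with rescale, the upstream and downstream subtasks are split evenly into 2 groups, and records are sent round-robin only within each group.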
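The difference between the two modes comes down to which downstream subtasks an upstream subtask may send to. The sketch below is a simplified model, not Flink code: `RescaleSketch` and both methods are hypothetical, subtasks are numbered from 0, the downstream parallelism is assumed to be a multiple of the upstream one, and assigning consecutive indices to a group is an assumption made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

final class RescaleSketch {

    // rebalance: every upstream subtask cycles through all downstream subtasks.
    static List<Integer> rebalanceTargets(int downstream) {
        List<Integer> targets = new ArrayList<>();
        for (int i = 0; i < downstream; i++) {
            targets.add(i);
        }
        return targets;
    }

    // rescale: each upstream subtask cycles only through its own group.
    static List<Integer> rescaleTargets(int upstreamIndex, int upstream, int downstream) {
        if (downstream % upstream != 0) {
            throw new IllegalArgumentException("sketch only covers downstream = k * upstream");
        }
        int groupSize = downstream / upstream;
        List<Integer> targets = new ArrayList<>();
        for (int i = 0; i < groupSize; i++) {
            targets.add(upstreamIndex * groupSize + i);
        }
        return targets;
    }

    public static void main(String[] args) {
        System.out.println(rebalanceTargets(4));     // prints [0, 1, 2, 3]
        System.out.println(rescaleTargets(0, 2, 4)); // prints [0, 1]
        System.out.println(rescaleTargets(1, 2, 4)); // prints [2, 3]
    }
}
```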
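Sends data completely at random: the downstream subtask that receives each record from an upstream task is chosen randomly.
Under the hood, shuffle is implemented by ShufflePartitioner.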
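Global partitioning is another special partitioning mode. Calling .global() sends all records of the input stream to the first parallel subtask of the downstream operator. This effectively forces the downstream parallelism to 1, so use it with great caution: it can put heavy pressure on the job.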
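Note: broadcast partitioning is not the same as broadcast variables. Broadcast variables use the same broadcast() method, but with arguments, whereas broadcast partitioning calls it with no arguments.
It broadcasts a copy of every record to all subtasks of the downstream operator. Every parallel thread receives the record, which means the record is processed repeatedly. So far I have not found a practical use for this.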
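import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
public class tt {
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(4); // four parallel subtasks, matching the output shown after the code
DataStreamSource<String> d1 = env.fromElements("name");
d1.broadcast().map(new MapFunction<String, String>() {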
@Override
public String map(String value) throws Exception {
return value;
}
}).print();
env.execute();
}
}
Output:
3> name
4> name
2> name
1> name
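Analysis: the parallelism is set to 4 in the code, so the work runs on four cores, that is, four subTask threads. Each thread prints the record, which is the duplication mentioned above.
If anyone knows a practical use for this, please leave a comment.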
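The duplication can be reproduced with a tiny model. This is a plain-Java sketch rather than Flink code: `BroadcastSketch` is a hypothetical name, and the `n> ` prefix only imitates the subtask prefix that print() adds.

```java
import java.util.ArrayList;
import java.util.List;

final class BroadcastSketch {

    // Broadcast partitioning: every record is delivered to every downstream subtask.
    static List<String> broadcast(List<String> records, int parallelism) {
        List<String> received = new ArrayList<>();
        for (String record : records) {
            for (int subtask = 1; subtask <= parallelism; subtask++) {
                received.add(subtask + "> " + record);
            }
        }
        return received;
    }

    public static void main(String[] args) {
        // One input record, four subtasks: four lines of output.
        broadcast(List.of("name"), 4).forEach(System.out::println);
    }
}
```

The output size is always the number of records multiplied by the downstream parallelism, which is why this mode only makes sense when every subtask genuinely needs every record.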