DataStream Transformations Basics
Map
Definition
Transformation | Description |
---|---|
DataStream → DataStream | Applies a Map transformation on a DataStream. The transformation calls a MapFunction for each element of the DataStream. Each MapFunction call returns exactly one element. The user can also extend RichMapFunction to gain access to other features provided by the RichFunction interface. |
Notes
The map method transforms each element individually; input and output elements are in a one-to-one relationship.
Example
Code
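import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class MapDemo {
    public static void main(String[] args) throws Exception {
        final StreamExecutionEnvironment env = StreamExecutionEnvironment.createLocalEnvironment();
        env.setParallelism(1);
        // Add 1 to every element: one output element per input element.
        DataStream<Integer> myInts = env.fromElements(1, 2, 3, 4, 5)
                .map(x -> x + 1);
        myInts.print("map");
        env.execute("Map Demo");
    }
}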
Output
map> 2
map> 3
map> 4
map> 5
map> 6
Notes
map adds 1 to each element before it is printed.
FlatMap
Definition
Transformation | Description |
---|---|
DataStream → DataStream | Applies a FlatMap transformation on a DataStream. The transformation calls a FlatMapFunction for each element of the DataStream. Each FlatMapFunction call can return any number of elements, including none. The user can also extend RichFlatMapFunction to gain access to other features provided by the RichFunction interface. |
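Notes
The flatMap method transforms each element into n output elements, where n is greater than or equal to 0. It can be used to split one line of data into multiple lines.
Example
Code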
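import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;

public class FlatMapDemo {
    public static void main(String[] args) throws Exception {
        final StreamExecutionEnvironment env = StreamExecutionEnvironment.createLocalEnvironment();
        env.setParallelism(1);
        // The source holds a single String element: "1, 2, 3, 4, 5".
        DataStream<String> myStr = env.fromElements("1, 2, 3, 4, 5")
                .flatMap(new FlatMapFunction<String, String>() {
                    @Override
                    public void flatMap(String value, Collector<String> out) throws Exception {
                        // Emit one output element per comma-separated token.
                        for (String word : value.split(",")) {
                            out.collect(word.trim());
                        }
                    }
                });
        myStr.print("flatMap");
        env.execute("FlatMap Demo");
    }
}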
Output
flatMap> 1
flatMap> 2
flatMap> 3
flatMap> 4
flatMap> 5
Notes
flatMap splits one line of data on the delimiter and emits each piece as a separate line.
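The case n = 0 is easy to overlook, so here is a minimal plain-Java sketch of it. It does not use Flink: a `java.util.function.Consumer<String>` stands in for the Collector, and the class and method names (`FlatMapSketch`, `splitLine`, `flatMapAll`) are hypothetical. Unlike the demo above, it skips blank tokens, so an empty input line produces no output at all.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;

public class FlatMapSketch {
    // Splits one line on commas; 'out' plays the role of Flink's Collector (assumption).
    public static void splitLine(String line, Consumer<String> out) {
        for (String token : line.split(",")) {
            String trimmed = token.trim();
            if (!trimmed.isEmpty()) {
                // Blank tokens are dropped, so a line may emit zero elements.
                out.accept(trimmed);
            }
        }
    }

    // Applies splitLine to every input line and gathers everything that was emitted.
    public static List<String> flatMapAll(List<String> lines) {
        List<String> result = new ArrayList<>();
        for (String line : lines) {
            splitLine(line, result::add);
        }
        return result;
    }

    public static void main(String[] args) {
        // Three input lines: the first emits three elements, the second none, the third one.
        System.out.println(flatMapAll(Arrays.asList("1, 2, 3", "", "4")));
    }
}
```

In a Flink job the same body would sit inside `FlatMapFunction.flatMap`, with `out.collect(...)` in place of `out.accept(...)`.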
Filter
Definition
Transformation | Description |
---|---|
DataStream → DataStream | Applies a Filter transformation on a DataStream. The transformation calls a FilterFunction for each element of the DataStream and retains only those elements for which the function returns true. Elements for which the function returns false are filtered out. The user can also extend RichFilterFunction to gain access to other features provided by the RichFunction interface. |
Notes
The filter method keeps only the elements that satisfy a predicate.
Example
Code
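import org.apache.flink.api.common.functions.FilterFunction;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class FilterDemo {
    public static void main(String[] args) throws Exception {
        final StreamExecutionEnvironment env = StreamExecutionEnvironment.createLocalEnvironment();
        env.setParallelism(1);
        DataStream<Integer> myInts = env.fromElements(1, 2, 3, 4, 5)
                .filter(new FilterFunction<Integer>() {
                    @Override
                    public boolean filter(Integer value) throws Exception {
                        // Keep only elements greater than 3.
                        return value > 3;
                    }
                });
        myInts.print("filter");
        env.execute("Filter Demo");
    }
}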
Output
filter> 4
filter> 5
Notes
With filter, only the elements whose value is greater than 3 are output.
KeyBy
Definition
Transformation | Description |
---|---|
DataStream → KeyedStream | Logically partitions a stream into disjoint partitions. All records with the same key are assigned to the same partition. Internally, keyBy() is implemented with hash partitioning. |
Notes
The keyBy method places records with the same key into the same logical partition.
Example
Code
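import org.apache.flink.api.java.tuple.Tuple;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class KeyByDemo {
    public static void main(String[] args) throws Exception {
        final StreamExecutionEnvironment env = StreamExecutionEnvironment.createLocalEnvironment();
        env.setParallelism(2);
        DataStream<Tuple2<String, Integer>> input = env.fromElements(
                new Tuple2<>("foo", 1), new Tuple2<>("foo", 2),
                new Tuple2<>("bar", 3), new Tuple2<>("baz", 4),
                new Tuple2<>("bar", 4), new Tuple2<>("baz", 5));
        input.print("Input data");
        // Key by tuple field 0. The positional keyBy(int) is deprecated in newer
        // Flink releases in favor of a KeySelector such as t -> t.f0.
        KeyedStream<Tuple2<String, Integer>, Tuple> keyed = input.keyBy(0);
        keyed.print("Keyed data");
        env.execute("KeyBy Demo");
    }
}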
Output
Keyed data:1> (foo,1)
Input data:2> (foo,1)
Keyed data:2> (bar,3)
Input data:1> (foo,2)
Keyed data:2> (baz,4)
Input data:2> (bar,3)
Keyed data:1> (foo,2)
Keyed data:2> (bar,4)
Input data:1> (baz,4)
Input data:2> (bar,4)
Keyed data:2> (baz,5)
Input data:1> (baz,5)
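Notes
In the "Input data" lines, records with the same key are not in the same partition: for example, (foo,1) is printed by subtask 2 while (foo,2) is printed by subtask 1. After keyBy, every record with key foo is in logical partition 1, and every record with key bar or baz is in logical partition 2.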
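The idea behind hash partitioning can be sketched in plain Java. This is a deliberately simplified illustration, not Flink's implementation: Flink's actual assignment is more involved, and the class and method names here (`KeyPartitionSketch`, `partitionFor`, `assign`) are hypothetical. The sketch only shows the two properties visible in the output above: the same key always maps to the same partition, and different keys may share one.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class KeyPartitionSketch {
    // Simplified stand-in for hash partitioning (assumption: plain hashCode modulo
    // parallelism). Flink's real key assignment is not reproduced here.
    public static int partitionFor(String key, int parallelism) {
        return Math.floorMod(key.hashCode(), parallelism);
    }

    // Groups the given keys by the partition index they are assigned to.
    public static Map<Integer, List<String>> assign(List<String> keys, int parallelism) {
        Map<Integer, List<String>> byPartition = new LinkedHashMap<>();
        for (String key : keys) {
            int partition = partitionFor(key, parallelism);
            byPartition.computeIfAbsent(partition, p -> new ArrayList<>()).add(key);
        }
        return byPartition;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("foo", "foo", "bar", "baz", "bar", "baz");
        // Every occurrence of a key ends up in the same partition.
        System.out.println(assign(keys, 2));
    }
}
```

Because the partition index depends only on the key, no coordination between subtasks is needed to decide where a record goes.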
Reduce
Definition
Transformation | Description |
---|---|
KeyedStream → DataStream | Applies a reduce transformation on the data stream grouped by the given key position. The ReduceFunction will receive input values based on the key value. Only input values with the same key will go to the same reducer. |
Notes
reduce operates on a KeyedStream and combines the records that share the same key.
Example
Code
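import org.apache.flink.api.common.functions.ReduceFunction;
import org.apache.flink.api.java.tuple.Tuple;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class ReduceDemo {
    public static void main(String[] args) throws Exception {
        final StreamExecutionEnvironment env = StreamExecutionEnvironment.createLocalEnvironment();
        env.setParallelism(2);
        DataStream<Tuple2<String, Integer>> input = env.fromElements(
                new Tuple2<>("foo", 1), new Tuple2<>("foo", 2),
                new Tuple2<>("bar", 3), new Tuple2<>("baz", 4),
                new Tuple2<>("bar", 4), new Tuple2<>("baz", 5));
        KeyedStream<Tuple2<String, Integer>, Tuple> keyed = input.keyBy(0);
        DataStream<Tuple2<String, Integer>> out = keyed.reduce(new ReduceFunction<Tuple2<String, Integer>>() {
            @Override
            public Tuple2<String, Integer> reduce(Tuple2<String, Integer> value1,
                                                  Tuple2<String, Integer> value2) throws Exception {
                // value1 is the running result for this key, value2 the new record.
                return new Tuple2<>(value1.f0, value1.f1 + value2.f1);
            }
        });
        out.print("reduce");
        env.execute("Reduce Demo");
    }
}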
Output
reduce:2> (bar,3)
reduce:1> (foo,1)
reduce:2> (baz,4)
reduce:2> (bar,7)
reduce:2> (baz,9)
reduce:1> (foo,3)
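Notes
The records are first grouped by key, and then the second field is summed per key. A running total is emitted for every incoming record: for example, foo produces (foo,1) and then (foo,3).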
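The rolling behavior, one output per input record with the running total carried per key, can be reproduced with a small plain-Java simulation. It does not use Flink, and the names (`RollingReduceSketch`, `rollingSum`) are hypothetical. It processes records in a single thread, so its output is ordered by input rather than interleaved across subtasks as in the Flink run above.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class RollingReduceSketch {
    // Simulates a per-key rolling sum: keys[i] and values[i] form the i-th record.
    public static List<String> rollingSum(String[] keys, int[] values) {
        Map<String, Integer> state = new HashMap<>();  // one running total per key
        List<String> emitted = new ArrayList<>();
        for (int i = 0; i < keys.length; i++) {
            // The first record of a key becomes the initial total;
            // later records are added to the stored total.
            int total = state.merge(keys[i], values[i], Integer::sum);
            emitted.add("(" + keys[i] + "," + total + ")");
        }
        return emitted;
    }

    public static void main(String[] args) {
        String[] keys = {"foo", "foo", "bar", "baz", "bar", "baz"};
        int[] values = {1, 2, 3, 4, 4, 5};
        // prints [(foo,1), (foo,3), (bar,3), (baz,4), (bar,7), (baz,9)]
        System.out.println(rollingSum(keys, values));
    }
}
```

The emitted values match the six lines of the demo output; only their order differs.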
Aggregations
Definition
Transformation | Description |
---|---|
KeyedStream → DataStream | Rolling aggregations on a keyed data stream. The difference between min and minBy is that min returns the minimum value, whereas minBy returns the element that has the minimum value in this field (same for max and maxBy). |
Notes
Rolling aggregations such as sum, min, and max are computed per key.
Example
Code
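import org.apache.flink.api.java.tuple.Tuple;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class AggregationsDemo {
    public static void main(String[] args) throws Exception {
        final StreamExecutionEnvironment env = StreamExecutionEnvironment.createLocalEnvironment();
        env.setParallelism(1);
        // Each record is (key, value, timestamp in milliseconds).
        DataStream<Tuple3<String, Integer, Long>> input = env.fromElements(
                new Tuple3<>("foo", 8, 1573441871000L), new Tuple3<>("foo", 2, 1573441872000L),
                new Tuple3<>("foo", 9, 1573441873000L), new Tuple3<>("foo", 4, 1573441874000L),
                new Tuple3<>("foo", 17, 1573441875000L), new Tuple3<>("foo", 5, 1573441876000L),
                new Tuple3<>("foo", 17, 1573441877000L));
        KeyedStream<Tuple3<String, Integer, Long>, Tuple> keyed = input.keyBy(0);
        //keyed.sum(1).print("sum");
        //keyed.min(1).print("min");
        //keyed.minBy(1).print("minBy");
        keyed.max(1).print("max");
        // Second argument: on equal values, true keeps the first record, false the latest.
        keyed.maxBy(1, true).print("maxBy");
        keyed.maxBy(1, false).print("maxBy");
        env.execute("Aggregations Demo");
    }
}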
Output
Output of max:
max> (foo,8,1573441871000)
max> (foo,8,1573441871000)
max> (foo,9,1573441871000)
max> (foo,9,1573441871000)
max> (foo,17,1573441871000)
max> (foo,17,1573441871000)
max> (foo,17,1573441871000)
Output of maxBy with the second argument set to true:
maxBy> (foo,8,1573441871000)
maxBy> (foo,8,1573441871000)
maxBy> (foo,9,1573441873000)
maxBy> (foo,9,1573441873000)
maxBy> (foo,17,1573441875000)
maxBy> (foo,17,1573441875000)
maxBy> (foo,17,1573441875000)
Output of maxBy with the second argument set to false:
maxBy> (foo,8,1573441871000)
maxBy> (foo,8,1573441871000)
maxBy> (foo,9,1573441873000)
maxBy> (foo,9,1573441873000)
maxBy> (foo,17,1573441875000)
maxBy> (foo,17,1573441875000)
maxBy> (foo,17,1573441877000)
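Notes
- max returns only the maximum of the compared field. In the max output, the third field of the tuple does not change as the maximum changes; it keeps the value from the first record, so max effectively updates only the aggregated field.
- maxBy returns the whole record that contains the maximum value. The second argument decides what happens when the compared values are equal: true keeps the first record with that value, so later records do not overwrite it; false lets a later record with the same value replace the earlier one.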
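The difference between the three variants can be replayed with a plain-Java simulation of the behavior seen in the output above. This is a sketch that mimics the observed results, not Flink's code; the names (`RollingMaxSketch`, `max`, `maxBy`) are hypothetical, and each record is reduced to a `{value, timestamp}` pair because every record in the demo has the key foo.

```java
import java.util.ArrayList;
import java.util.List;

public class RollingMaxSketch {
    // max: keeps the first record and only overwrites the compared field.
    public static List<String> max(long[][] records) {
        List<String> out = new ArrayList<>();
        long[] current = null;
        for (long[] r : records) {
            if (current == null) {
                current = r.clone();      // copy so the input is not modified
            } else if (r[0] >= current[0]) {
                current[0] = r[0];        // the timestamp of the first record is kept
            }
            out.add(format(current));
        }
        return out;
    }

    // maxBy: keeps the whole record holding the maximum; 'first' decides ties.
    public static List<String> maxBy(long[][] records, boolean first) {
        List<String> out = new ArrayList<>();
        long[] current = null;
        for (long[] r : records) {
            boolean larger = current != null && r[0] > current[0];
            boolean tie = current != null && r[0] == current[0];
            if (current == null || larger || (tie && !first)) {
                current = r;              // replace the whole record
            }
            out.add(format(current));
        }
        return out;
    }

    private static String format(long[] r) {
        return "(foo," + r[0] + "," + r[1] + ")";
    }

    public static void main(String[] args) {
        long[][] records = {
            {8, 1573441871000L}, {2, 1573441872000L}, {9, 1573441873000L},
            {4, 1573441874000L}, {17, 1573441875000L}, {5, 1573441876000L},
            {17, 1573441877000L}
        };
        System.out.println(max(records));
        System.out.println(maxBy(records, true));
        System.out.println(maxBy(records, false));
    }
}
```

With the demo's seven records, the last emitted element carries the timestamp 1573441871000 for max, 1573441875000 for maxBy with true, and 1573441877000 for maxBy with false, matching the three output listings.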