- 1.Map[DataStream -> DataStream]
Applies a user-defined MapFunction to each element of a DataStream[T], producing a new DataStream. The element type may change in the process; the operator is commonly used to clean and transform the records of a data set.
import org.apache.flink.api.common.functions.MapFunction
import org.apache.flink.streaming.api.scala._

object SourceTest {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    val dataStream = env.fromElements(("a", 3), ("d", 4), ("c", 2), ("c", 5), ("a", 5))
    // map with a function literal: increment the second field of every tuple
    val mapStream: DataStream[(String, Int)] = dataStream.map(t => (t._1, t._2 + 1))
    // the same transformation with an explicit MapFunction
    val mapStream2: DataStream[(String, Int)] = mapStream.map(new MapFunction[(String, Int), (String, Int)] {
      override def map(t: (String, Int)): (String, Int) = (t._1, t._2 + 1)
    })
    mapStream2.print()
    env.execute("SourceTest")
  }
}
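Since map may change the element type, here is a minimal sketch of a type-changing map (the helper name testMapType is hypothetical), turning the tuples into plain strings:
// map can also change the element type: (String, Int) -> String
def testMapType(env: StreamExecutionEnvironment): Unit = {
  val dataStream = env.fromElements(("a", 3), ("d", 4), ("c", 2))
  val labelStream: DataStream[String] = dataStream.map(t => s"${t._1}=${t._2}")
  labelStream.print()
}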
- 2.FlatMap[DataStream -> DataStream]
Processes each input element and emits zero or more output elements; here, a line of text is split into words.
object SourceTest {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    testFlatMap(env)
    env.execute("SourceTest")
  }

  def testFlatMap(env: StreamExecutionEnvironment): Unit = {
    val dataStream = env.fromElements("dfsdf sdf", "34343 dds")
    // split every line on spaces; each word becomes one output element
    val flatMapStream: DataStream[String] = dataStream.flatMap(t => t.split(" "))
    flatMapStream.print()
  }
}
The output will be:
1> 34343
4> dfsdf
1> dds
4> sdf
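As with map, the lambda can be replaced by an explicit function class; a minimal sketch using FlatMapFunction, whose Collector makes the "zero or more outputs" contract explicit (the helper name testFlatMapFunction is hypothetical):
def testFlatMapFunction(env: StreamExecutionEnvironment): Unit = {
  import org.apache.flink.api.common.functions.FlatMapFunction
  import org.apache.flink.util.Collector
  val dataStream = env.fromElements("dfsdf sdf", "34343 dds")
  val flatMapStream: DataStream[String] = dataStream.flatMap(new FlatMapFunction[String, String] {
    override def flatMap(line: String, out: Collector[String]): Unit = {
      // one collect call per word; calling collect zero times is also legal
      line.split(" ").foreach(out.collect)
    }
  })
  flatMapStream.print()
}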
- 3.Filter[DataStream -> DataStream]
This operator filters the input data set against a predicate: elements that satisfy the condition are forwarded, and elements that do not are dropped.
def testFilter(env: StreamExecutionEnvironment): Unit = {
  val dataStream = env.fromElements(1, 2, 3, 4, 5)
  // keep only the even numbers
  val filterStream = dataStream.filter(_ % 2 == 0)
  filterStream.print()
}
Here only the numbers that are divisible by 2 pass the filter, so the output is:
3> 4
1> 2
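As in the map section, the predicate can also be supplied as an explicit function class; a minimal sketch reusing dataStream from testFilter above (requires import org.apache.flink.api.common.functions.FilterFunction):
val filterStream = dataStream.filter(new FilterFunction[Int] {
  // an element is kept iff filter returns true
  override def filter(value: Int): Boolean = value % 2 == 0
})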
- 4.KeyBy[DataStream -> KeyedStream]
This operator turns the input DataStream[T] into a KeyedStream[T] according to the specified key, i.e. it partitions the data set so that all records with the same key land in the same partition. Roughly speaking, it plays the role of GROUP BY in SQL.
def testKeyby(env: StreamExecutionEnvironment): Unit = {
  val dataStream = env.fromElements(("a", 1), ("a", 2), ("b", 3), ("c", 3))
  val keyedStream = dataStream.keyBy(0) // keyBy alone shows no visible effect; pair it with an aggregation
  val sumStream: DataStream[(String, Int)] = keyedStream.sum(1) // running sum of the second field per key
  sumStream.print()
}
The output is:
3> (a,1)
3> (a,3)
2> (c,3)
1> (b,3)
As expected, the two records with key a are accumulated: 1, then 1 + 2 = 3.
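Positional keys such as keyBy(0) are deprecated in later Flink releases; a minimal sketch of the same grouping with a type-safe key-selector function (the helper name testKeybySelector is hypothetical):
def testKeybySelector(env: StreamExecutionEnvironment): Unit = {
  val dataStream = env.fromElements(("a", 1), ("a", 2), ("b", 3), ("c", 3))
  // key by the first tuple field instead of a positional index
  val keyedStream = dataStream.keyBy(_._1)
  keyedStream.sum(1).print()
}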
- 5.Reduce[KeyedStream -> DataStream]
This operator follows essentially the same principle as the reduce in MapReduce: it performs a rolling aggregation over the input KeyedStream with a user-defined ReduceFunction. The ReduceFunction must be associative and commutative.
// key by the first field, then keep a running sum per partition
def testReduce(env: StreamExecutionEnvironment): Unit = {
  import org.apache.flink.api.common.functions.ReduceFunction
  val dataStream = env.fromElements(("a", 1), ("b", 2), ("b", 3), ("c", 1))
  val keyedStream = dataStream.keyBy(0)
  val reduceStream = keyedStream.reduce(new ReduceFunction[(String, Int)] {
    override def reduce(t: (String, Int), t1: (String, Int)): (String, Int) = {
      (t._1, t._2 + t1._2)
    }
  })
  reduceStream.print()
}
The output is:
2> (c,1)
1> (b,2)
1> (b,5)
3> (a,1)
In the output, the two records with key b have been folded into 5.
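The same aggregation can be written more concisely with a function literal; a minimal sketch reusing keyedStream from the example above:
// reduce as a lambda: pairwise sum of the second field within each key
val reduceStream = keyedStream.reduce((t1, t2) => (t1._1, t1._2 + t2._2))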
- 6.Union[DataStream -> DataStream]
Merges two or more input data sets into one. All inputs must share the same element type, and the output keeps that type.
def testUnion(env: StreamExecutionEnvironment): Unit = {
  val dataStream1: DataStream[(String, Int)] = env.fromElements(("a", 3), ("d", 4), ("c", 2), ("c", 5), ("a", 5))
  val dataStream2: DataStream[(String, Int)] = env.fromElements(("d", 1), ("s", 2), ("a", 4), ("e", 5), ("a", 6))
  val dataStream3: DataStream[(String, Int)] = env.fromElements(("a", 2), ("d", 1), ("s", 2), ("c", 3), ("b", 1))
  val unionStream = dataStream1.union(dataStream2)
  val allUnionStream = dataStream1.union(dataStream2, dataStream3)
  unionStream.print()
  // allUnionStream.print()
}
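Worth noting: union does not deduplicate. Unioning a stream with itself emits every element twice:
// self-union doubles the stream rather than deduplicating it
val doubled = dataStream1.union(dataStream1)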
- 7.Connect, CoMap, CoFlatMap[DataStream -> DataStream]
Unlike union, the connect operator merges two data sets whose element types may differ, and each side keeps its original type after the merge. Connected streams also allow state to be shared, meaning the two inputs can read and manipulate each other's state.
def testConnect(env: StreamExecutionEnvironment): Unit = {
  import org.apache.flink.streaming.api.functions.co.CoMapFunction
  val dataStream1: DataStream[(String, Int)] = env.fromElements(("a", 3), ("d", 4), ("c", 2), ("c", 5), ("a", 5))
  val dataStream2: DataStream[Int] = env.fromElements(1, 2, 3, 4, 5, 6)
  val connectedStream: ConnectedStreams[(String, Int), Int] = dataStream1.connect(dataStream2)
  // ConnectedStreams cannot be printed directly; it must first be mapped back to a DataStream.
  // map1 and map2 are invoked alternately, possibly from different threads, and their results
  // are merged into the target stream according to the definitions below.
  val resultStream = connectedStream.map(new CoMapFunction[(String, Int), Int, (Int, String)] {
    override def map1(in1: (String, Int)): (Int, String) = (in1._2, in1._1)
    override def map2(in2: Int): (Int, String) = (in2, "default")
  })
  resultStream.print()
}
The output is:
4> (2,default)
4> (6,default)
3> (1,default)
2> (4,default)
3> (5,default)
1> (3,default)
4> (3,a)
2> (2,c)
1> (4,d)
3> (5,c)
4> (5,a)
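The heading also names CoFlatMap, the flatMap counterpart of CoMap, in which each side may emit zero or more elements through a Collector. A minimal sketch reusing connectedStream from testConnect above (requires import org.apache.flink.streaming.api.functions.co.CoFlatMapFunction and org.apache.flink.util.Collector):
val coFlatMapped: DataStream[String] = connectedStream.flatMap(
  new CoFlatMapFunction[(String, Int), Int, String] {
    override def flatMap1(in1: (String, Int), out: Collector[String]): Unit = {
      // emit the key once per unit of its count
      (1 to in1._2).foreach(_ => out.collect(in1._1))
    }
    override def flatMap2(in2: Int, out: Collector[String]): Unit = out.collect(in2.toString)
  })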
- 8.Split[DataStream -> SplitStream]
Split divides a DataStream into two or more logical streams by tagging each element according to a condition.
def testSplit(env: StreamExecutionEnvironment): Unit = {
  val dataStream1: DataStream[(String, Int)] = env.fromElements(("a", 3), ("d", 4), ("c", 2), ("c", 5), ("a", 5))
  // tag every element as "even" or "odd" depending on its second field
  val splitStream: SplitStream[(String, Int)] = dataStream1.split(t => if (t._2 % 2 == 0) Seq("even") else Seq("odd"))
}
To print a SplitStream, it first has to be converted back into a DataStream via the select method:
def testSplit(env: StreamExecutionEnvironment): Unit = {
  val dataStream1: DataStream[(String, Int)] = env.fromElements(("a", 3), ("d", 4), ("c", 2), ("c", 5), ("a", 5))
  val splitStream: SplitStream[(String, Int)] = dataStream1.split(t => if (t._2 % 2 == 0) Seq("even") else Seq("odd"))
  // select pulls the named sub-streams back out as ordinary DataStreams
  val evenStream: DataStream[(String, Int)] = splitStream.select("even")
  val oddStream: DataStream[(String, Int)] = splitStream.select("odd")
  evenStream.print("even")
  oddStream.print("odd")
}
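split/select was deprecated in later Flink releases in favor of side outputs. A minimal sketch of the equivalent logic with ProcessFunction and OutputTag, assuming a Flink version where these are available (requires import org.apache.flink.streaming.api.functions.ProcessFunction and org.apache.flink.util.Collector):
// route even values to a side output and odd values to the main output
val evenTag = OutputTag[(String, Int)]("even")
val oddStream = dataStream1.process(new ProcessFunction[(String, Int), (String, Int)] {
  override def processElement(value: (String, Int),
                              ctx: ProcessFunction[(String, Int), (String, Int)]#Context,
                              out: Collector[(String, Int)]): Unit = {
    if (value._2 % 2 == 0) ctx.output(evenTag, value) // side output: even
    else out.collect(value)                           // main output: odd
  }
})
val evenStream = oddStream.getSideOutput(evenTag)
evenStream.print("even")
oddStream.print("odd")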