A Collection of Flink DataStream Transformation Operations


1. Single-DataStream Operations

  • 1.Map[DataStream -> DataStream]
    Applies a user-defined MapFunction to every element of a DataStream[T], producing a new DataStream; the element type may change. Commonly used for cleaning and converting the records in a dataset.
import org.apache.flink.api.common.functions.MapFunction
import org.apache.flink.api.scala._
import org.apache.flink.streaming.api.scala.{DataStream, StreamExecutionEnvironment}
object SourceTest {

  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    val dataStream = env.fromElements(("a",3),("d",4),("c",2),("c",5),("a",5))
    // map with a Scala lambda
    val mapStream: DataStream[(String,Int)] = dataStream.map(t => (t._1, t._2 + 1))
    // map with an explicit MapFunction (the returned stream is discarded here)
    mapStream.map(new MapFunction[(String,Int),(String,Int)] {
      override def map(t: (String, Int)): (String, Int) = {
        (t._1, t._2 + 1)
      }
    })
    mapStream.print()

    env.execute("SourceTest")
  }

}


  • 2.FlatMap[DataStream -> DataStream]
    Processes each input element and emits zero, one, or more output elements (an explicit FlatMapFunction variant is sketched after the output below).
object SourceTest {

  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    testFlatMap(env)

    env.execute("SourceTest")
  }

  def testFlatMap(env: StreamExecutionEnvironment): Unit = {
    val dataStream = env.fromElements("dfsdf sdf","34343 dds")
    val flatMapStream: DataStream[String] = dataStream.flatMap(t => t.split(" "))
    flatMapStream.print()
  }

}

The output will be:
1> 34343
4> dfsdf
1> dds
4> sdf
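For symmetry with the Map example, flatMap can also be written with an explicit FlatMapFunction and a Collector. A minimal sketch (assuming the same imports as the Map example, plus the two shown here; the splitting logic mirrors the lambda version):

import org.apache.flink.api.common.functions.FlatMapFunction
import org.apache.flink.util.Collector

dataStream.flatMap(new FlatMapFunction[String, String] {
  override def flatMap(line: String, out: Collector[String]): Unit = {
    line.split(" ").foreach(out.collect)  // emit each token as a separate element
  }
})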

  • 3.Filter[DataStream -> DataStream]
    This operator screens the input dataset against a predicate: elements that satisfy the condition are emitted, and the rest are filtered out.
def testFilter(env: StreamExecutionEnvironment): Unit = {
    val dataStream = env.fromElements(1,2,3,4,5)
    val filterStream = dataStream.filter(_ % 2 == 0)  // keep only even numbers
    filterStream.print()
}
  
The filter keeps the numbers that are divisible by 2; the output is:
3> 4
1> 2
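The same predicate can also be supplied as an explicit FilterFunction; a minimal sketch:

import org.apache.flink.api.common.functions.FilterFunction

dataStream.filter(new FilterFunction[Int] {
  override def filter(value: Int): Boolean = value % 2 == 0  // returning true keeps the element
})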

  • 4.KeyBy[DataStream -> KeyedStream]
    This operator converts the input DataStream[T] into a KeyedStream[T] by the specified key; that is, it partitions the dataset so that records with the same key land in the same partition. Put simply, it is the counterpart of GROUP BY in SQL.
def testKeyby(env: StreamExecutionEnvironment): Unit = {
    val dataStream = env.fromElements(("a",1),("a",2),("b",3),("c",3))
    val keyedStream = dataStream.keyBy(0)  // keyBy alone shows no visible effect; pair it with an aggregation
    val sumStream: DataStream[(String,Int)] = keyedStream.sum(1)  // sum the second field
    sumStream.print()
}
 
Output:
3> (a,1)
3> (a,3)
2> (c,3)
1> (b,3)
Clearly, the values for key a were accumulated into 3 after partitioning.
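Note that position-based keyBy(0) is deprecated in more recent Flink versions; a key-selector function is the preferred form. A one-line sketch (the variable name is illustrative):

val keyedByField = dataStream.keyBy(_._1)  // key by the first tuple field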

  • 5.Reduce[KeyedStream -> DataStream]
    This operator follows essentially the same principle as reduce in MapReduce: it rolls up the input KeyedStream using a user-defined ReduceFunction. The ReduceFunction must be associative and commutative.
// key by the first field, then compute a running total per key
// (requires: import org.apache.flink.api.common.functions.ReduceFunction)
def testReduce(env: StreamExecutionEnvironment): Unit = {
    val dataStream = env.fromElements(("a", 1), ("b",2), ("b",3),("c",1))
    val keyedStream = dataStream.keyBy(0)
    val reduceStream = keyedStream.reduce(new ReduceFunction[(String, Int)] {
      override def reduce(t: (String, Int), t1: (String, Int)): (String, Int) = {
        (t._1, t._2 + t1._2)
      }
    })
    reduceStream.print()
}

Output:
2> (c,1)
1> (b,2)
1> (b,5)
3> (a,1)
In the output, b has accumulated to 5.
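The same ReduceFunction can be written more concisely as a Scala lambda; a sketch equivalent to the anonymous class above:

val reduceStream = keyedStream.reduce((a, b) => (a._1, a._2 + b._2))  // running sum per key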

  • 6.Union[DataStream -> DataStream]
    Merges two or more input datasets into a single one. All inputs must share the same type, and the output keeps that type.
def testUnion(env: StreamExecutionEnvironment): Unit = {
    val dataStream1: DataStream[(String,Int)] = env.fromElements(("a",3),("d",4),("c",2),("c",5),("a",5))
    val dataStream2: DataStream[(String,Int)] = env.fromElements(("d",1),("s",2),("a",4),("e",5),("a",6))
    val dataStream3: DataStream[(String,Int)] = env.fromElements(("a",2),("d",1),("s",2),("c",3),("b",1))

    val unionStream = dataStream1.union(dataStream2)
    // union also accepts several streams at once
    val allUnionStream = dataStream1.union(dataStream2, dataStream3)

    unionStream.print()
//    allUnionStream.print()
  }


  • 7.Connect, CoMap, CoFlatMap[DataStream -> DataStream]
    The connect operator combines two streams of (possibly) different data types, and each input keeps its original type after the merge. A connected stream also allows state to be shared: the functions applied to the two streams can operate on and inspect state built from each other. A CoFlatMap variant is sketched after the output below.
def testConnect(env: StreamExecutionEnvironment): Unit = {
    val dataStream1: DataStream[(String,Int)] = env.fromElements(("a",3),("d",4),("c",2),("c",5),("a",5))
    val dataStream2: DataStream[Int] = env.fromElements(1,2,3,4,5,6)

    val connectedStream: ConnectedStreams[(String,Int), Int] = dataStream1.connect(dataStream2)

    // A ConnectedStreams cannot be printed directly; convert it to a DataStream first.
    // map1 and map2 run interleaved (possibly on different threads) over the two inputs,
    // merging both into the target type.
    val resultStream = connectedStream.map(new CoMapFunction[(String,Int), Int, (Int,String)] {
      override def map1(in1: (String, Int)): (Int, String) = {
        (in1._2, in1._1)
      }

      override def map2(in2: Int): (Int, String) = {
        (in2, "default")
      }
    })

    resultStream.print()
  }
  
Output:
4> (2,default)
4> (6,default)
3> (1,default)
2> (4,default)
3> (5,default)
1> (3,default)
4> (3,a)
2> (2,c)
1> (4,d)
3> (5,c)
4> (5,a)
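The heading also lists CoFlatMap, which the example above does not cover. A minimal CoFlatMapFunction sketch over the same connectedStream (the emission logic here is purely illustrative): unlike CoMap, each side may emit zero or more elements per input.

import org.apache.flink.streaming.api.functions.co.CoFlatMapFunction
import org.apache.flink.util.Collector

val coFlatMapped: DataStream[String] = connectedStream.flatMap(
  new CoFlatMapFunction[(String, Int), Int, String] {
    override def flatMap1(in1: (String, Int), out: Collector[String]): Unit = {
      for (_ <- 1 to in1._2) out.collect(in1._1)  // repeat the key, count times
    }
    override def flatMap2(in2: Int, out: Collector[String]): Unit = {
      if (in2 % 2 == 0) out.collect(in2.toString)  // pass through even numbers only
    }
  })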


  • 8.Split[DataStream -> SplitStream]
    split divides a DataStream according to a condition, labeling each element so that the stream can then be separated into two (or more) streams.
def testSplit(env: StreamExecutionEnvironment): Unit = {
    val dataStream1: DataStream[(String,Int)] = env.fromElements(("a",3),("d",4),("c",2),("c",5),("a",5))
    // split only labels the elements; nothing is emitted until select(...) is applied
    val splitedStream: SplitStream[(String,Int)] = dataStream1.split(t => if (t._2 % 2 == 0) Seq("even") else Seq("odd"))
  }
  


To print a SplitStream, convert it back into DataStreams using the select method:

  def testSplit(env: StreamExecutionEnvironment): Unit = {
    val dataStream1: DataStream[(String,Int)] = env.fromElements(("a",3),("d",4),("c",2),("c",5),("a",5))
    val splitedStream: SplitStream[(String,Int)] = dataStream1.split(t => if (t._2 % 2 == 0) Seq("even") else Seq("odd"))

    val evenStream: DataStream[(String,Int)] = splitedStream.select("even")
    val oddStream: DataStream[(String,Int)] = splitedStream.select("odd")

    evenStream.print("even")
    oddStream.print("odd")
  }
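Here ("d",4) and ("c",2) print with the even prefix, and the remaining elements with the odd prefix. Note that split/select is deprecated in more recent Flink releases; side outputs via a ProcessFunction are the suggested replacement. A minimal sketch (the tag and variable names are illustrative):

import org.apache.flink.streaming.api.functions.ProcessFunction
import org.apache.flink.streaming.api.scala.OutputTag
import org.apache.flink.util.Collector

val evenTag = OutputTag[(String, Int)]("even")

// Odd elements go to the main output; even elements go to the side output.
val oddMainStream = dataStream1.process(
  new ProcessFunction[(String, Int), (String, Int)] {
    override def processElement(value: (String, Int),
                                ctx: ProcessFunction[(String, Int), (String, Int)]#Context,
                                out: Collector[(String, Int)]): Unit = {
      if (value._2 % 2 == 0) ctx.output(evenTag, value)
      else out.collect(value)
    }
  })

val evenSideStream = oddMainStream.getSideOutput(evenTag)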
