DataStream -> DataStream
Takes one element and produces one element. A map function that doubles the values of the input stream:
dataStream.map { x => x * 2 }
Takes one element and produces zero, one, or more elements. A flatMap function that splits sentences into words:
dataStream.flatMap { str => str.split(" ") }
Evaluates a boolean function for each element and retains those for which the function returns true. A filter that filters out zero values:
dataStream.filter { _ != 0 }
Union of two or more data streams creating a new stream containing all the elements from all the streams. Note: If you union a data stream with itself you will get each element twice in the resulting stream.
dataStream.union(otherStream1, otherStream2, ...)
“Connects” two data streams retaining their types, allowing for shared state between the two streams.
val someStream: DataStream[Int] = ...
val otherStream: DataStream[String] = ...
val connectedStreams = someStream.connect(otherStream)
Similar to map and flatMap on a connected data stream
connectedStreams.map(
  (_: Int) => true,
  (_: String) => false
)
// flatMap functions must return a collection of results, so wrap each value in a Seq
connectedStreams.flatMap(
  (_: Int) => Seq(true),
  (_: String) => Seq(false)
)
Example
import org.apache.flink.streaming.api.scala._

val fsEnv = StreamExecutionEnvironment.getExecutionEnvironment
val s1 = fsEnv.socketTextStream("Spark", 9999)
val s2 = fsEnv.socketTextStream("Spark", 8888)
s1.connect(s2).flatMap(
  (line: String) => line.split("\\s+"), // transformation logic for stream s1
  (line: String) => line.split("\\s+")  // transformation logic for stream s2
)
  .map((_, 1))
  .keyBy(0)
  .sum(1)
  .print()
fsEnv.execute("ConnectedStream")
Split the stream into two or more streams according to some criterion.
val split = someDataStream.split(
  (num: Int) =>
    (num % 2) match {
      case 0 => List("even")
      case _ => List("odd") // also covers -1, the remainder of negative odd numbers
    }
)
Select one or more streams from a split stream.
val even = split select "even"
val odd = split select "odd"
val all = split.select("even","odd")
Example
import java.{lang, util}
import org.apache.flink.streaming.api.collector.selector.OutputSelector
import org.apache.flink.streaming.api.scala._

val fsEnv = StreamExecutionEnvironment.getExecutionEnvironment
val logStream = fsEnv.socketTextStream("Spark", 9999)
val splitStream: SplitStream[String] = logStream.split(new OutputSelector[String] {
  // Tag lines that start with "INFO" as "info" and all other lines as "error"
  override def select(out: String): lang.Iterable[String] = {
    val tags = new util.ArrayList[String]()
    tags.add(if (out.startsWith("INFO")) "info" else "error")
    tags
  }
})
splitStream.select("info").print("info")
splitStream.select("error").printToErr("error")
fsEnv.execute("ConnectedStream")
Approach 2 (preferred)
import org.apache.flink.streaming.api.functions.ProcessFunction
import org.apache.flink.streaming.api.scala._
import org.apache.flink.util.Collector

val fsEnv = StreamExecutionEnvironment.getExecutionEnvironment
val logStream = fsEnv.socketTextStream("Spark", 9999)
val errorTag = new OutputTag[String]("error")
val dataStream = logStream.process(new ProcessFunction[String, String] {
  override def processElement(line: String,
                              context: ProcessFunction[String, String]#Context,
                              collector: Collector[String]): Unit = {
    if (line.startsWith("INFO")) {
      collector.collect(line) // main output
    } else {
      context.output(errorTag, line) // side output
    }
  }
})
dataStream.print("normal")
dataStream.getSideOutput(errorTag).print("error")
fsEnv.execute("ConnectedStream")
Logically partitions a stream into disjoint partitions, each partition containing elements of the same key. Internally, this is implemented with hash partitioning.
dataStream.keyBy("someKey") // Key by field "someKey"
dataStream.keyBy(0) // Key by the first element of a Tuple
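To illustrate what "disjoint partitions by key" means, here is a minimal plain-Scala sketch that does not use Flink. It is a simplified stand-in, not Flink's actual assignment logic: the key's hash code directly picks one of `parallelism` partitions, so all elements with the same key land in the same partition. The names `KeyBySketch`, `partitionFor` and `keyBy` are hypothetical and exist only for this illustration.

```scala
object KeyBySketch {
  // Simplified stand-in for hash partitioning (assumption, not Flink's real code):
  // the key's hash code selects a partition index in [0, parallelism).
  def partitionFor[K](key: K, parallelism: Int): Int =
    Math.floorMod(key.hashCode, parallelism)

  // Groups (key, value) events into partitions by the hash of their key.
  // Arrival order is preserved inside each partition.
  def keyBy[K, V](events: Seq[(K, V)], parallelism: Int): Map[Int, Seq[(K, V)]] =
    events.groupBy { case (key, _) => partitionFor(key, parallelism) }

  def main(args: Array[String]): Unit = {
    val events = Seq(("a", 1), ("b", 2), ("a", 3))
    println(partitionFor("a", 4)) // 1, because "a".hashCode is 97 and 97 mod 4 is 1
    // Both ("a", _) events end up in the same partition
    keyBy(events, 4).toSeq.sortBy(_._1).foreach(println)
  }
}
```

Because the partition depends only on the key, per-key operations such as reduce or sum can keep their state local to one partition.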
A “rolling” reduce on a keyed data stream. Combines the current element with the last reduced value and emits the new value.
keyedStream.reduce { _ + _ }
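To make the "rolling" semantics concrete, the following plain-Scala sketch models, without using Flink, what a keyed reduce emits: one output per input element, each being the running result for that element's key. `RollingReduceSketch` and `rollingReduce` are hypothetical names used only for illustration, and the finite `Seq` stands in for an unbounded stream.

```scala
object RollingReduceSketch {
  // Models a keyed rolling reduce over a finite list of (key, value) events.
  // For every event it emits the running reduced value of that event's key.
  def rollingReduce[K, V](events: Seq[(K, V)])(f: (V, V) => V): Seq[(K, V)] = {
    val state = scala.collection.mutable.Map.empty[K, V] // last reduced value per key
    events.map { case (key, value) =>
      val reduced = state.get(key) match {
        case Some(previous) => f(previous, value) // combine with the last reduced value
        case None           => value              // first element of a key passes through
      }
      state(key) = reduced
      (key, reduced)
    }
  }

  def main(args: Array[String]): Unit = {
    val events = Seq(("a", 1), ("b", 10), ("a", 2), ("a", 3), ("b", 5))
    println(rollingReduce(events)(_ + _))
    // List((a,1), (b,10), (a,3), (a,6), (b,15))
  }
}
```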
A “rolling” fold on a keyed data stream with an initial value. Combines the current element with the last folded value and emits the new value.
val result: DataStream[String] =
keyedStream.fold("start")((str, i) => { str + "-" + i })
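The effect of the initial value can be sketched in plain Scala, again without Flink. The sketch models the elements of a single key as a finite `Seq` and assumes the bare initial value itself is never emitted; `RollingFoldSketch` and `rollingFold` are hypothetical names.

```scala
object RollingFoldSketch {
  // Models a rolling fold over the elements of one key:
  // starts from `initial` and emits the folded value after every element.
  def rollingFold[A, B](values: Seq[A], initial: B)(f: (B, A) => B): Seq[B] =
    values.scanLeft(initial)(f).tail // drop the bare initial value

  def main(args: Array[String]): Unit = {
    println(rollingFold(Seq(1, 2, 3), "start")((str, i) => str + "-" + i))
    // List(start-1, start-1-2, start-1-2-3)
  }
}
```

Unlike reduce, the result type (`String` here) may differ from the element type (`Int`).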
Rolling aggregations on a keyed data stream. The difference between min and minBy is that min returns the minimum value, whereas minBy returns the element that has the minimum value in this field (same for max and maxBy).
keyedStream.sum(0)
keyedStream.sum("key")
keyedStream.min(0)
keyedStream.min("key")
keyedStream.max(0)
keyedStream.max("key")
keyedStream.minBy(0)
keyedStream.minBy("key")
keyedStream.maxBy(0)
keyedStream.maxBy("key")
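The difference between min and minBy is easiest to see on records with more than one field. The plain-Scala sketch below models both for the readings of a single key, without using Flink. It assumes that min only updates the aggregated field and leaves the remaining fields as they were in the first element seen, which is the behaviour typically observed but should be verified against your Flink version. `MinVsMinBySketch`, `Reading`, `rollingMin` and `rollingMinBy` are hypothetical names.

```scala
object MinVsMinBySketch {
  final case class Reading(sensor: String, label: String, temp: Int)

  // Rolling min on `temp`: only the aggregated field changes; the other
  // fields stay those of the first element seen (assumption, see lead-in).
  def rollingMin(readings: Seq[Reading]): Seq[Reading] =
    if (readings.isEmpty) Seq.empty
    else readings.tail.scanLeft(readings.head) { (acc, next) =>
      acc.copy(temp = math.min(acc.temp, next.temp))
    }

  // Rolling minBy on `temp`: the whole element holding the minimum is kept.
  def rollingMinBy(readings: Seq[Reading]): Seq[Reading] =
    if (readings.isEmpty) Seq.empty
    else readings.tail.scanLeft(readings.head) { (acc, next) =>
      if (next.temp < acc.temp) next else acc
    }

  def main(args: Array[String]): Unit = {
    val readings = Seq(
      Reading("s1", "first", 20),
      Reading("s1", "second", 15),
      Reading("s1", "third", 18)
    )
    println(rollingMin(readings))
    // List(Reading(s1,first,20), Reading(s1,first,15), Reading(s1,first,15))
    println(rollingMinBy(readings))
    // List(Reading(s1,first,20), Reading(s1,second,15), Reading(s1,second,15))
  }
}
```

If the non-aggregated fields matter downstream, minBy and maxBy are the safer choice.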
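Flink provides several partitioning schemes for users to choose from. The purpose of partitioning is to distribute data evenly across tasks.
Partitioning scheme | Description |
---|---|
Custom partitioning | Requires the user to implement the partitioning strategy. dataStream.partitionCustom(partitioner, "someKey") |
Random partitioning | Distributes the current data randomly among the downstream tasks. dataStream.shuffle() |
Rebalancing (Round-robin partitioning) | Distributes upstream data evenly among the downstream tasks in round-robin fashion. dataStream.rebalance() |
Rescaling | Rescales the partitioned data. For example, with an upstream parallelism of 2 and a downstream parallelism of 4, one upstream partition sends its data to the first two downstream partitions and the other upstream partition sends its data to the last two. dataStream.rescale() |
Broadcasting | Each upstream partition broadcasts all of its data to every downstream task partition. dataStream.broadcast() |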
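The difference between rebalancing and rescaling can be sketched by listing which downstream subtasks a given upstream subtask may send to. The plain-Scala model below does not use Flink and makes a simplifying assumption: one parallelism is an exact multiple of the other. `RescaleSketch`, `rebalanceTargets` and `rescaleTargets` are hypothetical names.

```scala
object RescaleSketch {
  // rebalance(): every upstream subtask round-robins over ALL downstream subtasks.
  def rebalanceTargets(downstream: Int): Seq[Int] = 0 until downstream

  // rescale(): upstream subtask `i` only talks to a fixed subset of downstream subtasks.
  // Simplifying assumption: one parallelism is an exact multiple of the other.
  def rescaleTargets(i: Int, upstream: Int, downstream: Int): Seq[Int] = {
    require(i >= 0 && i < upstream, "upstream index out of range")
    if (downstream >= upstream) {
      require(downstream % upstream == 0, "sketch only covers exact multiples")
      val fanOut = downstream / upstream
      (i * fanOut) until ((i + 1) * fanOut) // a contiguous block of downstream subtasks
    } else {
      require(upstream % downstream == 0, "sketch only covers exact multiples")
      Seq(i / (upstream / downstream)) // several upstream subtasks share one target
    }
  }

  def main(args: Array[String]): Unit = {
    println(rebalanceTargets(4).toList)     // List(0, 1, 2, 3)
    println(rescaleTargets(0, 2, 4).toList) // List(0, 1)
    println(rescaleTargets(1, 2, 4).toList) // List(2, 3)
  }
}
```

With upstream parallelism 2 and downstream parallelism 4 this reproduces the example from the table: the first upstream partition feeds the first two downstream partitions, the second feeds the last two.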
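Chaining two operator transformations means Flink tries to place them in the same thread, which reduces thread overhead and avoids unnecessary communication between threads. Users can disable chaining with StreamExecutionEnvironment.disableOperatorChaining().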
val dataStream: DataStream[String] = fsEnv.socketTextStream("Spark", 9999)
// 3. Transform the data
dataStream.filter(line => line.startsWith("INFO"))
  .flatMap(_.split("\\s+"))
  .map((_, 1))
  .map(t => WordPair(t._1, t._2))
  .print()
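For convenience, Flink provides the following operators for modifying chaining behavior:
Operator | Usage | Description |
---|---|---|
Start new chain | someStream.filter(…).map(…).startNewChain().map(…) | Starts a new chain at the current operator, breaking it off from the preceding filter. |
Disable chaining | someStream.map(…).disableChaining() | The current operator is chained with neither the preceding nor the following operator. |
Set slot sharing group | someStream.filter(…).slotSharingGroup("name") | Sets the slot sharing group of the operation, which affects how its tasks occupy TaskSlots. |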
val fsEnv = StreamExecutionEnvironment.getExecutionEnvironment
// 2. Create the DataStream (detailed)
val dataStream: DataStream[String] = fsEnv.socketTextStream("Spark", 9999)
// 3. Transform the data
dataStream.filter(line => line.startsWith("INFO"))
  .flatMap(_.split("\\s+"))
  .startNewChain()
  .slotSharingGroup("g1")
  .map((_, 1))
  .map(t => WordPair(t._1, t._2))
  .print()
fsEnv.execute("FlinkWordCountsQuickStart")