A data source is where a program reads its data from. A SourceFunction is added to the program via env.addSource(SourceFunction). Flink ships with many built-in SourceFunction implementations, but users can also implement the SourceFunction interface themselves (non-parallel) or the ParallelSourceFunction interface (parallel); if state management is needed, they can extend RichParallelSourceFunction.
readTextFile(path)
- Reads a text file line by line, i.e. a file that complies with the TextInputFormat specification, and returns each line as a String.
//1. Create the stream execution environment
val env = StreamExecutionEnvironment.getExecutionEnvironment
//2. Create the DataStream - source-specific part
val text:DataStream[String] = env.readTextFile("hdfs://CentOS:9000/demo/words")
//3. Apply DataStream transformation operators
val counts = text.flatMap(line=>line.split("\\s+"))
.map(word=>(word,1))
.keyBy(0)
.sum(1)
//4. Print the results to the console
counts.print()
//5. Execute the streaming job
env.execute("Window Stream WordCount")
readFile(fileInputFormat, path)
- Reads (once) the file in the given path using the specified file input format.
//1. Create the stream execution environment
val env = StreamExecutionEnvironment.getExecutionEnvironment
//2. Create the DataStream - source-specific part
var inputFormat:FileInputFormat[String]=new TextInputFormat(null)
val text:DataStream[String] = env.readFile(inputFormat,"hdfs://CentOS:9000/demo/words")
//3. Apply DataStream transformation operators
val counts = text.flatMap(line=>line.split("\\s+"))
.map(word=>(word,1))
.keyBy(0)
.sum(1)
//4. Print the results to the console
counts.print()
//5. Execute the streaming job
env.execute("Window Stream WordCount")
- readFile(fileInputFormat, path, watchType, interval, pathFilter, typeInfo)
This is the method that the two variants above call internally. It reads files in the path according to the given fileInputFormat. Depending on the provided watchType, this source may periodically monitor the path for new data every interval ms (FileProcessingMode.PROCESS_CONTINUOUSLY), or process the data currently in the path once and exit (FileProcessingMode.PROCESS_ONCE). With pathFilter, the user can further exclude files from being processed (a sketch is shown after the next example).
//1. Create the stream execution environment
val env = StreamExecutionEnvironment.getExecutionEnvironment
//2. Create the DataStream - source-specific part
var inputFormat:FileInputFormat[String]=new TextInputFormat(null)
val text:DataStream[String] = env.readFile(inputFormat,
"hdfs://CentOS:9000/demo/words",FileProcessingMode.PROCESS_CONTINUOUSLY,1000)
//3. Apply DataStream transformation operators
val counts = text.flatMap(line=>line.split("\\s+"))
.map(word=>(word,1))
.keyBy(0)
.sum(1)
//4. Print the results to the console
counts.print()
//5. Execute the streaming job
env.execute("Window Stream WordCount")
This mode monitors the files under the watched directory; if a file changes, the system re-reads it, which can cause the file to be processed again (duplicate computation). In general it is not recommended to modify file contents; just upload new files instead.
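The pathFilter mentioned above is not used in the examples. Below is a minimal, hedged sketch of excluding files by setting a FilePathFilter on the input format (assumption: FileInputFormat.setFilesFilter is used instead of passing the filter as a readFile argument; filterPath returning true means the file is skipped):
import org.apache.flink.api.common.io.FilePathFilter
import org.apache.flink.api.java.io.TextInputFormat
import org.apache.flink.core.fs.Path
import org.apache.flink.streaming.api.functions.source.FileProcessingMode
import org.apache.flink.streaming.api.scala._

val env = StreamExecutionEnvironment.getExecutionEnvironment
val inputFormat = new TextInputFormat(null)
//skip files whose name starts with "_" (e.g. _SUCCESS marker files)
inputFormat.setFilesFilter(new FilePathFilter {
  override def filterPath(filePath: Path): Boolean = filePath.getName.startsWith("_")
})
val text = env.readFile(inputFormat, "hdfs://CentOS:9000/demo/words",
  FileProcessingMode.PROCESS_CONTINUOUSLY, 1000)
text.print()
env.execute("ReadFile With PathFilter")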
socketTextStream
- Reads from a socket. Elements can be separated by a delimiter.
//1. Create the stream execution environment
val env = StreamExecutionEnvironment.getExecutionEnvironment
//2. Create the DataStream - source-specific part
val text = env.socketTextStream("CentOS", 9999,'\n',3)
//3. Apply DataStream transformation operators
val counts = text.flatMap(line=>line.split("\\s+"))
.map(word=>(word,1))
.keyBy(0)
.sum(1)
//4. Print the results to the console
counts.print()
//5. Execute the streaming job
env.execute("Window Stream WordCount")
fromCollection
- Reads the data of a collection.
//1. Create the stream execution environment
val env = StreamExecutionEnvironment.getExecutionEnvironment
//2. Create the DataStream - source-specific part
val text = env.fromCollection(List("this is a demo","hello word"))
//3. Apply DataStream transformation operators
val counts = text.flatMap(line=>line.split("\\s+"))
.map(word=>(word,1))
.keyBy(0)
.sum(1)
//4. Print the results to the console
counts.print()
//5. Execute the streaming job
env.execute("Window Stream WordCount")
Users can also define their own data sources.
import org.apache.flink.streaming.api.functions.source.SourceFunction
import scala.util.Random
class UserDefinedNonParallelSourceFunction extends SourceFunction[String]{
@volatile //ensure changes to the flag are visible across threads
var isRunning:Boolean=true
val lines:Array[String]=Array("this is a demo","hello world","ni hao ma")
//this method runs the source; emit data downstream via the collect method of sourceContext
override def run(sourceContext: SourceFunction.SourceContext[String]): Unit = {
while(isRunning){
Thread.sleep(100)
//send data downstream
sourceContext.collect(lines(new Random().nextInt(lines.size)))
}
}
//release resources / stop the source
override def cancel(): Unit = {
isRunning=false
}
}
import org.apache.flink.streaming.api.functions.source.{ParallelSourceFunction,
SourceFunction}
import scala.util.Random
class UserDefinedParallelSourceFunction extends ParallelSourceFunction[String]{
@volatile //ensure changes to the flag are visible across threads
var isRunning:Boolean=true
val lines:Array[String]=Array("this is a demo","hello world","ni hao ma")
//this method runs the source; emit data downstream via the collect method of sourceContext
override def run(sourceContext: SourceFunction.SourceContext[String]): Unit = {
while(isRunning){
Thread.sleep(100)
//send data downstream
sourceContext.collect(lines(new Random().nextInt(lines.size)))
}
}
//release resources / stop the source
override def cancel(): Unit = {
isRunning=false
}
}
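The intro above also mentions RichParallelSourceFunction, which is not shown elsewhere in this section. Here is a minimal sketch, assuming it only adds the open()/close() lifecycle and runtime-context access on top of the parallel source above (the class name is made up for illustration):
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.functions.source.{RichParallelSourceFunction, SourceFunction}
import scala.util.Random

class UserDefinedRichParallelSourceFunction extends RichParallelSourceFunction[String] {
  @volatile var isRunning: Boolean = true
  val lines: Array[String] = Array("this is a demo", "hello world", "ni hao ma")

  //open() gives access to the runtime context (e.g. the subtask index) and is the place to set up resources
  override def open(parameters: Configuration): Unit = {
    println("open subtask " + getRuntimeContext.getIndexOfThisSubtask)
  }

  override def run(ctx: SourceFunction.SourceContext[String]): Unit = {
    while (isRunning) {
      Thread.sleep(100)
      //send data downstream
      ctx.collect(lines(new Random().nextInt(lines.length)))
    }
  }

  //stop the source
  override def cancel(): Unit = {
    isRunning = false
  }

  //release resources
  override def close(): Unit = {
    println("close subtask")
  }
}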
Test
//1. Create the stream execution environment
val env = StreamExecutionEnvironment.getExecutionEnvironment
env.setParallelism(4)
//2. Create the DataStream - source-specific part
val text = env.addSource[String](new UserDefinedParallelSourceFunction) //or new UserDefinedNonParallelSourceFunction
//3. Apply DataStream transformation operators
val counts = text.flatMap(line=>line.split("\\s+"))
.map(word=>(word,1))
.keyBy(0)
.sum(1)
//4. Print the results to the console
counts.print()
println(env.getExecutionPlan) //print the execution plan
//5. Execute the streaming job
env.execute("Window Stream WordCount")
Kafka integration (source) - required dependency:
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-connector-kafka_2.11</artifactId>
    <version>1.10.0</version>
</dependency>
//1. Create the stream execution environment
val env = StreamExecutionEnvironment.getExecutionEnvironment
//2. Create the DataStream - source-specific part
val props = new Properties()
props.setProperty("bootstrap.servers", "CentOS:9092")
props.setProperty("group.id", "g1")
val text = env.addSource(new FlinkKafkaConsumer[String]("topic01",new SimpleStringSchema(),props))
//3. Apply DataStream transformation operators
val counts = text.flatMap(line=>line.split("\\s+"))
.map(word=>(word,1))
.keyBy(0)
.sum(1)
//4. Print the results to the console
counts.print()
//5. Execute the streaming job
env.execute("Window Stream WordCount")
package com.zb.datasorce
import org.apache.flink.api.common.typeinfo.TypeInformation
import org.apache.flink.streaming.connectors.kafka.KafkaDeserializationSchema
import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.flink.api.scala._
/**
* Deserializes each Kafka record so that its key, value, partition and offset can be accessed.
*/
class MyKafkaDeserializationSchema extends KafkaDeserializationSchema[(String, String, Int, Long)] {
//whether to end the stream (never, for an unbounded source)
override def isEndOfStream(t: (String, String, Int, Long)): Boolean = false
//extract the record
override def deserialize(consumerRecord: ConsumerRecord[Array[Byte], Array[Byte]]): (String, String, Int, Long) = {
//if the key is not null, read it
if (consumerRecord.key() != null) {
(new String(consumerRecord.key()),new String(consumerRecord.value()),consumerRecord.partition(),consumerRecord.offset())
} else {
//key is null, so return an empty key
("",new String(consumerRecord.value()),consumerRecord.partition(),consumerRecord.offset())
}
}
override def getProducedType: TypeInformation[(String, String, Int, Long)] = {
createTypeInformation[(String, String, Int, Long)]
}
}
Test
package com.zb.datasorce
import java.util.Properties
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer
object KafkaDataSource02 {
def main(args: Array[String]): Unit = {
//create the stream execution environment
val env = StreamExecutionEnvironment.getExecutionEnvironment
//Kafka configuration
val properties = new Properties()
properties.setProperty("bootstrap.servers", "CentOS:9092")
// only needed for Kafka 0.8
properties.setProperty("zookeeper.connect", "CentOS:2181")
properties.setProperty("group.id", "g1")
//2. Add the source
val text = env.addSource(new FlinkKafkaConsumer[(String,String,Int,Long)]("topic01", new MyKafkaDeserializationSchema(), properties))
//3. Apply DataStream transformation operators
val counts = text.flatMap(line=>{
println(line)
line._2.split("\\s+")})
.map(word=>(word,1))
.keyBy(0)
.sum(1)
//4. Print the results to the console
counts.print()
//5. Execute the streaming job
env.execute("Window Stream WordCount")
}
}
If the Kafka records are JSON, JSONKeyValueDeserializationSchema can be used to parse key, value and (optionally) metadata:
//1. Create the stream execution environment
val env = StreamExecutionEnvironment.getExecutionEnvironment
//2. Create the DataStream - source-specific part
val props = new Properties()
props.setProperty("bootstrap.servers", "CentOS:9092")
props.setProperty("group.id", "g1")
//{"id":1,"name":"zhangsan"}
val text = env.addSource(new FlinkKafkaConsumer[ObjectNode]("topic01",new JSONKeyValueDeserializationSchema(true),props))
//t:{"value":{"id":1,"name":"zhangsan"},"metadata":{"offset":0,"topic":"topic01","partition":13}}
text.map(t=> (t.get("value").get("id").asInt(),t.get("value").get("name").asText()))
.print()
//5. Execute the streaming job
env.execute("Window Stream WordCount")
For details, see: https://ci.apache.org/projects/flink/flink-docs-release-1.10/dev/connectors/kafka.html
Data Sinks consume DataStreams and forward them to files, sockets, external systems, or print them. Flink comes with a variety of built-in output formats that are encapsulated behind operations on DataStreams.
TextOutputFormat
- Writes elements line-wise as strings. The strings are obtained by calling the toString() method of each element.
CsvOutputFormat
- Writes tuples as comma-separated value files. Row and field delimiters are configurable. The value for each field comes from the toString() method of the object.
FileOutputFormat
- Method and base class for custom file output. Supports custom object-to-bytes conversion.
Note that the write*() methods on DataStream are mainly intended for debugging purposes.
//1. Create the stream execution environment
val env = StreamExecutionEnvironment.getExecutionEnvironment
//2. Create the DataStream - source-specific part
val text = env.socketTextStream("CentOS", 9999)
//3. Apply DataStream transformation operators
val counts = text.flatMap(line=>line.split("\\s+"))
.map(word=>(word,1))
.keyBy(0)
.sum(1)
//4. Write the results to the file system
counts.writeUsingOutputFormat(new TextOutputFormat[(String, Int)](new Path("file:///Users/admin/Desktop/flink-results")))
//5. Execute the streaming job
env.execute("Window Stream WordCount")
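For CSV output of the same tuple stream, a minimal sketch (assumption: the counts stream from the example above; writeAsCsv wraps the CSV output format and, like the other write*() methods, is mainly for debugging):
//write (word, count) tuples as comma-separated lines using the default row/field delimiters
counts.writeAsCsv("file:///Users/admin/Desktop/flink-csv-results")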
Note: if the output is changed to HDFS, the user needs to generate a large amount of data before the results become visible, because the HDFS write buffer is relatively large. The file system sinks above do not participate in checkpointing; in production, flink-connector-filesystem is typically used to write to external systems.
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-connector-filesystem_2.11</artifactId>
    <version>1.10.0</version>
</dependency>
//1. Create the stream execution environment
val env = StreamExecutionEnvironment.getExecutionEnvironment
//2. Create the DataStream - source-specific part
val text = env.readTextFile("hdfs://CentOS:9000/demo/words")
var bucketingSink=StreamingFileSink.forRowFormat(new Path("hdfs://CentOS:9000/bucket-results"),
new SimpleStringEncoder[(String,Int)]("UTF-8"))
.withBucketAssigner(new DateTimeBucketAssigner[(String, Int)]("yyyy-MM-dd"))//dynamically derive the bucket (output) path
.build()
//3. Apply DataStream transformation operators
val counts = text.flatMap(line=>line.split("\\s+"))
.map(word=>(word,1))
.keyBy(0)
.sum(1)
counts.addSink(bucketingSink)
//5. Execute the streaming job
env.execute("Window Stream WordCount")
Legacy (old-version) approach
//1. Create the stream execution environment
val env = StreamExecutionEnvironment.getExecutionEnvironment
env.setParallelism(4)
//2. Create the DataStream - source-specific part
val text = env.readTextFile("hdfs://CentOS:9000/demo/words")
var bucketingSink=new BucketingSink[(String,Int)]("hdfs://CentOS:9000/bucketresults")
bucketingSink.setBucketer(new DateTimeBucketer[(String,Int)]("yyyy-MM-dd"))
bucketingSink.setBatchSize(1024)
//3. Apply DataStream transformation operators
val counts = text.flatMap(line=>line.split("\\s+"))
.map(word=>(word,1))
.keyBy(0)
.sum(1)
counts.addSink(bucketingSink)
//5. Execute the streaming job
env.execute("Window Stream WordCount")
print() / printToErr()
- Prints the toString() value of each element to standard output / standard error. Optionally, a prefix (msg) can be provided, e.g. print("msg"), which makes it easier to tell different outputs apart when debugging. If the parallelism is greater than 1, the output is also prepended with the index of the subtask that produced it.
//1. Create the stream execution environment
val env = StreamExecutionEnvironment.getExecutionEnvironment
env.setParallelism(4)
//2. Create the DataStream - source-specific part
val text = env.readTextFile("hdfs://CentOS:9000/demo/words")
//3. Apply DataStream transformation operators
val counts = text.flatMap(line=>line.split("\\s+"))
.map(word=>(word,1))
.keyBy(0)
.sum(1)
//4. Print the results to standard error with a prefix, using a sink parallelism of 2
counts.printToErr("test").setParallelism(2)
//5. Execute the streaming job
env.execute("Window Stream WordCount")
Users can define custom sinks as needed by extending RichSinkFunction.
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.functions.sink.{RichSinkFunction, SinkFunction}
class UserDefinedSinkFunction extends RichSinkFunction[(String,Int)]{
override def open(parameters: Configuration): Unit = {
println("opening connection...")
}
override def invoke(value: (String, Int), context: SinkFunction.Context[_]): Unit = {
println("output: "+value)
}
override def close(): Unit = {
println("releasing connection")
}
}
Test
//1. Create the stream execution environment
val env = StreamExecutionEnvironment.getExecutionEnvironment
env.setParallelism(1)
//2. Create the DataStream - source-specific part
val text = env.readTextFile("hdfs://CentOS:9000/demo/words")
//3. Apply DataStream transformation operators
val counts = text.flatMap(line=>line.split("\\s+"))
.map(word=>(word,1))
.keyBy(0)
.sum(1)
counts.addSink(new UserDefinedSinkFunction)
//5. Execute the streaming job
env.execute("Window Stream WordCount")
RedisSink
Reference: https://bahir.apache.org/docs/flink/current/flink-streaming-redis/
<dependency>
    <groupId>org.apache.bahir</groupId>
    <artifactId>flink-connector-redis_2.11</artifactId>
    <version>1.0</version>
</dependency>
//1. Create the stream execution environment
val env = StreamExecutionEnvironment.getExecutionEnvironment
env.setParallelism(1)
//2. Create the DataStream - source-specific part
val text = env.readTextFile("hdfs://CentOS:9000/demo/words")
var flinkJedisConf = new FlinkJedisPoolConfig.Builder()
.setHost("CentOS")
.setPort(6379)
.build()
//3. Apply DataStream transformation operators
val counts = text.flatMap(line=>line.split("\\s+"))
.map(word=>(word,1))
.keyBy(0)
.sum(1)
counts.addSink(new RedisSink(flinkJedisConf,new UserDefinedRedisMapper()))
//5. Execute the streaming job
env.execute("Window Stream WordCount")
The RedisMapper used by the RedisSink:
import org.apache.flink.streaming.connectors.redis.common.mapper.{RedisCommand,RedisCommandDescription,RedisMapper}
class UserDefinedRedisMapper extends RedisMapper[(String,Int)]{
override def getCommandDescription: RedisCommandDescription = {
new RedisCommandDescription(RedisCommand.HSET,"wordcounts")
}
override def getKeyFromData(data: (String, Int)): String = data._1
override def getValueFromData(data: (String, Int)): String = data._2+""
}
Kafka integration (sink)
Required dependency:
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-connector-kafka_2.11</artifactId>
    <version>1.10.0</version>
</dependency>
Option 1: KafkaSerializationSchema
import org.apache.flink.streaming.connectors.kafka.KafkaSerializationSchema
import org.apache.kafka.clients.producer.ProducerRecord
class UserDefinedKafkaSerializationSchema extends KafkaSerializationSchema[(String,Int)]{
override def serialize(element: (String, Int), timestamp: lang.Long):ProducerRecord[Array[Byte], Array[Byte]]= {
return new ProducerRecord("topic01",element._1.getBytes(),element._2.toString.getBytes())
}
}
//1. Create the stream execution environment
val env = StreamExecutionEnvironment.getExecutionEnvironment
env.setParallelism(4)
//2. Create the DataStream - source-specific part
val text = env.readTextFile("hdfs://CentOS:9000/demo/words")
val props = new Properties()
props.setProperty(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "CentOS:9092")
props.setProperty(ProducerConfig.BATCH_SIZE_CONFIG,"100")
props.setProperty(ProducerConfig.LINGER_MS_CONFIG,"500")
//Semantic.EXACTLY_ONCE: write to Kafka transactionally (exactly-once)
//Semantic.AT_LEAST_ONCE: enable the Kafka retries mechanism
val kafkaSink = new FlinkKafkaProducer[(String, Int)]("defult_topic",
new UserDefinedKafkaSerializationSchema, props,
Semantic.AT_LEAST_ONCE)
//3. Apply DataStream transformation operators
val counts = text.flatMap(line=>line.split("\\s+"))
.map(word=>(word,1))
.keyBy(0)
.sum(1)
counts.addSink(kafkaSink)
//5. Execute the streaming job
env.execute("Window Stream WordCount")
The defult_topic above is not actually used: the ProducerRecord created in UserDefinedKafkaSerializationSchema already specifies the target topic (topic01).
Option 2: KeyedSerializationSchema
class UserDefinedKeyedSerializationSchema extends KeyedSerializationSchema[(String,Int)]{
override def serializeKey(element: (String, Int)): Array[Byte] = {
element._1.getBytes()
}
override def serializeValue(element: (String, Int)): Array[Byte] = {
element._2.toString.getBytes()
}
//may override the target topic; if it returns null, the record is written to the default topic
override def getTargetTopic(element: (String, Int)): String = {
null
}
}
//1. Create the stream execution environment
val env = StreamExecutionEnvironment.getExecutionEnvironment
env.setParallelism(4)
//2. Create the DataStream - source-specific part
val text = env.readTextFile("hdfs://CentOS:9000/demo/words")
val props = new Properties()
props.setProperty(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "CentOS:9092")
props.setProperty(ProducerConfig.BATCH_SIZE_CONFIG,"100")
props.setProperty(ProducerConfig.LINGER_MS_CONFIG,"500")
//Semantic.EXACTLY_ONCE: write to Kafka transactionally (exactly-once)
//Semantic.AT_LEAST_ONCE: enable the Kafka retries mechanism
val kafkaSink = new FlinkKafkaProducer[(String, Int)]("defult_topic",
new UserDefinedKeyedSerializationSchema, props, Semantic.AT_LEAST_ONCE)
//3. Apply DataStream transformation operators
val counts = text.flatMap(line=>line.split("\\s+"))
.map(word=>(word,1))
.keyBy(0)
.sum(1)
counts.addSink(kafkaSink)
//5. Execute the streaming job
env.execute("Window Stream WordCount")
Map
Takes one element and produces one element. A map function that doubles the values of the input stream:
dataStream.map { x => x * 2 }
FlatMap
Takes one element and produces zero, one, or more elements. A flatMap function that splits sentences into words:
dataStream.flatMap { str => str.split(" ")}
Filter
Evaluates a boolean function for each element and retains those for which the function returns true:
dataStream.filter { _ != 0 }
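A minimal runnable sketch combining the three operators above (assumption: it reuses the CentOS:9999 socket source used elsewhere in this section):
val env = StreamExecutionEnvironment.getExecutionEnvironment
env.socketTextStream("CentOS", 9999)
  .flatMap(line => line.split("\\s+")) //one line in, zero or more words out
  .filter(word => word.nonEmpty)       //keep only non-empty words
  .map(word => (word, 1))              //one element in, one element out
  .print()
env.execute("Map FlatMap Filter")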
Union
Union of two or more data streams, creating a new stream containing all of the elements from all of the streams.
Note: if you union a data stream with itself, you will get each element twice in the resulting stream.
dataStream.union(otherStream1, otherStream2, ...)
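A minimal sketch of union (assumption: two socket sources on ports 9999 and 8888, matching the connect example below):
val env = StreamExecutionEnvironment.getExecutionEnvironment
val s1 = env.socketTextStream("CentOS", 9999)
val s2 = env.socketTextStream("CentOS", 8888)
s1.union(s2)               //one stream containing the elements of both inputs
  .flatMap(_.split("\\s+"))
  .map((_, 1))
  .keyBy(0)
  .sum(1)
  .print()
env.execute("Union WordCount")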
DataStream,DataStream → ConnectedStreams
connect: "Connects" two data streams retaining their types, allowing state to be shared between the two streams.
someStream : DataStream[Int] = ...
otherStream : DataStream[String] = ...
val connectedStreams = someStream.connect(otherStream)
ConnectedStreams → DataStream
CoMap, CoFlatMap
Similar to map and flatMap, but on a ConnectedStreams:
connectedStreams.map(
(_ : Int) => true,
(_ : String) => false
)
connectedStreams.flatMap(
(_ : Int) => true,
(_ : String) => false
)
val env = StreamExecutionEnvironment.getExecutionEnvironment
val text1 = env.socketTextStream("CentOS", 9999)
val text2 = env.socketTextStream("CentOS", 8888)
text1.connect(text2)
.flatMap((line:String)=>line.split("\\s+"),(line:String)=>line.split("\\s+"))
.map((_,1))
.keyBy(0)
.sum(1)
.print("total")
env.execute("Stream WordCount")
Split - splits a stream into two or more streams according to some criterion.
val split = someDataStream.split(
(num: Int) =>
(num % 2) match {
case 0 => List("even")
case 1 => List("odd")
}
)
Select - selects one or more output streams from a split stream.
val even = split.select("even")
val odd = split.select("odd")
val all = split.select("even","odd")
val env = StreamExecutionEnvironment.getExecutionEnvironment
val text1 = env.socketTextStream("CentOS", 9999)
var splitStream= text1.split(line=> {
if(line.contains("error")){
List("error")
} else{
List("info")
}
})
splitStream.select("error").printToErr("错误")
splitStream.select("info").print("信息")
splitStream.select("error","info").print("All")
env.execute("Stream WordCount")
ProcessFunction
In general, ProcessFunction with side outputs is the preferred way to split a stream.
val env = StreamExecutionEnvironment.getExecutionEnvironment
val text = env.socketTextStream("CentOS", 9999)
val errorTag = new OutputTag[String]("error")
val allTag = new OutputTag[String]("all")
val infoStream = text.process(new ProcessFunction[String, String] {
override def processElement(value: String,ctx: ProcessFunction[String, String]#Context,out: Collector[String]): Unit = {
if (value.contains("error")) {
ctx.output(errorTag, value) //side output
} else {
out.collect(value) //normal record
}
ctx.output(allTag, value) //side output
}
})
infoStream.getSideOutput(errorTag).printToErr("error")
infoStream.getSideOutput(allTag).printToErr("all")
infoStream.print("normal")
env.execute("Stream WordCount")
keyBy
Logically partitions a stream into disjoint partitions, each containing elements of the same key. Internally this is implemented with hash partitioning.
dataStream.keyBy("someKey") // Key by field "someKey"
dataStream.keyBy(0) // Key by the first element of a Tuple
Reduce
A "rolling" reduce on a KeyedStream. Combines the current element with the last reduced value and emits the new value.
keyedStream.reduce(_ + _)
val env = StreamExecutionEnvironment.getExecutionEnvironment
val lines = env.socketTextStream("CentOS", 9999)
lines.flatMap(_.split("\\s+"))
.map((_,1))
.keyBy("_1")
.reduce((v1,v2)=>(v1._1,v1._2+v2._2))
.print()
env.execute("Stream WordCount")
Fold
A "rolling" fold on a KeyedStream with an initial value. Combines the current element with the last folded value and emits the new value. Applied to the sequence (1,2,3,4,5), the fold function below emits the sequence "start-1", "start-1-2", "start-1-2-3", ...
val result: DataStream[String] =
keyedStream.fold("start")((str, i) => { str + "-" + i })
val env = StreamExecutionEnvironment.getExecutionEnvironment
val lines = env.socketTextStream("CentOS", 9999)
lines.flatMap(_.split("\\s+"))
.map((_,1))
.keyBy("_1")
.fold((null:String,0:Int))((z,v)=>(v._1,v._2+z._2))
.print()
env.execute("Stream WordCount")
Aggregations
Rolling aggregations on a KeyedStream. The difference between min and minBy is that min returns the minimum value, whereas minBy returns the element that has the minimum value in the given field (the same applies to max and maxBy).
keyedStream.sum(0)
keyedStream.sum("key")
keyedStream.min(0)
keyedStream.min("key")
keyedStream.max(0)
keyedStream.max("key")
keyedStream.minBy(0)
keyedStream.minBy("key")
keyedStream.maxBy(0)
keyedStream.maxBy("key")
//assumes a top-level case class: case class Emp(name: String, dept: String, salary: Double)
val env = StreamExecutionEnvironment.getExecutionEnvironment
//sample input typed into the socket:
//zhangsan 研发部 1000
//lisi 研发部 5000
//ww 销售部 9000
val lines = env.socketTextStream("CentOS", 9999)
lines.map(line=>line.split(" "))
.map(ts=>Emp(ts(0),ts(1),ts(2).toDouble))
.keyBy("dept")
.maxBy("salary")//Emp(lisi,研发部,5000.0)
.print()
env.execute("Stream WordCount")
If max is used instead, the result is Emp(zhangsan,研发部,5000.0): max only takes the maximum of the salary field, while the non-aggregated fields keep the values of the first element seen for that key.
Flink also provides the following functions for partitioning a DataStream after a transformation (if needed):
dataStream.rebalance()
dataStream.shuffle()
dataStream.rescale()
dataStream.broadcast
dataStream.partitionCustom(partitioner, "someKey")
dataStream.partitionCustom(partitioner, 0)
val env = StreamExecutionEnvironment.getExecutionEnvironment
env.socketTextStream("CentOS", 9999)
.map((_,1))
.partitionCustom(new Partitioner[String] {
override def partition(key: String, numPartitions: Int): Int = {
(key.hashCode & Integer.MAX_VALUE) % numPartitions //parentheses required: % binds tighter than &
}
},_._1)
.print()
.setParallelism(4)
println(env.getExecutionPlan)
env.execute("Stream WordCount")
Chaining two operators means placing them in the same thread, which avoids unnecessary thread and hand-over overhead and improves performance. By default, Flink chains operators whenever possible. A user can call:
StreamExecutionEnvironment.disableOperatorChaining()
to disable chaining globally, but this is not recommended.
someStream.filter(...).map(...).startNewChain().map(...)
startNewChain() begins a new chain starting with the first map: the two map operators are chained together, but not with the filter.
someStream.map(...).disableChaining()
disableChaining() prevents the map operator from being chained with any other operator.
someStream.filter(...).slotSharingGroup("name")
slotSharingGroup("name") puts the operator into the named slot sharing group; operators in the same group may share task slots, while operators in different groups are placed in different slots.