Apache Flink Streaming DataStream API (Chapter 3)

Streaming (DataStream API)

Data Sources

A data source is where a program reads its input from. A source is attached to the program with env.addSource(SourceFunction). Flink ships with many ready-made SourceFunction implementations, but users can also write their own by implementing the SourceFunction interface (non-parallel) or the ParallelSourceFunction interface (parallel); if state management is needed, extend RichParallelSourceFunction.
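For reference, here is a minimal sketch of a parallel source based on RichParallelSourceFunction (not from the original text; the class name is made up for illustration, and the checkpointing hooks are omitted). It mirrors the SourceFunction examples shown later in this section and is registered with env.addSource in the same way.

import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.functions.source.{RichParallelSourceFunction, SourceFunction}
import scala.util.Random

//hypothetical class name, for illustration only
class UserDefinedRichParallelSourceFunction extends RichParallelSourceFunction[String] {

  @volatile var isRunning: Boolean = true
  var lines: Array[String] = _

  //open() has access to the runtime context and is a good place to
  //initialize per-instance resources
  override def open(parameters: Configuration): Unit = {
    lines = Array("this is a demo", "hello world", "ni hao ma")
  }

  //emission loop; each parallel instance runs its own copy
  override def run(ctx: SourceFunction.SourceContext[String]): Unit = {
    while (isRunning) {
      Thread.sleep(100)
      ctx.collect(lines(new Random().nextInt(lines.length)))
    }
  }

  //stop the source and release resources
  override def cancel(): Unit = {
    isRunning = false
  }
}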

File-based

  • readTextFile(path) - Reads text files line by line, i.e. files that respect the TextInputFormat specification, and returns them as strings.
//1. Create the stream execution environment
val env = StreamExecutionEnvironment.getExecutionEnvironment

//2. Create the DataStream (source)
val text:DataStream[String] = env.readTextFile("hdfs://CentOS:9000/demo/words")

//3. Apply transformation operators to the DataStream
val counts = text.flatMap(line=>line.split("\\s+"))
 .map(word=>(word,1))
 .keyBy(0)
 .sum(1)

//4. Print the result to the console
counts.print()

//5. Execute the streaming job
env.execute("Window Stream WordCount")
  • readFile(fileInputFormat, path) - Reads a file (once) using the specified file input format.
//1. Create the stream execution environment
val env = StreamExecutionEnvironment.getExecutionEnvironment

//2. Create the DataStream (source)
var inputFormat:FileInputFormat[String]=new TextInputFormat(null)
val text:DataStream[String] = env.readFile(inputFormat,"hdfs://CentOS:9000/demo/words")

//3. Apply transformation operators to the DataStream
val counts = text.flatMap(line=>line.split("\\s+"))
 .map(word=>(word,1))
 .keyBy(0)
 .sum(1)

//4. Print the result to the console
counts.print()

//5. Execute the streaming job
env.execute("Window Stream WordCount")

  • readFile(fileInputFormat, path, watchType, interval, pathFilter, typeInfo) - This is the method the two previous calls delegate to internally. It reads files in the path according to the given fileInputFormat. Depending on the provided watchType, the source either periodically monitors the path for new data every interval ms (FileProcessingMode.PROCESS_CONTINUOUSLY), or processes the data currently in the path once and exits (FileProcessingMode.PROCESS_ONCE). With pathFilter, the user can further exclude files from processing.

//1. Create the stream execution environment
val env = StreamExecutionEnvironment.getExecutionEnvironment

//2. Create the DataStream (source)
var inputFormat:FileInputFormat[String]=new TextInputFormat(null)
val text:DataStream[String] = env.readFile(inputFormat,
 "hdfs://CentOS:9000/demo/words",FileProcessingMode.PROCESS_CONTINUOUSLY,1000)

//3. Apply transformation operators to the DataStream
val counts = text.flatMap(line=>line.split("\\s+"))
 .map(word=>(word,1))
 .keyBy(0)
 .sum(1)

//4. Print the result to the console
counts.print()

//5. Execute the streaming job
env.execute("Window Stream WordCount")

This mode watches the files under the monitored directory; if a file changes, it is re-read in full, which can lead to duplicate processing. In general, do not modify file contents in place; upload new files instead.

Socket-based

  • socketTextStream - Reads from a socket. Elements can be separated by a delimiter.
//1. Create the stream execution environment
val env = StreamExecutionEnvironment.getExecutionEnvironment

//2. Create the DataStream (source)
val text = env.socketTextStream("CentOS", 9999,'\n',3)

//3. Apply transformation operators to the DataStream
val counts = text.flatMap(line=>line.split("\\s+"))
 .map(word=>(word,1))
 .keyBy(0)
 .sum(1)

//4. Print the result to the console
counts.print()

//5. Execute the streaming job
env.execute("Window Stream WordCount")

Collection-based

Reads the data of a collection.

//1. Create the stream execution environment
val env = StreamExecutionEnvironment.getExecutionEnvironment

//2. Create the DataStream (source)
val text = env.fromCollection(List("this is a demo","hello word"))

//3. Apply transformation operators to the DataStream
val counts = text.flatMap(line=>line.split("\\s+"))
 .map(word=>(word,1))
 .keyBy(0)
 .sum(1)

//4. Print the result to the console
counts.print()

//5. Execute the streaming job
env.execute("Window Stream WordCount")

UserDefinedSource

Users can implement their own input sources.

  • SourceFunction
import org.apache.flink.streaming.api.functions.source.SourceFunction
import scala.util.Random

class UserDefinedNonParallelSourceFunction extends SourceFunction[String]{

  @volatile //prevent threads from working on a cached copy of this flag
  var isRunning:Boolean=true

  val lines:Array[String]=Array("this is a demo","hello world","ni hao ma")

  //emission loop of the source; send data downstream via sourceContext.collect
  override def run(sourceContext: SourceFunction.SourceContext[String]): Unit = {
    while(isRunning){
      Thread.sleep(100)
      //send a random line downstream
      sourceContext.collect(lines(new Random().nextInt(lines.size)))
    }
  }

  //stop the source and release resources
  override def cancel(): Unit = {
    isRunning=false
  }
}
  • ParallelSourceFunction
import org.apache.flink.streaming.api.functions.source.{ParallelSourceFunction, SourceFunction}
import scala.util.Random

class UserDefinedParallelSourceFunction extends ParallelSourceFunction[String]{

  @volatile //prevent threads from working on a cached copy of this flag
  var isRunning:Boolean=true

  val lines:Array[String]=Array("this is a demo","hello world","ni hao ma")

  //emission loop of the source; send data downstream via sourceContext.collect
  override def run(sourceContext: SourceFunction.SourceContext[String]): Unit = {
    while(isRunning){
      Thread.sleep(100)
      //send a random line downstream
      sourceContext.collect(lines(new Random().nextInt(lines.size)))
    }
  }

  //stop the source and release resources
  override def cancel(): Unit = {
    isRunning=false
  }
}

Test

//1. Create the stream execution environment
val env = StreamExecutionEnvironment.getExecutionEnvironment

env.setParallelism(4)

//2. Create the DataStream from the user-defined SourceFunction
val text = env.addSource[String](new UserDefinedParallelSourceFunction) //or new UserDefinedNonParallelSourceFunction

//3. Apply transformation operators to the DataStream
val counts = text.flatMap(line=>line.split("\\s+"))
 .map(word=>(word,1))
 .keyBy(0)
 .sum(1)

//4. Print the result to the console
counts.print()

println(env.getExecutionPlan) //print the execution plan

//5. Execute the streaming job
env.execute("Window Stream WordCount")

Kafka Integration

  • Add the Maven dependency

<dependency>
	<groupId>org.apache.flink</groupId>
	<artifactId>flink-connector-kafka_2.11</artifactId>
	<version>1.10.0</version>
</dependency>
  • SimpleStringSchema
    SimpleStringSchema only deserializes the value of the Kafka record.
//1. Create the stream execution environment
val env = StreamExecutionEnvironment.getExecutionEnvironment

//2. Create the DataStream (source)
val props = new Properties()
props.setProperty("bootstrap.servers", "CentOS:9092")
props.setProperty("group.id", "g1")

val text = env.addSource(new FlinkKafkaConsumer[String]("topic01",new SimpleStringSchema(),props))

//3. Apply transformation operators to the DataStream
val counts = text.flatMap(line=>line.split("\\s+"))
 .map(word=>(word,1))
 .keyBy(0)
 .sum(1)

//4. Print the result to the console
counts.print()

//5. Execute the streaming job
env.execute("Window Stream WordCount")
  • KafkaDeserializationSchema
package com.zb.datasorce

import org.apache.flink.api.common.typeinfo.TypeInformation
import org.apache.flink.streaming.connectors.kafka.KafkaDeserializationSchema
import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.flink.api.scala._

/**
  * Deserializes each record and exposes the message key, value, partition and offset
  */
class MyKafkaDeserializationSchema extends KafkaDeserializationSchema[(String, String, Int, Long)] {
  //whether this record marks the end of the stream
  override def isEndOfStream(t: (String, String, Int, Long)): Boolean = false

  //extract the fields from the record
  override def deserialize(consumerRecord: ConsumerRecord[Array[Byte], Array[Byte]]): (String, String, Int, Long) = {
    //if the key is not null, decode it
    if (consumerRecord.key() != null) {
      (new String(consumerRecord.key()),new String(consumerRecord.value()),consumerRecord.partition(),consumerRecord.offset())
    } else {
      //key is null: return an empty string as the key
      ("",new String(consumerRecord.value()),consumerRecord.partition(),consumerRecord.offset())
    }
  }
  override def getProducedType: TypeInformation[(String, String, Int, Long)] = {
    createTypeInformation[(String, String, Int, Long)]
  }
}

Test

package com.zb.datasorce

import java.util.Properties
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer

object KafkaDataSource02 {
  def main(args: Array[String]): Unit = {
    //1. Create the stream execution environment
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    //Kafka configuration
    val properties = new Properties()
    properties.setProperty("bootstrap.servers", "CentOS:9092")
    // only required for Kafka 0.8
    properties.setProperty("zookeeper.connect", "CentOS:2181")
    properties.setProperty("group.id", "g1")
    //2. Add the source
    val text = env.addSource(new FlinkKafkaConsumer[(String,String,Int,Long)]("topic01", new MyKafkaDeserializationSchema(), properties))
    //3. Apply transformation operators to the DataStream
    val counts = text.flatMap(line=>{
      println(line)
      line._2.split("\\s+")})
      .map(word=>(word,1))
      .keyBy(0)
      .sum(1)
    //4. Print the result to the console
    counts.print()

    //5. Execute the streaming job
    env.execute("Window Stream WordCount")
  }
}

  • JSONKeyValueDeserializationSchema
    Requires both the key and the value of the Kafka topic to be JSON. When constructing it, you can also specify whether to include metadata (topic, partition, offset, etc.).
//1. Create the stream execution environment
val env = StreamExecutionEnvironment.getExecutionEnvironment

//2. Create the DataStream (source)
val props = new Properties()
props.setProperty("bootstrap.servers", "CentOS:9092")
props.setProperty("group.id", "g1")

//record value, e.g. {"id":1,"name":"zhangsan"}
val text = env.addSource(new FlinkKafkaConsumer[ObjectNode]("topic01",new JSONKeyValueDeserializationSchema(true),props))

//t: {"value":{"id":1,"name":"zhangsan"},"metadata":{"offset":0,"topic":"topic01","partition":13}}
text.map(t=> (t.get("value").get("id").asInt(),t.get("value").get("name").asText()))
 .print()

//5. Execute the streaming job
env.execute("Window Stream WordCount")

For details, see: https://ci.apache.org/projects/flink/flink-docs-release-1.10/dev/connectors/kafka.html

Data Sinks

Data sinks consume DataStreams and forward them to files, sockets, external systems, or print them. Flink comes with a variety of built-in output formats that are encapsulated behind operations on DataStreams.

File-based

  • writeAsText() / TextOutputFormat - Writes elements line-wise as strings. The strings are obtained by calling the toString() method of each element.
  • writeAsCsv(…) / CsvOutputFormat - Writes tuples as comma-separated value files. Row and field delimiters are configurable. The value for each field comes from the toString() method of the objects.
  • writeUsingOutputFormat() / FileOutputFormat - Method and base class for custom file outputs. Supports custom object-to-bytes conversion.

Note that the write*() methods on DataStream are mainly intended for debugging purposes.
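As a quick illustration of the first two methods from the list above, here is a minimal sketch (the local output paths are placeholders, not from the original text):

//debugging-only sketch of the write*() sinks
val env = StreamExecutionEnvironment.getExecutionEnvironment

val counts = env.socketTextStream("CentOS", 9999)
  .flatMap(_.split("\\s+"))
  .map((_, 1))
  .keyBy(0)
  .sum(1)

//one element per line, rendered with toString()
counts.writeAsText("file:///tmp/flink-results-text")

//tuple fields separated by commas, one tuple per line
counts.writeAsCsv("file:///tmp/flink-results-csv")

env.execute("Write Sink Demo")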

//1. Create the stream execution environment
val env = StreamExecutionEnvironment.getExecutionEnvironment

//2. Create the DataStream (source)
val text = env.socketTextStream("CentOS", 9999)

//3. Apply transformation operators to the DataStream
val counts = text.flatMap(line=>line.split("\\s+"))
 .map(word=>(word,1))
 .keyBy(0)
 .sum(1)

//4. Write the result to the file system
counts.writeUsingOutputFormat(new TextOutputFormat[(String, Int)](new Path("file:///Users/admin/Desktop/flink-results")))

//5. Execute the streaming job
env.execute("Window Stream WordCount")

Note: if you change the path to HDFS, you need to generate a fairly large amount of data before any output becomes visible, because the HDFS write buffer is large. Also, the file sinks above do not participate in Flink's checkpointing; in production, write to external file systems through flink-connector-filesystem instead.

<dependency>
	<groupId>org.apache.flink</groupId>
	<artifactId>flink-connector-filesystem_2.11</artifactId>
	<version>1.10.0</version>
</dependency>
  • StreamingFileSink (bucketing sink)
//1. Create the stream execution environment
val env = StreamExecutionEnvironment.getExecutionEnvironment

//2. Create the DataStream (source)
val text = env.readTextFile("hdfs://CentOS:9000/demo/words")

var bucketingSink=StreamingFileSink.forRowFormat(new Path("hdfs://CentOS:9000/bucket-results"),
        new SimpleStringEncoder[(String,Int)]("UTF-8"))
    .withBucketAssigner(new DateTimeBucketAssigner[(String, Int)]("yyyy-MM-dd"))//derive the bucket (output path) from the date
    .build()

//3. Apply transformation operators to the DataStream
val counts = text.flatMap(line=>line.split("\\s+"))
 .map(word=>(word,1))
 .keyBy(0)
 .sum(1)

counts.addSink(bucketingSink)

//5. Execute the streaming job
env.execute("Window Stream WordCount")

Legacy API (BucketingSink)

//1. Create the stream execution environment
val env = StreamExecutionEnvironment.getExecutionEnvironment

env.setParallelism(4)

//2. Create the DataStream (source)
val text = env.readTextFile("hdfs://CentOS:9000/demo/words")

var bucketingSink=new BucketingSink[(String,Int)]("hdfs://CentOS:9000/bucket-results")
bucketingSink.setBucketer(new DateTimeBucketer[(String,Int)]("yyyy-MM-dd"))
bucketingSink.setBatchSize(1024)

//3. Apply transformation operators to the DataStream
val counts = text.flatMap(line=>line.split("\\s+"))
 .map(word=>(word,1))
 .keyBy(0)
 .sum(1)

counts.addSink(bucketingSink)

//5. Execute the streaming job
env.execute("Window Stream WordCount")

print() / printToErr()

Prints the toString() value of each element to standard output / standard error. Optionally, a prefix (msg) can be provided, e.g. print("msg"), which makes it easier to tell different outputs apart while debugging. If the parallelism is greater than 1, the output is also prefixed with the index of the subtask that produced it.

//1. Create the stream execution environment
val env = StreamExecutionEnvironment.getExecutionEnvironment
env.setParallelism(4)
//2. Create the DataStream (source)
val text = env.readTextFile("hdfs://CentOS:9000/demo/words")
//3. Apply transformation operators to the DataStream
val counts = text.flatMap(line=>line.split("\\s+"))
 .map(word=>(word,1))
 .keyBy(0)
 .sum(1)
//4. Print to standard error with the prefix "debug"
counts.printToErr("debug").setParallelism(2)
//5. Execute the streaming job
env.execute("Window Stream WordCount")

UserDefinedSinkFunction

Users can implement custom sinks as needed by extending RichSinkFunction.

import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.functions.sink.{RichSinkFunction, SinkFunction}

class UserDefinedSinkFunction extends RichSinkFunction[(String,Int)]{

  override def open(parameters: Configuration): Unit = {
    println("opening connection...")
  }

  override def invoke(value: (String, Int), context: SinkFunction.Context[_]): Unit = {
    println("output: " + value)
  }

  override def close(): Unit = {
    println("closing connection")
  }
}

Test

//1. Create the stream execution environment
val env = StreamExecutionEnvironment.getExecutionEnvironment
env.setParallelism(1)
//2. Create the DataStream (source)
val text = env.readTextFile("hdfs://CentOS:9000/demo/words")
//3. Apply transformation operators to the DataStream
val counts = text.flatMap(line=>line.split("\\s+"))
 .map(word=>(word,1))
 .keyBy(0)
 .sum(1)
//4. Write the result with the custom sink
counts.addSink(new UserDefinedSinkFunction)
//5. Execute the streaming job
env.execute("Window Stream WordCount")

RedisSink

Reference: https://bahir.apache.org/docs/flink/current/flink-streaming-redis/

<dependency>
	<groupId>org.apache.bahir</groupId>
	<artifactId>flink-connector-redis_2.11</artifactId>
	<version>1.0</version>
</dependency>
//1. Create the stream execution environment
val env = StreamExecutionEnvironment.getExecutionEnvironment
env.setParallelism(1)
//2. Create the DataStream (source)
val text = env.readTextFile("hdfs://CentOS:9000/demo/words")
var flinkJedisConf = new FlinkJedisPoolConfig.Builder()
 .setHost("CentOS")
 .setPort(6379)
 .build()
//3. Apply transformation operators to the DataStream
val counts = text.flatMap(line=>line.split("\\s+"))
 .map(word=>(word,1))
 .keyBy(0)
 .sum(1)
//4. Write the result to Redis
counts.addSink(new RedisSink(flinkJedisConf,new UserDefinedRedisMapper()))
//5. Execute the streaming job
env.execute("Window Stream WordCount")

RedisMapper

import org.apache.flink.streaming.connectors.redis.common.mapper.{RedisCommand, RedisCommandDescription, RedisMapper}

class UserDefinedRedisMapper extends RedisMapper[(String,Int)]{

  //write with HSET into the Redis hash named "wordcounts"
  override def getCommandDescription: RedisCommandDescription = {
    new RedisCommandDescription(RedisCommand.HSET,"wordcounts")
  }
  //hash field: the word
  override def getKeyFromData(data: (String, Int)): String = data._1
  //hash value: the count, as a string
  override def getValueFromData(data: (String, Int)): String = data._2+""
}

KafkaSink

Required dependency

<dependency>
	<groupId>org.apache.flink</groupId>
	<artifactId>flink-connector-kafka_2.11</artifactId>
	<version>1.10.0</version>
</dependency>

Option 1: KafkaSerializationSchema

import java.lang
import org.apache.flink.streaming.connectors.kafka.KafkaSerializationSchema
import org.apache.kafka.clients.producer.ProducerRecord

class UserDefinedKafkaSerializationSchema extends KafkaSerializationSchema[(String,Int)]{

  //serialize each tuple into a record of topic01 (key = word, value = count)
  override def serialize(element: (String, Int), timestamp: lang.Long): ProducerRecord[Array[Byte], Array[Byte]] = {
    new ProducerRecord("topic01",element._1.getBytes(),element._2.toString.getBytes())
  }
}
//1. Create the stream execution environment
val env = StreamExecutionEnvironment.getExecutionEnvironment
env.setParallelism(4)
//2. Create the DataStream (source)
val text = env.readTextFile("hdfs://CentOS:9000/demo/words")
val props = new Properties()
props.setProperty(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "CentOS:9092")
props.setProperty(ProducerConfig.BATCH_SIZE_CONFIG,"100")
props.setProperty(ProducerConfig.LINGER_MS_CONFIG,"500")
//Semantic.EXACTLY_ONCE: enables Kafka transactional (idempotent) writes
//Semantic.AT_LEAST_ONCE: relies on the Kafka producer retry mechanism
val kafkaSink = new FlinkKafkaProducer[(String, Int)]("defult_topic",
  new UserDefinedKafkaSerializationSchema, props,
  Semantic.AT_LEAST_ONCE)
//3. Apply transformation operators to the DataStream
val counts = text.flatMap(line=>line.split("\\s+"))
 .map(word=>(word,1))
 .keyBy(0)
 .sum(1)
counts.addSink(kafkaSink)
//5. Execute the streaming job
env.execute("Window Stream WordCount")

The defult_topic passed above is never actually used, because the KafkaSerializationSchema already names the target topic.

Option 2: KeyedSerializationSchema

import org.apache.flink.streaming.util.serialization.KeyedSerializationSchema

class UserDefinedKeyedSerializationSchema extends KeyedSerializationSchema[(String,Int)]{

  override def serializeKey(element: (String, Int)): Array[Byte] = {
    element._1.getBytes()
  }
  override def serializeValue(element: (String, Int)): Array[Byte] = {
    element._2.toString.getBytes()
  }
  //may override the target topic per record; returning null writes to the default topic
  override def getTargetTopic(element: (String, Int)): String = {
    null
  }
}
//1. Create the stream execution environment
val env = StreamExecutionEnvironment.getExecutionEnvironment
env.setParallelism(4)
//2. Create the DataStream (source)
val text = env.readTextFile("hdfs://CentOS:9000/demo/words")
val props = new Properties()
props.setProperty(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "CentOS:9092")
props.setProperty(ProducerConfig.BATCH_SIZE_CONFIG,"100")
props.setProperty(ProducerConfig.LINGER_MS_CONFIG,"500")
//Semantic.EXACTLY_ONCE: enables Kafka transactional (idempotent) writes
//Semantic.AT_LEAST_ONCE: relies on the Kafka producer retry mechanism
val kafkaSink = new FlinkKafkaProducer[(String, Int)]("defult_topic",
  new UserDefinedKeyedSerializationSchema, props, Semantic.AT_LEAST_ONCE)
//3. Apply transformation operators to the DataStream
val counts = text.flatMap(line=>line.split("\\s+"))
 .map(word=>(word,1))
 .keyBy(0)
 .sum(1)
counts.addSink(kafkaSink)
//5. Execute the streaming job
env.execute("Window Stream WordCount")

Operators

DataStream Transformations

DataStream → DataStream

Map
Takes one element and produces one element. For example, a map function that doubles the values of the input stream:

dataStream.map { x => x * 2 }

FlatMap
Takes one element and produces zero, one, or more elements. For example, a flatMap function that splits sentences into words:

dataStream.flatMap { str => str.split(" ")}

Filter
Evaluates a boolean function for each element and retains those for which the function returns true. For example, a filter that drops zero values:

dataStream.filter { _ != 0 }

DataStream* → DataStream (many into one)

Union
Union of two or more data streams, creating a new stream containing all the elements from all of the streams.

Note: if you union a data stream with itself, you will get each element twice in the resulting stream. A complete example follows the snippet below.

dataStream.union(otherStream1, otherStream2, ...)
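A full example in the style of the other snippets in this section (a minimal sketch; the host and ports are the same placeholders used elsewhere in this chapter):

val env = StreamExecutionEnvironment.getExecutionEnvironment

val s1 = env.socketTextStream("CentOS", 9999)
val s2 = env.socketTextStream("CentOS", 8888)

//both inputs must carry the same element type; the result interleaves all elements
s1.union(s2)
  .flatMap(_.split("\\s+"))
  .map((_, 1))
  .keyBy(0)
  .sum(1)
  .print()

env.execute("Union Stream WordCount")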

DataStream, DataStream → ConnectedStreams
connect: "connects" two data streams, retaining their types and allowing state to be shared between the two streams.

someStream : DataStream[Int] = ...
otherStream : DataStream[String] = ...
val connectedStreams = someStream.connect(otherStream)

ConnectedStreams → DataStream
CoMap, CoFlatMap
Similar to map and flatMap on a connected data stream:

connectedStreams.map(
 (_ : Int) => true,
 (_ : String) => false
)
connectedStreams.flatMap(
 (_ : Int) => true,
 (_ : String) => false
)
val env = StreamExecutionEnvironment.getExecutionEnvironment
 val text1 = env.socketTextStream("CentOS", 9999)
 val text2 = env.socketTextStream("CentOS", 8888)
 text1.connect(text2)
 .flatMap((line:String)=>line.split("\\s+"),(line:String)=>line.split("\\s+"))
 .map((_,1))
 .keyBy(0)
 .sum(1)
 .print("总数")
 env.execute("Stream WordCount")

DataStream → SplitStream

Split - splits the stream into two or more streams according to some criterion:

val split = someDataStream.split(
 (num: Int) =>
 (num % 2) match {
 case 0 => List("even")
 case 1 => List("odd")
 }
)

SplitStream → DataStream

Select - selects one or more output streams from a split stream:

val even = split.select("even")
val odd = split.select("odd")
val all = split.select("even","odd")
val env = StreamExecutionEnvironment.getExecutionEnvironment
 val text1 = env.socketTextStream("CentOS", 9999)
 var splitStream= text1.split(line=> {
 if(line.contains("error")){
 List("error")
 } else{
 List("info")
 }
 })
 splitStream.select("error").printToErr("错误")
 splitStream.select("info").print("信息")
 splitStream.select("error","info").print("All")
 env.execute("Stream WordCount")

ProcessFunction
In general, splitting a stream is more commonly done with a ProcessFunction and side outputs.

 val env = StreamExecutionEnvironment.getExecutionEnvironment
 val text = env.socketTextStream("CentOS", 9999)
 val errorTag = new OutputTag[String]("error")
 val allTag = new OutputTag[String]("all")
 val infoStream = text.process(new ProcessFunction[String, String] {
   override def processElement(value: String, ctx: ProcessFunction[String, String]#Context, out: Collector[String]): Unit = {
     if (value.contains("error")) {
       ctx.output(errorTag, value) //side output for error records
     } else {
       out.collect(value) //normal records go to the main output
     }
     ctx.output(allTag, value) //side output that receives every record
   }
 })
 infoStream.getSideOutput(errorTag).printToErr("error")
 infoStream.getSideOutput(allTag).printToErr("all")
 infoStream.print("normal")
 env.execute("Stream WordCount")

DataStream → KeyedStream

keyBy
Logically partitions a stream into disjoint partitions; each partition contains elements with the same key. Internally this is implemented with hash partitioning:

dataStream.keyBy("someKey") // Key by field "someKey"
dataStream.keyBy(0) // Key by the first element of a Tuple
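Besides field names and tuple indices, the Scala API also accepts a key-selector function, which is often clearer. A minimal sketch (assuming the usual import org.apache.flink.streaming.api.scala._):

val env = StreamExecutionEnvironment.getExecutionEnvironment

env.socketTextStream("CentOS", 9999)
  .flatMap(_.split("\\s+"))
  .map((_, 1))
  .keyBy(_._1)   //key selector: group by the word itself
  .sum(1)
  .print()

env.execute("KeyBy Selector WordCount")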

KeyedStream → DataStream

Reduce
A "rolling" reduce on a keyed data stream. Combines the current element with the last reduced value and emits the new value:

keyedStream.reduce(_ + _)
val env = StreamExecutionEnvironment.getExecutionEnvironment
 val lines = env.socketTextStream("CentOS", 9999)
 lines.flatMap(_.split("\\s+"))
 .map((_,1))
 .keyBy("_1")
 .reduce((v1,v2)=>(v1._1,v1._2+v2._2))
 .print()
 env.execute("Stream WordCount")

Fold
A "rolling" fold on a keyed data stream with an initial value. Combines the current element with the last folded value and emits the new value. When applied to the sequence (1,2,3,4,5), a fold with initial value "start" emits "start-1", "start-1-2", "start-1-2-3", "start-1-2-3-4", ...

val result: DataStream[String] =
keyedStream.fold("start")((str, i) => { str + "-" + i })
val env = StreamExecutionEnvironment.getExecutionEnvironment
val lines = env.socketTextStream("CentOS", 9999)
lines.flatMap(_.split("\\s+"))
 .map((_,1))
 .keyBy("_1")
 .fold((null:String,0:Int))((z,v)=>(v._1,v._2+z._2))
 .print()
env.execute("Stream WordCount")

Aggregations
Rolling aggregations on a keyed data stream. The difference between min and minBy is that min returns the minimum value, whereas minBy returns the element that has the minimum value in the given field (the same holds for max and maxBy):

keyedStream.sum(0)
keyedStream.sum("key")
keyedStream.min(0)
keyedStream.min("key")
keyedStream.max(0)
keyedStream.max("key")
keyedStream.minBy(0)
keyedStream.minBy("key")
keyedStream.maxBy(0)
keyedStream.maxBy("key")
 val env = StreamExecutionEnvironment.getExecutionEnvironment
 //sample input lines (assumed case class: Emp(name: String, dept: String, salary: Double)):
 //zhangsan R&D 1000
 //lisi R&D 5000
 //ww Sales 9000
 val lines = env.socketTextStream("CentOS", 9999)
 lines.map(line=>line.split(" "))
 .map(ts=>Emp(ts(0),ts(1),ts(2).toDouble))
 .keyBy("dept")
 .maxBy("salary")//Emp(lisi,R&D,5000.0)
 .print()
 env.execute("Stream WordCount")

If max is used instead, the result is Emp(zhangsan,R&D,5000.0): max only tracks the maximum value of the field, while the other fields come from the first element.

Physical partitioning

Flink also gives low-level control (if desired) over how the transformed DataStream is partitioned, via the following functions:

  • Rebalancing (round-robin partitioning): partitions elements round-robin, creating equal load per partition. Useful for performance optimization in the presence of data skew.
dataStream.rebalance()
  • Random partitioning: partitions elements randomly according to a uniform distribution.
dataStream.shuffle()
  • Rescaling: like round-robin partitioning, rescaling redistributes data in a round-robin fashion, but only between the upstream and downstream subtasks that are wired together, instead of shuffling globally across the network. The concrete assignment depends on the parallelism of the upstream and downstream operators. For example, with an upstream parallelism of 2 and a downstream parallelism of 4, one upstream subtask distributes its data round-robin to two fixed downstream subtasks, and the other upstream subtask does the same with the remaining two (see the sketch after the custom-partitioning example below).
dataStream.rescale()
  • Broadcasting: similar to broadcast variables in Spark; each element is broadcast to every partition.
dataStream.broadcast
  • Custom partitioning: custom partitioning strategies can be defined as needed:
dataStream.partitionCustom(partitioner, "someKey")
dataStream.partitionCustom(partitioner, 0)
 val env = StreamExecutionEnvironment.getExecutionEnvironment
 env.socketTextStream("CentOS", 9999)
 .map((_,1))
 .partitionCustom(new Partitioner[String] {
   override def partition(key: String, numPartitions: Int): Int = {
     //mask the sign bit before taking the modulo
     (key.hashCode & Integer.MAX_VALUE) % numPartitions
   }
  },_._1)
 .print()
 .setParallelism(4)
 println(env.getExecutionPlan)
 env.execute("Stream WordCount")

Task chaining and resource groups

Chaining two successive operators means placing them in the same thread, avoiding unnecessary thread hand-over overhead and improving performance. Flink chains operators by default whenever possible. If needed, the user can call:

StreamExecutionEnvironment.disableOperatorChaining()

to disable chaining globally, although this is not recommended.

  • startNewChain: starts a new chain with the first map, isolating it from the preceding filter and overriding the default chaining:
someStream.filter(...).map(...).startNewChain().map(...)
  • disableChaining
someStream.map(...).disableChaining()

No operator is allowed to chain with this map operator.

  • slotSharingGroup: sets the slot sharing group of an operation. Flink places operators with the same slot sharing group into the same task slot and keeps operators without one in other task slots. This can be used to isolate task slots. Downstream operators automatically inherit the slot sharing group of their inputs. By default, every operator is in the slot sharing group named default, so when the user does not partition resources explicitly, the number of slots a job needs equals the maximum parallelism among its tasks. A combined sketch follows the code line below.
someStream.filter(...).slotSharingGroup("name")
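Putting the three knobs together, a minimal sketch (the operator layout and the group name "heavy" are illustrative, not from the original text):

val env = StreamExecutionEnvironment.getExecutionEnvironment

env.socketTextStream("CentOS", 9999)
  .filter(_.nonEmpty)
  .map(_.toLowerCase)
  .startNewChain()            //the chain is broken between filter and this map
  .flatMap(_.split("\\s+"))
  .disableChaining()          //this flatMap is never chained with its neighbours
  .map((_, 1))
  .slotSharingGroup("heavy")  //this map and downstream operators use the "heavy" group
  .keyBy(0)
  .sum(1)
  .print()

env.execute("Chaining Demo")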
