A Data Sink is where data lands. As shown in the figure above, the Source is where the data comes from, the Compute stage in the middle is what Flink itself does (a series of transformations), and once those transformations are done the computed results are sunk somewhere. "Sink" does not necessarily mean storing the data somewhere; the official docs use the term Connector for the destination, which fits better. Connectors exist for MySQL, ElasticSearch, Kafka, Cassandra, RabbitMQ, and more.
Maven dependencies
<properties>
    <flink.version>1.7.2</flink.version>
</properties>
<dependencies>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-java</artifactId>
        <version>${flink.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-scala_2.11</artifactId>
        <version>${flink.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-streaming-java_2.11</artifactId>
        <version>${flink.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-streaming-scala_2.11</artifactId>
        <version>${flink.version}</version>
    </dependency>
</dependencies>
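With these dependencies in place, every job has the Source → Compute → Sink shape described above. A minimal sketch (the object name and sample data here are illustrative, not from the original):

package blog.sink

import org.apache.flink.streaming.api.scala._

object PipelineSketch {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.fromElements(1, 2, 3) // Source: where the data comes from
      .map(_ * 2)             // Compute: the transformations Flink performs
      .print()                // Sink: where the results land
    env.execute("PipelineSketch")
  }
}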
A sink is where Flink sends the data after transformation; you may want to write it to a file or simply print it.
The built-in sinks Flink supports:
Direct printing
print() / printToErr()
Prints the toString() value of each element to standard output / standard error.
File-system based
writeAsText() / TextOutputFormat
Writes elements line by line as strings, obtained by calling each element's toString() method.
writeAsCsv(…) / CsvOutputFormat
Writes tuples to comma-separated value (CSV) files. Row and field delimiters are configurable. Each field's value comes from the object's toString() method.
write() / FileOutputFormat
Method and base class for custom file output. Supports custom object-to-bytes conversion.
output() / OutputFormat
The most general output method, for sinks that are not file-based (e.g. storing results in a database); a minimal sketch follows this list.
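The examples below exercise print()/printToErr() and the file-based methods; for output(), here is a minimal sketch of a custom OutputFormat (the class name and the println body are illustrative; a real implementation would open a database connection in open() and issue writes in writeRecord()):

import org.apache.flink.api.common.io.OutputFormat
import org.apache.flink.configuration.Configuration

// Illustrative OutputFormat that just prints each record
class PrintlnOutputFormat extends OutputFormat[(Int, String, Double)] {
  override def configure(parameters: Configuration): Unit = {}
  override def open(taskNumber: Int, numTasks: Int): Unit = {}
  override def writeRecord(record: (Int, String, Double)): Unit = println(record)
  override def close(): Unit = {}
}

// usage: stu.output(new PrintlnOutputFormat)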
SinkStandardOutput.scala
package blog.sink

import org.apache.flink.api.scala.{DataSet, ExecutionEnvironment}
import org.apache.flink.api.scala._

/**
 * @Author Daniel
 * @Description Flink Sink -- writing data to standard output
 **/
object SinkStandardOutput {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment
    val stu: DataSet[(Int, String, Double)] = env.fromElements(
      (19, "Wilson", 178.8),
      (17, "Edith", 168.8),
      (18, "Joyce", 174.8),
      (18, "May", 195.8),
      (18, "Gloria", 182.7),
      (21, "Jessie", 184.8)
    )
    println("-------------sink to standard output--------------------")
    stu.print()
    println("-------------sink to standard error--------------------")
    stu.printToErr()
    println("-------------sink to a local collection--------------------")
    print(stu.collect())
  }
}
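Note that in the DataSet API, print(), printToErr(), and collect() are eager sinks: each one triggers execution of the program immediately, which is why this example needs no env.execute() call.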
SinkFile.scala
package blog.sink

import org.apache.flink.api.common.operators.Order
import org.apache.flink.api.scala.{DataSet, ExecutionEnvironment, _}
import org.apache.flink.core.fs.FileSystem.WriteMode

/**
 * @Author Daniel
 * @Description Flink Sink -- writing data to files
 **/
object SinkFile {
  def main(args: Array[String]): Unit = {
    // Set the user name to avoid HDFS permission errors
    System.setProperty("HADOOP_USER_NAME", "hadoop")
    val env = ExecutionEnvironment.getExecutionEnvironment
    val stu: DataSet[(Int, String, Double)] = env.fromElements(
      (19, "Wilson", 178.8),
      (17, "Edith", 168.8),
      (18, "Joyce", 174.8),
      (18, "May", 195.8),
      (18, "Gloria", 182.7),
      (21, "Jessie", 184.8)
    )
    println("-------------sort by age, ascending (0->9)----------")
    stu.sortPartition(0, Order.ASCENDING).print()
    println("-------------sort by name, descending (z->a)----------")
    stu.sortPartition(1, Order.DESCENDING).print()
    println("-------------sort by age ascending, then height descending----------")
    stu.sortPartition(0, Order.ASCENDING)
      .sortPartition(2, Order.DESCENDING)
      .print()
    println("-------------sort all fields ascending----------")
    stu.sortPartition("_", Order.ASCENDING).print()

    case class Student(name: String, age: Int)
    val ds1: DataSet[(Student, Double)] = env.fromElements(
      (Student("Wilson", 19), 178.8),
      (Student("Edith", 17), 168.8),
      (Student("Joyce", 18), 174.8),
      (Student("May", 18), 195.8),
      (Student("Gloria", 18), 182.7),
      (Student("Jessie", 21), 184.8)
    )
    // Sort by Student.age, ascending
    // With parallelism > 1 the path is treated as a directory name; with parallelism = 1 it is treated as a file name
    val ds2 = ds1.sortPartition("_1.age", Order.ASCENDING).setParallelism(1)
    // Write to HDFS as a plain-text file
    val output1 = "hdfs://bdedev/flink/Student001.txt"
    // NO_OVERWRITE fails if the file already exists; OVERWRITE replaces it
    ds2.writeAsText(output1, WriteMode.OVERWRITE)
    env.execute()
    // Write to HDFS as a CSV file
    val output2 = "hdfs://bdedev/flink/Student002.csv"
    ds2.writeAsCsv(output2, rowDelimiter = "\n", fieldDelimiter = "|||", writeMode = WriteMode.OVERWRITE)
    env.execute()
  }
}
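Unlike print(), the writeAsText and writeAsCsv sinks are lazy in the DataSet API, so each write must be followed by env.execute(); the two calls above run as two separate batch jobs. Given the ||| field delimiter, each CSV row should look roughly like Student(Edith,17)|||168.8, since non-tuple fields are rendered via toString().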
Common custom sink targets include Apache Kafka, RabbitMQ, MySQL, ElasticSearch, Apache Cassandra, Hadoop FileSystem, and so on; in the same way, you can define your own Sink.
MySink.scala
package blog.sink

import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.functions.sink.{RichSinkFunction, SinkFunction}

import scala.collection.mutable

/**
 * @Author Daniel
 * @Description Flink Sink -- decides whether to add each element to a collection
 **/
class MySink extends RichSinkFunction[(Boolean, (String, Double))] {
  private var resultSet: mutable.Set[(String, Double)] = _

  // Called once, before any elements are processed
  override def open(parameters: Configuration): Unit = {
    // Initialize the in-memory store
    resultSet = new mutable.HashSet[(String, Double)]
  }

  // Called once per element
  override def invoke(v: (Boolean, (String, Double)), context: SinkFunction.Context[_]): Unit = {
    // Core logic: keep the payload only when the flag is true
    if (v._1) {
      resultSet.add(v._2)
    }
  }

  // Called once, at the end
  override def close(): Unit = {
    // Print everything that was kept
    resultSet.foreach(println)
  }
}
TestMySink.scala
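A minimal driver that wires MySink into a streaming job might look like the following sketch; the sample data and the interpretation of the Boolean flag (keep or drop the payload) are assumptions for illustration:

package blog.sink

import org.apache.flink.streaming.api.scala._

object TestMySink {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    // The Boolean flag tells MySink whether to keep the (name, height) payload
    val data: DataStream[(Boolean, (String, Double))] = env.fromElements(
      (true, ("Wilson", 178.8)),
      (false, ("Edith", 168.8)),
      (true, ("Joyce", 174.8))
    )
    data.addSink(new MySink)
    env.execute("TestMySink")
  }
}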