Flink: Real-Time Data Processing (3. The Flink Streaming API)

Table of Contents

  • 1. Environment (creating the Flink execution environment)
    • 1.1 getExecutionEnvironment()
    • 1.2 createLocalEnvironment(): create a local execution environment
    • 1.3 createRemoteEnvironment(): create a remote execution environment
  • 2. Source (reading input streams)
    • 2.1 Reading from a collection
    • 2.2 Reading from a file
    • 2.3 Reading from Kafka
    • 2.4 Custom source
  • 3. Transform (transformation operators)
    • 3.1 Basic transformations
      • 3.1.1 Map
      • 3.1.2 Filter
      • 3.1.3 FlatMap
    • 3.2 KeyedStream transformations
      • 3.2.1 KeyBy
      • 3.2.2 Reduce
    • 3.3 Multi-stream transformations
      • 3.3.1 Union
      • 3.3.2 Connect
      • 3.3.3 Split
    • 3.4 Distribution transformations
      • 3.4.1 Random
      • 3.4.2 Round-Robin
      • 3.4.3 Rescale
      • 3.4.4 Broadcast
      • 3.4.5 Global
      • 3.4.6 Custom
    • 3.5 Setting parallelism
    • 3.6 Implementing UDF functions
      • 3.6.1 Function classes
      • 3.6.2 Lambda functions
      • 3.6.3 Rich functions
  • 4. Sink (writing output streams)
    • 4.1 Kafka
    • 4.2 Redis
    • 4.3 ElasticSearch
    • 4.4 Custom JDBC sink

A Flink streaming program has the following structure (a minimal skeleton follows this list):

  1. Create the Flink execution environment.
  2. Read one or more input streams from data sources.
  3. Apply transformation operators to implement the business logic.
  4. Write the results to one or more external systems (optional).
  5. Execute the program.
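
A minimal sketch of these five steps (the socket host and port below are illustrative assumptions, not from the original text):

import org.apache.flink.streaming.api.scala._

object StreamingSkeleton {
  def main(args: Array[String]): Unit = {
    // 1. Create the execution environment
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // 2. Read an input stream (here: a text socket, assumed to be on localhost:9999)
    val lines = env.socketTextStream("localhost", 9999)

    // 3. Apply transformation operators
    val upperCased = lines.map(_.toUpperCase)

    // 4. Write the results to an external system (here: stdout)
    upperCased.print()

    // 5. Execute the program
    env.execute("streaming skeleton")
  }
}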

1. Environment (creating the Flink execution environment)

1.1 getExecutionEnvironment()

Call the static getExecutionEnvironment() method to obtain an execution environment:

1. Batch processing:

val env: ExecutionEnvironment =
  ExecutionEnvironment.getExecutionEnvironment

2. Stream processing:

val env: StreamExecutionEnvironment =
  StreamExecutionEnvironment.getExecutionEnvironment

1.2 createLocalEnvironment(): create a local execution environment

val localEnv = StreamExecutionEnvironment
  .createLocalEnvironment()

1.3 createRemoteEnvironment(): create a remote execution environment

val remoteEnv = StreamExecutionEnvironment
  .createRemoteEnvironment(
    "host",               // hostname of the JobManager
    123456,               // port of the JobManager process
    "path/to/jarFile.jar" // JAR file to ship to the JobManager
  )

2. Source (reading input streams)

2.1 Reading from a collection

val stream = env
  .fromCollection(List(
    SensorReading("sensor_1", 1547718199, 35.80018327300259),
    SensorReading("sensor_6", 1547718199, 15.402984393403084),
    SensorReading("sensor_7", 1547718199, 6.720945201171228),
    SensorReading("sensor_10", 1547718199, 38.101067604893444)
  ))

2.2 Reading from a file

import org.apache.flink.streaming.api.scala._

/**
 * @Author jaffe
 * @Date 2020/06/09  10:57
 */
object SourceFromFile {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(1)

    val stream = env
      .readTextFile("F:\\ide\\moven\\flink0608\\src\\main\\resources\\test")
      .map(r => {
        val arr = r.split(",")
        SensorReading(arr(0), arr(1).toLong, arr(2).toDouble)
      })

    stream.print()
    env.execute()

  }
}

case class SensorReading(id: String,
                         timestamp: Long,
                         temperature: Double)

2.3 Reading from Kafka

import java.util.Properties
import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.connectors.kafka.{FlinkKafkaConsumer011, FlinkKafkaProducer011}

/**
 * @Author jaffe
 * @Date 2020/06/10  00:20
 */
 
object KafkaExample {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(1)

    val props = new Properties()
    props.put("bootstrap.servers","hadoop103:9092")
    props.put("group.id","consumer-group")
    props.put(
      "key.deserializer",
      "org.apache.kafka.common.serialization.StringDeserializer"
    )
    props.put(
      "value.deserializer",
      "org.apache.kafka.common.serialization.StringDeserializer"
    )

    props.put("auto.offset.reset","latest")

    val stream = env
      .addSource(
        new FlinkKafkaConsumer011[String](
          "test", // topic
          new SimpleStringSchema(),
          props
        )
      )
    stream.print()
    env.execute()

  }
}

2.4 Custom source

import java.util.Calendar
import org.apache.flink.streaming.api.functions.source.{RichParallelSourceFunction, SourceFunction}
import scala.util.Random

/**
 * @Author jaffe
 * @Date 2020/06/09  10:57
 */

// (sensor ID, timestamp, temperature)
case class SensorReading(id: String,
                         timestamp: Long,
                         temperature: Double)

// Continuously produces temperature readings, i.e. it generates a data stream.
// A custom source implements `RichParallelSourceFunction`.
// The events produced by this source have type `SensorReading`.
class SensorSource extends RichParallelSourceFunction[SensorReading]{
  // Flag indicating whether the source is still running; `true` means running
  var running = true
  // The `run` method continuously emits `SensorReading` records
  // using the `SourceContext`
  override def run(sourceContext: SourceFunction.SourceContext[SensorReading]): Unit = {
    // Initialize the random number generator used to produce temperature readings
    val rand = new Random
    // Initialize 10 (sensor ID, temperature) tuples
    // `(1 to 10)` iterates from 1 to 10
    var curFTemp = (1 to 10).map(
      // generate a temperature reading with Gaussian noise
      i => ("sensor_"+ i, 65 + (rand.nextGaussian() * 20))
    )

    // Infinite loop that produces the data stream
    while (running) {
      // Update the temperatures
      curFTemp = curFTemp.map(t => (t._1,t._2 + (rand.nextGaussian() * 0.5)))
      // Get the current timestamp in milliseconds
      val curTime = Calendar.getInstance.getTimeInMillis

      // Emit the data by calling `collect` on the `SourceContext`;
      // Flink operators generally send data downstream via `collect`
      curFTemp.foreach(t => sourceContext.collect(SensorReading(t._1,curTime,t._2)))

      // Emit data every 100 ms
      Thread.sleep(100)
    }
  }
  // When the task is cancelled, stop the infinite loop
  override def cancel(): Unit = running = false
}

Running the custom source:

import com.jaffe.day02.SensorSource
import org.apache.flink.streaming.api.scala._

/**
 * @Author jaffe
 * @Date 2020/06/09  10:57
 */
object SourceFromCustomDataSource {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(1)

    val stream = env
      // Add the custom data source
      .addSource(new SensorSource)

    stream.print()
    env.execute()
  }
}

3. Transform (transformation operators)

3.1 Basic transformations

3.1.1 Map

Each input event is passed to a user-defined mapper; the type of the output event may differ from the type of the input event.

import com.jaffe.day02.{SensorReading, SensorSource}
import org.apache.flink.api.common.functions.MapFunction
import org.apache.flink.streaming.api.scala._

/**
 * @Author jaffe
 * @Date 2020/06/09  10:55
 */
object MapExample {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(1)

    val stream = env.addSource(new SensorSource)
    // 1. `MyMapFunction` implements the `MapFunction` interface
    stream.map(new MyMapFunction).print()

    // 2. Implement the `MapFunction` interface with an anonymous class
    stream
      .map(
        new MapFunction[SensorReading,String] {
          override def map(t: SensorReading): String = t.id
        }
      )
      .print()

    // 3. Extract the sensor ID with an anonymous function
    stream.map(t => t.id).print()
    env.execute()

  }

  class MyMapFunction extends MapFunction[SensorReading,String]{
    override def map(t: SensorReading): String = t.id
  }
}

3.1.2 Filter

A boolean condition is evaluated for each input event to filter out some elements; the remaining elements are forwarded downstream.

import com.jaffe.day02.{SensorReading, SensorSource}
import org.apache.flink.api.common.functions.FilterFunction
import org.apache.flink.streaming.api.scala._

/**
 * @Author jaffe
 * @Date 2020/06/09  21:06
 */
object FilterExample {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(1)

    val stream = env.addSource(new SensorSource)

    // Keep only the readings whose sensor ID is `sensor_1`
    // 1. Using an anonymous function
    stream
      .filter(t => t.id.equals("sensor_1")).print()

    // 2. Using an anonymous class
    stream
      .filter(
        new FilterFunction[SensorReading] {
          override def filter(t: SensorReading): Boolean = t.id.equals("sensor_1")
        }
      )

    // 3. Using a function class
    stream.filter(new MyFilterFunction).print()
    env.execute()

  }
  class MyFilterFunction extends FilterFunction[SensorReading]{
    override def filter(t: SensorReading): Boolean = t.id.equals("sensor_1")
  }
}

3.1.3 FlatMap

For each input event, flatMap can emit zero, one, or multiple output elements.

import com.jaffe.day02.{SensorReading, SensorSource}
import org.apache.flink.api.common.functions.FlatMapFunction
import org.apache.flink.streaming.api.scala._
import org.apache.flink.util.Collector

/**
 * @Author jaffe
 * @Date 2020/06/09  10:55
 */
object FlatMapExample {
  def main(args: Array[String]): Unit = {

    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(1)

    val stream = env.addSource(new SensorSource)

    // Reimplement the logic of `MapExample.scala` with a `FlatMapFunction`
    stream
      .flatMap(
        new FlatMapFunction[SensorReading,String] {
          override def flatMap(t: SensorReading, collector: Collector[String]): Unit = {
            // Send the extracted sensor ID downstream with `collect`
            collector.collect(t.id)
          }
        }
      )
     // .print()

    // Reimplement the logic of `FilterExample.scala` with a `FlatMapFunction`
    stream
      .flatMap(
        new FlatMapFunction[SensorReading,SensorReading] {
          override def flatMap(t: SensorReading, collector: Collector[SensorReading]): Unit = {
            if (t.id.equals("sensor_1")){
              collector.collect(t)
            }
          }
        }
      )
      .print()
    env.execute()

  }
}

3.2 KeyedStream transformations

These operators group the data; records in the same group share a common property (the key).

3.2.1 KeyBy

keyBy turns a DataStream into a KeyedStream by specifying a key. Events are assigned to different partitions based on their key.

import org.apache.flink.streaming.api.scala._

/**
 * @Author jaffe
 * @Date 2020/06/09  23:26
 */
object KeyByExampleFromDoc {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(1)

    val inputStream = env.fromElements(
      (1, 2, 2), (2, 3, 1), (2, 2, 4), (1, 5, 3)
    )

    val result = inputStream
      .keyBy(0) // group by the first element of the tuple
      .sum(1)   // sum the second element of the tuple
      .print()

    env.execute()

  }
}

3.2.2 Reduce

1. Each input event is combined with the value reduced so far. A reduce does not change the event type: the output stream has the same data type as the input stream.

2. As a generalization of rolling aggregations, reduce also keeps state per key. Because this state is never cleared, reduce should only be applied to streams with a bounded key domain.

import org.apache.flink.streaming.api.scala._

/**
 * @Author jaffe
 * @Date 2020/06/09  23:33
 */
object KeyByReduceExampleFromDoc {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(1)

    val input = env
      .fromElements(
        ("en", List("tea")),
        ("fr", List("vin")),
        ("en", List("cake"))
      )
      
    input
      .keyBy(0)
      // `:::` concatenates two lists
      .reduce((r1, r2) => (r1._1, r1._2 ::: r2._2))
      .print()

    env.execute()
  }
}

3.3 Multi-stream transformations

Many applications need to ingest multiple streams and merge them, or split one stream into several streams and apply different business logic to each.

3.3.1 Union

union merges two or more DataStreams into a single output DataStream with the same type as the input streams.

import com.jaffe.day02.SensorSource
import org.apache.flink.streaming.api.scala._

/**
 * @Author jaffe
 * @Date 2020/06/09  14:34
 */
object UnionExample {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(1)

    // Readings with sensor ID sensor_1 form the stream from Paris
    val parisStream = env.addSource(new SensorSource)
      .filter(r => r.id.equals("sensor_1"))
    
    // Readings with sensor ID sensor_2 form the stream from Tokyo
    val tokyoStream = env.addSource(new SensorSource)
      .filter(r => r.id.equals("sensor_2"))

    // Readings with sensor ID sensor_3 form the stream from Rio
    val rioStream = env.addSource(new SensorSource)
      .filter(r => r.id.equals("sensor_3"))

    val allCities = parisStream
      .union(
        tokyoStream,
        rioStream
      )

    allCities.print
    env.execute()
  }
}

3.3.2 Connect

connect can only combine two streams, and the two streams may have different types.

1. CoMapFunction

import org.apache.flink.streaming.api.functions.co.CoMapFunction
import org.apache.flink.streaming.api.scala._

object CoMapExample {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(1)

    val one: DataStream[(Int, Long)] = env.fromElements((1, 1L))
    val two: DataStream[(Int, String)] = env.fromElements((1, "two"))

    // Connect records that share the same key
    val connected: ConnectedStreams[(Int, Long), (Int, String)] = one.keyBy(_._1)
      .connect(two.keyBy(_._1))

    val printed: DataStream[String] = connected
      .map(new MyCoMap)

    printed.print
    env.execute()
  }

  class MyCoMap extends CoMapFunction[(Int, Long), (Int, String), String] {
    override def map1(value: (Int, Long)): String = value._2.toString + " comes from the first stream"

    override def map2(value: (Int, String)): String = value._2 + " comes from the second stream"
  }
}

2. CoFlatMapFunction

import org.apache.flink.streaming.api.functions.co.{CoFlatMapFunction, CoMapFunction}
import org.apache.flink.streaming.api.scala._
import org.apache.flink.util.Collector

object CoFlatMapExample {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
//    env.setParallelism(1)
    println(env.getParallelism) // print the default parallelism

    val one: DataStream[(Int, Long)] = env
      .fromElements((1, 1L))
      .setParallelism(1)
    val two: DataStream[(Int, String)] = env
      .fromElements((1, "two"))
      .setParallelism(1)

    // Connect records that share the same key
    val connected: ConnectedStreams[(Int, Long), (Int, String)] = one.keyBy(_._1)
      .connect(two.keyBy(_._1))

    val printed: DataStream[String] = connected
      .flatMap(new MyCoFlatMap)

    printed.print

    env.execute()
  }

  class MyCoFlatMap extends CoFlatMapFunction[(Int, Long), (Int, String), String] {
    override def flatMap1(value: (Int, Long), out: Collector[String]): Unit = {
      out.collect(value._2.toString + " comes from the first stream")
      out.collect(value._2.toString + " comes from the first stream")
    }

    override def flatMap2(value: (Int, String), out: Collector[String]): Unit = {
      out.collect(value._2 + " comes from the second stream")
    }
  }
}

3.3.3 Split

Split is the inverse of union. It splits an input stream into two or more streams. Each input element can be routed to zero, one, or multiple output streams, so split can also be used to filter or replicate elements.

import org.apache.flink.streaming.api.scala._

/**
 * @Author jaffe
 * @Date 2020/06/09  10:54
 */
object SplitExample {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(1)

    val inputStream = env.fromElements(
      (1001, "1001"),
      (999, "999")
    )

    val splitted = inputStream
      .split(t => if(t._1 > 1000) Seq("large") else Seq("small"))

    // select() extracts the stream(s) with the given name(s) from the split stream
    val large = splitted.select("large")
    val small = splitted.select("small")
    val all = splitted.select("small","large")

    large.print
    small.print()
    //all.print()
    env.execute()

  }
}

3.4 Distribution transformations

3.4.1 Random

Random data exchange is performed by the DataStream.shuffle() method. It randomly distributes records to the parallel tasks of the downstream operator.
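
A minimal sketch, assuming an existing StreamExecutionEnvironment `env` and the SensorSource from section 2.4 (the downstream map and its parallelism are illustrative):

// Randomly redistribute the readings across 4 parallel map tasks
val shuffled = env
  .addSource(new SensorSource)
  .shuffle
  .map(r => r.id)
  .setParallelism(4)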

3.4.2 Round-Robin

The rebalance() method uses a round-robin algorithm to distribute the input stream evenly across the downstream parallel tasks.
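
A minimal sketch under the same assumptions as above:

// Distribute the readings round-robin, so every downstream task
// receives roughly the same number of records
val rebalanced = env
  .addSource(new SensorSource)
  .rebalance
  .map(r => r.temperature)
  .setParallelism(4)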

3.4.3 Rescale

The rescale() method also uses a round-robin algorithm, but it only sends data to a subset of the downstream parallel tasks. The rescale strategy provides lightweight load balancing when the number of sender tasks differs from the number of receiver tasks, and it is more efficient when the number of receiver tasks is a multiple of the number of sender tasks.
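
A minimal sketch (the parallelism values are illustrative): with 2 source tasks and 4 map tasks, each source task only feeds 2 of the 4 downstream tasks.

val rescaled = env
  .addSource(new SensorSource)
  .setParallelism(2)
  .rescale
  .map(r => r.id)
  .setParallelism(4)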

3.4.4 Broadcast

The broadcast() method replicates the input stream and sends every record to all parallel tasks of the downstream operator.
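
A minimal sketch under the same assumptions:

// Every one of the 4 downstream map tasks receives a copy of every reading
val broadcasted = env
  .addSource(new SensorSource)
  .broadcast
  .map(r => r.id)
  .setParallelism(4)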

3.4.5 Global

The global() method sends all records of the input stream to the first parallel task of the downstream operator. Use it with care: routing all data to the same task puts significant pressure on the application.
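
A minimal sketch under the same assumptions:

// All readings are routed to the first of the 4 downstream map tasks
val globalStream = env
  .addSource(new SensorSource)
  .global
  .map(r => r.id)
  .setParallelism(4)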

3.4.6 Custom

When none of the built-in partitioning strategies fits, you can define your own with the partitionCustom() method. It takes a Partitioner object that implements the partitioning logic, together with the field or key of the stream on which to partition.

Example: partition a stream of integers so that all negative numbers go to the first task, and all other numbers are distributed randomly.

import org.apache.flink.api.common.functions.Partitioner

val numbers: DataStream[(Int)] = ...
numbers.partitionCustom(myPartitioner, 0)

object myPartitioner extends Partitioner[Int] {
  val r = scala.util.Random

  override def partition(key: Int, numPartitions: Int): Int = {
    if (key < 0) 0 else r.nextInt(numPartitions)
  }
}

3.5 Setting parallelism

The parallelism of operators can be controlled at the level of the execution environment, and it can also be set individually per operator. By default, every operator in the application runs with the parallelism of the execution environment, which is initialized automatically when the program starts: in a local execution environment it is the number of CPU cores, and when the application is submitted to a running Flink cluster it is the cluster's default parallelism, unless the parallelism is set explicitly on the client when the application is submitted.

val env = StreamExecutionEnvironment.getExecutionEnvironment
val defaultP = env.getParallelism
val result = env
  .addSource(new CustomSource)
  // set the parallelism of the map operator to twice the default parallelism
  .map(new MyMapper).setParallelism(defaultP * 2)
  // set the parallelism of the print sink to 2
  .print().setParallelism(2)

3.6 Implementing UDF functions

3.6.1 Function classes (Function Classes)

Flink exposes interfaces (implemented as interfaces or abstract classes) for all user-defined functions, e.g. MapFunction, FilterFunction, ProcessFunction, and so on.
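
For example, a filter UDF can be packaged as a reusable function class. This is a small sketch; the KeywordFilter name and its constructor parameter are illustrative, not from the original text:

import org.apache.flink.api.common.functions.FilterFunction

// A function class whose keyword is passed in via the constructor
class KeywordFilter(keyword: String) extends FilterFunction[String] {
  override def filter(value: String): Boolean = value.contains(keyword)
}

// usage (assuming a DataStream[String] named tweets): tweets.filter(new KeywordFilter("flink"))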

3.6.2 Lambda functions (Lambda Functions)

Anonymous (lambda) functions can implement simple logic, but they cannot use advanced features such as accessing state.

val tweets: DataStream[String] = ...
val flinkTweets = tweets.filter(_.contains("flink"))

3.6.3 Rich functions (Rich Functions)

Every Flink function class has a Rich variant. Unlike regular functions, rich functions can access the runtime context and provide lifecycle methods.

import org.apache.flink.api.common.functions.RichFlatMapFunction
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.scala._
import org.apache.flink.util.Collector

/**
 * @Author jaffe
 * @Date 2020/06/10  00:10
 */
object RichExample {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(1)

    val stream = env
      .fromElements(1,2,3)

    stream
      .flatMap(new MyFlatMap)
      .print()

    env.execute()
  }

  class MyFlatMap extends RichFlatMapFunction[Int,Int]{
  
    // open() is the initialization method of a rich function. It is called once,
    // before the operator starts processing, and is typically used for one-time setup work.
    override def open(parameters: Configuration): Unit = {
      println("start of the life cycle")
    }

    override def flatMap(in: Int, collector: Collector[Int]): Unit = {
      println("index of this parallel subtask: " + getRuntimeContext.getIndexOfThisSubtask)
      collector.collect(in)

    }
    // close() is the last lifecycle method to be called; it is typically used for cleanup work
    override def close(): Unit = {
      println("end of the life cycle")
    }
    
  }
}

4. Sink (writing output streams)

4.1 Kafka

1. Add the dependency:

For Kafka 0.11:
<dependency>
  <groupId>org.apache.flink</groupId>
  <artifactId>flink-connector-kafka-0.11_2.11</artifactId>
  <version>1.10.0</version>
</dependency>

For Kafka 2.0 and above:
<dependency>
  <groupId>org.apache.flink</groupId>
  <artifactId>flink-connector-kafka_2.11</artifactId>
  <version>1.10.0</version>
</dependency>

2. Code:

import java.util.Properties
import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.connectors.kafka.{FlinkKafkaConsumer011, FlinkKafkaProducer011}

/**
 * @Author jaffe
 * @Date 2020/06/10  00:20
 */
object KafkaExample {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(1)

    val props = new Properties()
    props.put("bootstrap.servers","hadoop103:9092")
    props.put("group.id","consumer-group")
    props.put(
      "key.deserializer",
      "org.apache.kafka.common.serialization.StringDeserializer"
    )
    props.put(
      "value.deserializer",
      "org.apache.kafka.common.serialization.StringDeserializer"
    )

    props.put("auto.offset.reset","latest")

    // read from Kafka
    val stream = env
      .addSource(
        new FlinkKafkaConsumer011[String](
          "test", // topic
          new SimpleStringSchema(),
          props
        )
      )

    // write to Kafka
    stream.addSink(
      new FlinkKafkaProducer011[String](
        "hadoop103:9092",
        "test",
        new SimpleStringSchema()
      )
    )
    stream.print()
    env.execute()

  }
}

4.2 Redis

1. Add the dependency:

<dependency>
  <groupId>org.apache.bahir</groupId>
  <artifactId>flink-connector-redis_2.11</artifactId>
  <version>1.0</version>
</dependency>

2. Code:

import com.jaffe.day02.{SensorReading, SensorSource}
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.connectors.redis.RedisSink
import org.apache.flink.streaming.connectors.redis.common.config.FlinkJedisPoolConfig
import org.apache.flink.streaming.connectors.redis.common.mapper.{RedisCommand, RedisCommandDescription, RedisMapper}

/**
 * @Author jaffe
 * @Date 2020/06/10  08:40
 */
object SinkToRedis {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(1)

    val stream = env.addSource(new SensorSource)

    // Redis host
    val conf = new FlinkJedisPoolConfig.Builder()
        .setHost("hadoop103")
        .build()
    stream.addSink(new RedisSink[SensorReading](conf,new MyRedisMapper))

    env.execute()

  }
  class MyRedisMapper extends RedisMapper[SensorReading]{
    // The Redis command to use
    override def getCommandDescription: RedisCommandDescription = {
      new RedisCommandDescription(RedisCommand.HSET,"sensor")
    }
    // The key of the hash entry
    override def getKeyFromData(t: SensorReading): String = t.id
    // The value of the hash entry
    override def getValueFromData(t: SensorReading): String = t.temperature.toString
  }
}

4.3 ElasticSearch

1. Add the dependency:

<dependency>
  <groupId>org.apache.flink</groupId>
  <artifactId>flink-connector-elasticsearch6_2.11</artifactId>
  <version>1.10.0</version>
</dependency>

2. Code:

import java.util

import com.jaffe.day02.{SensorReading, SensorSource}
import org.apache.flink.api.common.functions.RuntimeContext
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.connectors.elasticsearch.{ElasticsearchSinkFunction, RequestIndexer}
import org.apache.flink.streaming.connectors.elasticsearch6.ElasticsearchSink
import org.apache.http.HttpHost
import org.elasticsearch.client.Requests

/**
 * @Author jaffe
 * @Date 2020/06/10  09:22
 */
object SinkToES {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(1)

    val stream = env.addSource(new SensorSource)

    // Elasticsearch host and port
    val httpHosts = new util.ArrayList[HttpHost]()
    httpHosts.add(new HttpHost("hadoop103", 8300))

    // Defines how data is written to Elasticsearch
    val esSinkBuilder = new ElasticsearchSink.Builder[SensorReading](
      httpHosts, // Elasticsearch hosts
      // Anonymous class that defines how records are written to Elasticsearch
      new ElasticsearchSinkFunction[SensorReading] {
        override def process(t: SensorReading,
                             runtimeContext: RuntimeContext,
                             requestIndexer: RequestIndexer): Unit = {

          // Map with String keys and String values
          val json = new util.HashMap[String, String]()
          json.put("data", t.toString)
          // Build a request that writes the record to Elasticsearch
          val indexRequest = Requests
            .indexRequest()
            // the index is named sensor
            .index("sensor")
            .`type`("readingData")
            .source(json)

          requestIndexer.add(indexRequest)
        }
      }
    )

    // Defines how many records are buffered before they
    // are written to Elasticsearch as one bulk request
    esSinkBuilder.setBulkFlushMaxActions(10)
    stream.addSink(esSinkBuilder.build())
    env.execute()
  }
}

4.4 Custom JDBC sink

1. Add the dependency:

Add the MySQL dependency to write to MySQL:
<dependency>
  <groupId>mysql</groupId>
  <artifactId>mysql-connector-java</artifactId>
  <version>5.1.44</version>
</dependency>

2. Code:

import java.sql.{Connection, DriverManager, PreparedStatement}
import com.jaffe.day02.{SensorReading, SensorSource}
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.functions.sink.{RichSinkFunction, SinkFunction}
import org.apache.flink.streaming.api.scala._

/**
 * @Author jaffe
 * @Date 2020/06/10  10:07
 */
object SinkToMySQL {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(1)

    val stream = env.addSource(new SensorSource)

    stream.addSink(new MyJdbcSink)
    env.execute()
  }

  class MyJdbcSink extends RichSinkFunction[SensorReading]{
    // connection
    var conn:Connection = _
    // insert statement
    var insertStmt:PreparedStatement = _
    // update statement
    var updateStmt:PreparedStatement = _

    // At the start of the lifecycle, open the connection
    override def open(parameters: Configuration): Unit = {
      conn = DriverManager.getConnection(
        "jdbc:mysql://hadoop103:3306/test",
        "root",
        "123456"
      )

      insertStmt = conn.prepareStatement(
        "insert into sinkToMysql(id,temperature) value (?,?)"
      )

      updateStmt = conn.prepareStatement(
        "update sinkToMysql set temperature = ? where id = ?"
      )
    }

    // Execute the SQL statements: try an update first; if no row was updated, insert
    override def invoke(value: SensorReading, context: SinkFunction.Context[_]): Unit = {
      updateStmt.setDouble(1,value.temperature)
      updateStmt.setString(2,value.id)
      updateStmt.execute()

      if (updateStmt.getUpdateCount == 0){
        insertStmt.setString(1,value.id)
        insertStmt.setDouble(2,value.temperature)
        insertStmt.execute()
      }
    }

    // At the end of the lifecycle, release the resources
    override def close(): Unit = {
      insertStmt.close()
      updateStmt.close()
      conn.close()
    }
  }
}
