Flink: connecting to Kafka and converting a DataStream to a Table for queries

The Table API is a unified relational API for stream and batch processing. Table API queries can run on batch or streaming input without modification. The Table API is a superset of the SQL language and is designed specifically for use with Apache Flink. It is a language-integrated API for Scala and Java: instead of specifying queries as SQL strings, Table API queries are defined in an embedded-language style in Java or Scala, with IDE support such as auto-completion and syntax validation.
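
For a first concrete look at this "embedded language" style, here is a minimal, self-contained sketch that runs the same word-count query both as a chain of Table API operators and as a plain SQL string. It assumes the flink-table_2.11 batch API used in the rest of this post; the object name WordCountStyles and the inline sample data are purely illustrative.

import org.apache.flink.api.scala._
import org.apache.flink.table.api.TableEnvironment
import org.apache.flink.table.api.scala._

object WordCountStyles {
  case class WC(word: String, frequency: Long)

  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment
    val tEnv = TableEnvironment.getTableEnvironment(env)

    // Sample batch input standing in for real data.
    val words = env.fromElements(WC("hello", 1), WC("hello", 1), WC("ciao", 1))
    tEnv.registerDataSet("Words", words, 'word, 'frequency)

    // Table API style: expressions are Scala code, checked by the compiler and the IDE.
    val byOperators = tEnv.scan("Words")
      .groupBy('word)
      .select('word, 'frequency.sum as 'cnt)

    // SQL style: the same query as a string, parsed by Flink at runtime.
    val bySql = tEnv.sqlQuery("select word, sum(frequency) as cnt from Words group by word")

    byOperators.toDataSet[(String, Long)].print()
    bySql.toDataSet[(String, Long)].print()
  }
}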

Code

The complete code is shown first.

Required pom dependencies:

<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-scala_2.11</artifactId>
    <version>${flink.version}</version>
</dependency>
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-table_2.11</artifactId>
    <version>${flink.version}</version>
</dependency>
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-streaming-scala_2.11</artifactId>
    <version>${flink.version}</version>
</dependency>

<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-connector-kafka-0.8_2.11</artifactId>
    <version>${flink.version}</version>
</dependency>

Consume data from Kafka, convert it to a Table, and then run SQL queries against it.
The code is written in Scala. Do not leave out any of the following imports, otherwise you will run into problems.

import org.apache.flink.api.scala.{DataSet, ExecutionEnvironment}
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer08
import org.apache.flink.streaming.util.serialization.SimpleStringSchema
import org.apache.flink.table.api.TableEnvironment
import org.apache.flink.table.api.scala._

Here is the complete code:

package com.ddxygq.bigdata.flink.sql

import java.util.Properties

import org.apache.flink.api.scala.{DataSet, ExecutionEnvironment}
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer08
import org.apache.flink.streaming.util.serialization.SimpleStringSchema
import org.apache.flink.table.api.TableEnvironment
import org.apache.flink.table.api.scala._

/**
  * @ Author: keguang
  * @ Date: 2019/2/22 16:13
  * @ version: v1.0.0
  * @ description: 
  */
object TableDemo {
  def main(args: Array[String]): Unit = {
    demo

  }

  def demo2(): Unit ={
    val env = ExecutionEnvironment.getExecutionEnvironment
    val tEnv = TableEnvironment.getTableEnvironment(env)

    val input:DataSet[WC] = env.fromElements(WC("hello", 1), WC("hello", 1), WC("ciao", 1))
    val input2:DataSet[WC] = env.fromElements(WC("hello", 1), WC("hello", 1))
    val table = input.toTable(tEnv, 'word, 'frequency)
    val table2 = input2.toTable(tEnv, 'word2, 'frequency2)
    val result = table.join(table2).where('word === 'word2).select('word, 'frequency)
    result.toDataSet[(String, Long)].print()

  }

  def demo: Unit ={
    val sEnv = StreamExecutionEnvironment.getExecutionEnvironment
    val sTableEnv = TableEnvironment.getTableEnvironment(sEnv)

    // Connect to Kafka
    val ZOOKEEPER_HOST = "qcloud-test-hadoop01:2181"
    val KAFKA_BROKERS = "qcloud-test-hadoop01:9092,qcloud-test-hadoop02:9092,qcloud-test-hadoop03:9092"
    val TRANSACTION_GROUP = "transaction"
    val kafkaProps = new Properties()
    kafkaProps.setProperty("zookeeper.connect",ZOOKEEPER_HOST)
    kafkaProps.setProperty("bootstrap.servers", KAFKA_BROKERS)
    kafkaProps.setProperty("group.id",TRANSACTION_GROUP)
    val input = sEnv.addSource(
      new FlinkKafkaConsumer08[String]("flink-test", new SimpleStringSchema(), kafkaProps)
    )
      .flatMap(x => x.split(" "))
      .map(x => (x, 1L))

    // registerDataStream returns Unit, so there is nothing to assign here
    sTableEnv.registerDataStream("Words", input, 'word, 'frequency)

    val result = sTableEnv
      .scan("Words")
      .groupBy("word")
      .select('word, 'frequency.sum as 'cnt)
    sTableEnv.toRetractStream[(String, Long)](result).print()

    sTableEnv.sqlQuery("select * from Words").toAppendStream[(String, Long)].print()

    sEnv.execute("TableDemo")
  }
}

case class WC(word: String, frequency: Long)

There are two points to note here:
1. The example uses both Table API operators and standard SQL query syntax, in order to demonstrate the basic usage of the Table API.

val result = sTableEnv
      .scan("Words")
      .groupBy("word")
      .select('word, 'frequency.sum as 'cnt)

This grouped aggregation could equally be written as:

val result  = sTableEnv.sqlQuery("select word,sum(frequency) as cnt from Words group by word")

// print to the console
sTableEnv.toRetractStream[(String, Long)](result).print()

2. How does this differ from the result of the following query?

sTableEnv.sqlQuery("select * from Words").toAppendStream[(String, Long)].print()

The difference is clear. Because we are consuming real-time data from Kafka, the Words table is a dynamic stream table to which data is continuously appended. The first query is a GROUP BY aggregation, so its result has to be continuously updated: if the current result is (hello,4) and another "hello" arrives, the result must be updated to (hello,5), and if a new word appears, a new row must be inserted. The second query, select * from Words, simply appends its results.

That is why the way the results are printed to the console differs: the former calls the toRetractStream method, while the latter calls toAppendStream.
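
To see the difference without a Kafka cluster, here is a minimal, self-contained sketch that drives the same two queries from a bounded in-memory stream. It assumes the same flink-table_2.11 APIs as the code above; the object name RetractVsAppendDemo and the sample data are purely illustrative.

import org.apache.flink.streaming.api.scala._
import org.apache.flink.table.api.TableEnvironment
import org.apache.flink.table.api.scala._

object RetractVsAppendDemo {
  def main(args: Array[String]): Unit = {
    val sEnv = StreamExecutionEnvironment.getExecutionEnvironment
    val sTableEnv = TableEnvironment.getTableEnvironment(sEnv)

    // A bounded in-memory stream standing in for the Kafka topic.
    val words = sEnv.fromElements("hello", "hello", "ciao").map(w => (w, 1L))
    sTableEnv.registerDataStream("Words", words, 'word, 'frequency)

    // GROUP BY aggregation: earlier results get updated, so a retract stream is needed.
    // Each update arrives as (false, oldRow) followed by (true, newRow),
    // e.g. (true,(hello,1)), then (false,(hello,1)) and (true,(hello,2)).
    val agg = sTableEnv.sqlQuery("select word, sum(frequency) as cnt from Words group by word")
    sTableEnv.toRetractStream[(String, Long)](agg).print()

    // Plain projection: rows are only ever appended, never updated.
    sTableEnv.sqlQuery("select * from Words").toAppendStream[(String, Long)].print()

    sEnv.execute("RetractVsAppendDemo")
  }
}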

Converting a Table into a DataStream or DataSet

Converting a Table into a DataStream

There are two modes for converting a Table into a DataStream:

  • Append mode (toAppendStream): this mode can only be used if the dynamic Table is modified solely by INSERT changes, i.e. it is append-only and previously emitted results are never updated.
  • Retract mode (toRetractStream): this mode can always be used. It encodes INSERT and DELETE changes with a boolean flag.

// get TableEnvironment
// registration of a DataSet is equivalent
val tableEnv = TableEnvironment.getTableEnvironment(env)

// Table with two fields (String name, Integer age)
val table: Table = ...

// convert the Table into an append DataStream of Row
val dsRow: DataStream[Row] = tableEnv.toAppendStream[Row](table)

// convert the Table into an append DataStream of Tuple2[String, Int]
val dsTuple: DataStream[(String, Int)] =
  tableEnv.toAppendStream[(String, Int)](table)

// convert the Table into a retract DataStream of Row.
//   A retract stream of type X is a DataStream[(Boolean, X)].
//   The boolean field indicates the type of the change.
//   True is INSERT, false is DELETE.
val retractStream: DataStream[(Boolean, Row)] = tableEnv.toRetractStream[Row](table)

Converting a Table into a DataSet

// get TableEnvironment
// registration of a DataSet is equivalent
val tableEnv = TableEnvironment.getTableEnvironment(env)

// Table with two fields (String name, Integer age)
val table: Table = ...

// convert the Table into a DataSet of Row
val dsRow: DataSet[Row] = tableEnv.toDataSet[Row](table)

// convert the Table into a DataSet of Tuple2[String, Int]
val dsTuple: DataSet[(String, Int)] = tableEnv.toDataSet[(String, Int)](table)
