Structured Streaming Integration with Kafka & MySQL

Preface
Hi everyone, I'm DJ丶小哪吒, back again to share some knowledge with you. I have a strong interest in software development and enjoy sharing what I learn; that is exactly why I blog. My skills are limited, so this post may contain mistakes. If you spot any, please point them out in the comments and I will fix them promptly.

DJ丶小哪吒 is back with another installment. Last time I gave you a quick introduction to Structured Streaming; this time we will integrate Structured Streaming with some other technologies. Without further ado, let's get started.

Table of Contents

    • 1. Integrating Structured Streaming with Other Technologies
        • 1.1. Integrating with Kafka
            • 1.1.1. Official Documentation
            • 1.1.2. Environment Setup
        • 1.2. Integrating with MySQL
            • 1.2.1. Overview
            • 1.2.2. Code Demo

1. Integrating Structured Streaming with Other Technologies

1.1. Integrating with Kafka
1.1.1. Official Documentation

http://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html


●Creating a Kafka Source for Streaming Queries

// Subscribe to 1 topic
val df = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("subscribe", "topic1")
  .load()
df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .as[(String, String)]
  
// Subscribe to multiple topics
val df = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("subscribe", "topic1,topic2")
  .load()
df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .as[(String, String)]
  
// Subscribe to a pattern (wildcard topics)
val df = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("subscribePattern", "topic.*")
  .load()
df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .as[(String, String)]

●Creating a Kafka Source for Batch Queries

// Subscribe to 1 topic
// defaults to the earliest and latest offsets
val df = spark
  .read
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("subscribe", "topic1")
  .load()
df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .as[(String, String)]
// Subscribe to multiple topics,
// specifying explicit Kafka offsets (in the JSON, -2 means earliest and -1 means latest)
val df = spark
  .read
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("subscribe", "topic1,topic2")
  .option("startingOffsets", """{"topic1":{"0":23,"1":-2},"topic2":{"0":-2}}""")
  .option("endingOffsets", """{"topic1":{"0":50,"1":-1},"topic2":{"0":-1}}""")
  .load()
df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .as[(String, String)]
// Subscribe to a pattern (wildcard topics), at the earliest and latest offsets
val df = spark
  .read
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("subscribePattern", "topic.*")
  .option("startingOffsets", "earliest")
  .option("endingOffsets", "latest")
  .load()
df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .as[(String, String)]

●Note: the schema of the data read from Kafka is fixed and contains the following columns:

Column         Type    Description
key            binary  the message key
value          binary  the message value
topic          string  the topic name
partition      int     the partition
offset         long    the offset
timestamp      long    the timestamp
timestampType  int     the timestamp type
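
As a quick illustration, here is a minimal sketch (the broker node01:9092 and topic spark_kafka are assumptions borrowed from the demo later in this post) that projects every column of this fixed schema, casting the binary key and value to strings:

// minimal sketch: expose all columns of the fixed Kafka source schema
val kafkaDF = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "node01:9092")
  .option("subscribe", "spark_kafka")
  .load()

val typedDF = kafkaDF.selectExpr(
  "CAST(key AS STRING) AS key",      // key arrives as binary
  "CAST(value AS STRING) AS value",  // value arrives as binary
  "topic", "partition", "offset", "timestamp", "timestampType"
)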

●Note: the following parameters must not be set, otherwise Kafka will throw an exception (a sketch of how ordinary Kafka settings are passed in follows this list):
group.id: the Kafka source automatically creates a unique group id for each query.
auto.offset.reset: to avoid having to set startingOffsets by hand every time, Structured Streaming manages offsets internally while consuming, which guarantees no data is lost when subscribing to dynamic topics. In streaming mode, startingOffsets only applies to the first start of a query; afterwards the saved offsets are read automatically.
key.deserializer, value.deserializer, key.serializer, value.serializer: keys and values are always handled as byte arrays (ByteArraySerializer/ByteArrayDeserializer).
enable.auto.commit: the Kafka source does not commit any offsets.
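
A small hedged sketch of how this looks in practice: Kafka consumer settings other than the ones listed above are passed through by prefixing them with "kafka.", while subscribe and startingOffsets are Spark-level options. The broker address, topic, and the extra fetch setting below are illustrative assumptions:

val tunedDF = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "node01:9092")      // "kafka."-prefixed: handed to the Kafka consumer
  .option("kafka.max.partition.fetch.bytes", "1048576")  // another pass-through consumer setting (illustrative value)
  .option("subscribe", "spark_kafka")                    // Spark-level option
  .option("startingOffsets", "latest")                   // only honored on the first start of the query
  .load()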


1.1.2. Environment Setup

●Start Kafka

/export/servers/kafka/bin/kafka-server-start.sh -daemon /export/servers/kafka/config/server.properties 
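
If the spark_kafka topic does not exist yet (and topic auto-creation is disabled), create it first. The command below is a sketch that assumes the same install path and a ZooKeeper instance at node01:2181:

/export/servers/kafka/bin/kafka-topics.sh --create --zookeeper node01:2181 --replication-factor 1 --partitions 3 --topic spark_kafka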

●Produce data to the topic

/export/servers/kafka/bin/kafka-console-producer.sh --broker-list node01:9092 --topic spark_kafka
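
As a quick check, type a few space-separated words into the producer console (the lines below are arbitrary examples); the demo that follows splits each line on spaces and counts the words:

hadoop spark hive
spark spark kafka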

Code implementation

package cn.itcast.structedstreaming

import org.apache.spark.SparkContext
import org.apache.spark.sql.streaming.Trigger
import org.apache.spark.sql.{DataFrame, Dataset, Row, SparkSession}

object KafkaStructuredStreamingDemo {
  def main(args: Array[String]): Unit = {
    // 1. Create a SparkSession
    val spark: SparkSession = SparkSession.builder()
      .master("local[*]")
      .appName("SparkSQL")
      .getOrCreate()
    val sc: SparkContext = spark.sparkContext
    sc.setLogLevel("WARN")
    import spark.implicits._
    // 2. Connect to Kafka and consume data
    val dataDF: DataFrame = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "node01:9092")
      .option("subscribe", "spark_kafka")
      .load()
    // 3. Process the data
    // Note: the key/value read from Kafka are binary, so as the official docs require,
    // cast them to the actual types you need
    val dataDS: Dataset[String] = dataDF.selectExpr("CAST(value AS STRING)").as[String]
    val wordDS: Dataset[String] = dataDS.flatMap(_.split(" "))
    val result: Dataset[Row] = wordDS.groupBy("value").count().sort($"count".desc)
    result.writeStream
      .format("console")
      .outputMode("complete")
      .trigger(Trigger.ProcessingTime(0))
      .option("truncate",false)//超过长度的列不截断显示,即完全显示
      .start()
      .awaitTermination()
  }
}
1.2. Integrating with MySQL
1.2.1. Overview

●Requirement
In development we often need to write the results of a streaming computation to an external database such as MySQL, but unfortunately the Structured Streaming API does not support external databases as sinks out of the box.
If such support is added in the future, its API will presumably be very simple, e.g.:
format("jdbc").option("url", "jdbc:mysql://…").start()
For now we have to define our own JdbcSink by extending ForeachWriter and implementing its methods (a foreachBatch-based alternative is sketched after the demo below).
●Reference
https://databricks.com/blog/2017/04/04/real-time-end-to-end-integration-with-apache-kafka-in-apache-sparks-structured-streaming.html


1.2.2. Code Demo
package cn.itcast.structedstreaming

import java.sql.{Connection, DriverManager, PreparedStatement}

import org.apache.spark.SparkContext
import org.apache.spark.sql._
import org.apache.spark.sql.streaming.Trigger


object JDBCSinkDemo {
  def main(args: Array[String]): Unit = {
    // 1. Create a SparkSession
    val spark: SparkSession = SparkSession.builder()
      .master("local[*]")
      .appName("SparkSQL")
      .getOrCreate()
    val sc: SparkContext = spark.sparkContext
    sc.setLogLevel("WARN")
    import spark.implicits._
    // 2. Connect to Kafka and consume data
    val dataDF: DataFrame = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "node01:9092")
      .option("subscribe", "spark_kafka")
      .load()
    // 3. Process the data
    // Note: the key/value read from Kafka are binary, so as the official docs require, cast them to the actual types you need
    val dataDS: Dataset[String] = dataDF.selectExpr("CAST(value AS STRING)").as[String]
    val wordDS: Dataset[String] = dataDS.flatMap(_.split(" "))
    val result: Dataset[Row] = wordDS.groupBy("value").count().sort($"count".desc)
    val writer = new JDBCSink("jdbc:mysql://localhost:3306/bigdata?characterEncoding=UTF-8", "root", "root")
    result.writeStream
      .foreach(writer)
      .outputMode("complete")
      .trigger(Trigger.ProcessingTime(0))
      .start()
      .awaitTermination()
  }

  class JDBCSink(url:String,username:String,password:String) extends ForeachWriter[Row] with Serializable{
    var connection: Connection = _ // _ is a placeholder; the variable is assigned later in open()
    var preparedStatement: PreparedStatement = _
    // open the connection
    override def open(partitionId: Long, version: Long): Boolean = {
      connection = DriverManager.getConnection(url, username, password)
      true
    }

    /*
    CREATE TABLE `t_word` (
        `id` int(11) NOT NULL AUTO_INCREMENT,
        `word` varchar(255) NOT NULL,
        `count` int(11) DEFAULT NULL,
        PRIMARY KEY (`id`),
        UNIQUE KEY `word` (`word`)
      ) ENGINE=InnoDB AUTO_INCREMENT=26 DEFAULT CHARSET=utf8;
     */
    //replace INTO `bigdata`.`t_word` (`id`, `word`, `count`) VALUES (NULL, NULL, NULL);
    // process each row -- save it to MySQL
    override def process(row: Row): Unit = {
      val word: String = row.get(0).toString
      val count: String = row.get(1).toString
      println(word+":"+count)
      // REPLACE INTO: inserts the row if it does not exist, otherwise replaces it
      // Note: REPLACE INTO requires the table to have a primary key or a unique index
      val sql = "REPLACE INTO `t_word` (`id`, `word`, `count`) VALUES (NULL, ?, ?);"
      preparedStatement = connection.prepareStatement(sql)
      preparedStatement.setString(1,word)
      preparedStatement.setInt(2,Integer.parseInt(count))
      preparedStatement.executeUpdate()
    }

    // release resources (close the statement before the connection)
    override def close(errorOrNull: Throwable): Unit = {
      if (preparedStatement != null) {
        preparedStatement.close()
      }
      if (connection != null) {
        connection.close()
      }
    }
  }
}
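
For completeness: on Spark 2.4 or later there is an alternative that avoids the hand-written ForeachWriter. foreachBatch hands every micro-batch to the regular DataFrame JDBC writer. The sketch below reuses the result Dataset from the demo above, assumes the MySQL JDBC driver is on the classpath, and writes to a hypothetical table t_word_batch:

    // Alternative sketch (Spark 2.4+): write each micro-batch with the built-in JDBC writer
    val writeBatchToMySQL: (Dataset[Row], Long) => Unit = (batchDF, batchId) => {
      batchDF.write
        .format("jdbc")
        .option("url", "jdbc:mysql://localhost:3306/bigdata?characterEncoding=UTF-8")
        .option("dbtable", "t_word_batch") // hypothetical target table
        .option("user", "root")
        .option("password", "root")
        .mode("overwrite") // complete mode recomputes the full result, so the table is rewritten each batch
        .save()
    }

    result.writeStream
      .outputMode("complete")
      .foreachBatch(writeBatchToMySQL)
      .start()
      .awaitTermination()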

That's it for this post. Did you get it all? Feel free to follow me if you are passing by; your follows and likes are what keep me going, and I will keep sharing more knowledge with you~~~.

I'm DJ丶小哪吒, a tool-person of the internet industry. My motto: "I don't produce code, I'm just a code porter"... Haha, see you next time, bye~

Never give up on your dream just because there is no applause.
