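Writing a DStream to a MySQL Database
The goal is to use Spark Streaming to monitor a socket stream in real time, compute a WordCount over it, and save the results to MySQL.
Note: this project manages its dependencies with sbt.
In Spark applications, external systems often need the data that Spark has processed as DStreams, so output operations are used to write DStream data to a database or a file system.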
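Spark Streaming is a real-time computation framework built on Spark. It can consume data from many kinds of sources and process it in an efficient, scalable, and fault-tolerant way. Its workflow consists of the following steps:
A stateful DStream transformation in Spark Streaming derives a new DStream from an existing one by using historical data or intermediate results, rather than the current batch alone.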
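To make the idea concrete, the sketch below replays the update rule outside Spark, using plain Scala collections. It is only an illustration, not how Spark executes `updateStateByKey` internally. The batch contents and the names `StatefulSketch`, `updateCount`, and `runBatch` are made up for this example. For each key, the values of the current batch are summed and added to the count carried over from earlier batches.

```scala
object StatefulSketch {
  // Update rule for a running word count: new state = sum of this batch + previous state.
  def updateCount(values: Seq[Int], state: Option[Int]): Option[Int] =
    Some(values.sum + state.getOrElse(0))

  // Apply one batch of words to the running state.
  // A plain Map stands in for the state that Spark would keep between batches.
  def runBatch(state: Map[String, Int], batch: Seq[String]): Map[String, Int] = {
    // Turn each word into a list of 1s, grouped by word.
    val grouped: Map[String, Seq[Int]] =
      batch.groupBy(identity).map { case (w, ws) => w -> ws.map(_ => 1) }
    // Visit every key seen so far, including keys absent from this batch.
    val keys = state.keySet ++ grouped.keySet
    keys.flatMap { k =>
      updateCount(grouped.getOrElse(k, Seq.empty), state.get(k)).map(k -> _)
    }.toMap
  }

  def main(args: Array[String]): Unit = {
    // Two hypothetical batches of words.
    val batches = Seq(Seq("spark", "mysql", "spark"), Seq("spark", "dstream"))
    val finalState = batches.foldLeft(Map.empty[String, Int])(runBatch)
    println(finalState)
  }
}
```

After the two batches the state holds spark = 3, mysql = 1, and dstream = 1. `mysql` keeps its count even though it does not appear in the second batch, which is what separates a stateful transformation from a per-batch count.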
ThisBuild / version := "0.1.0-SNAPSHOT"
ThisBuild / scalaVersion := "2.13.11"
lazy val root = (project in file("."))
  .settings(
    name := "SparkLearning",
    idePackagePrefix := Some("cn.lh.spark"),
    libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.4.1",
    libraryDependencies += "org.apache.spark" %% "spark-core" % "3.4.1",
    libraryDependencies += "org.apache.hadoop" % "hadoop-auth" % "3.3.6",
    libraryDependencies += "org.apache.spark" %% "spark-streaming" % "3.4.1",
    libraryDependencies += "org.apache.spark" %% "spark-streaming-kafka-0-10" % "3.4.1",
    libraryDependencies += "org.apache.spark" %% "spark-mllib" % "3.4.1" % "provided",
    libraryDependencies += "mysql" % "mysql-connector-java" % "8.0.30"
  )
To let Spark Streaming monitor console input, two source files need to be written:
NetworkWordCountStatefultoMysql.scala
package cn.lh.spark

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.{DStream, ReceiverInputDStream}

object NetworkWordCountStatefultoMysql {
  def main(args: Array[String]): Unit = {
    // State update function: add this batch's values to the previous count
    val updateFunc = (values: Seq[Int], state: Option[Int]) => {
      val currentCount = values.foldLeft(0)(_ + _)
      val previousCount = state.getOrElse(0)
      Some(currentCount + previousCount)
    }
    // Set the log4j log level (StreamingExamples is a helper object that is not shown in this post)
    StreamingExamples.setStreamingLogLevels()
    val conf: SparkConf = new SparkConf().setAppName("NetworkCountStateful").setMaster("local[2]")
    val scc: StreamingContext = new StreamingContext(conf, Seconds(5))
    // Set the checkpoint directory; it provides fault tolerance and is required by updateStateByKey
    scc.checkpoint("F:\\niit\\2023\\2023_2\\Spark\\codes\\checkpoint")
    val lines: ReceiverInputDStream[String] = scc.socketTextStream("192.168.137.110", 9999)
    val words: DStream[String] = lines.flatMap(_.split(" "))
    val wordDstream: DStream[(String, Int)] = words.map(x => (x, 1))
    val stateDstream: DStream[(String, Int)] = wordDstream.updateStateByKey[Int](updateFunc)
    // Print the current state
    stateDstream.print()
    // Save the counts to MySQL, one connection per partition
    stateDstream.foreachRDD(rdd => {
      val repartitionedRDD = rdd.repartition(3)
      repartitionedRDD.foreachPartition(StreamingSaveMySQL8.writeToMySQL)
    })
    scc.start()
    scc.awaitTermination()
    scc.stop()
  }
}
StreamingSaveMySQL8.scala
package cn.lh.spark

import java.sql.DriverManager

object StreamingSaveMySQL8 {
  // Write one partition of (word, count) pairs to MySQL
  def writeToMySQL(iter: Iterator[(String, Int)]): Unit = {
    // MySQL connection settings
    val ip = "192.168.137.110"
    val port = "3306"
    val db = "sparklearning"
    val username = "lh"
    val pwd = "Lh123456!"
    val jdbcurl = s"jdbc:mysql://$ip:$port/$db"
    val conn = DriverManager.getConnection(jdbcurl, username, pwd)
    val statement = conn.prepareStatement("INSERT INTO wordcount (word,count) VALUES (?,?)")
    try {
      // Insert one row per (word, count) pair
      iter.foreach { wc =>
        statement.setString(1, wc._1.trim)
        statement.setInt(2, wc._2)
        statement.executeUpdate()
      }
    } catch {
      case e: Exception => e.printStackTrace()
    } finally {
      // Always release the JDBC resources
      if (statement != null) {
        statement.close()
      }
      if (conn != null) {
        conn.close()
      }
    }
  }
}
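Because the state stream carries the cumulative count of every word in each batch, a plain `INSERT` adds a new row per word every five seconds. One possible variation, sketched below, is to upsert instead and to send each partition as a single JDBC batch. This is not the code used in the experiment. It assumes the `wordcount` table has a UNIQUE or PRIMARY KEY on `word`, and the names `UpsertSketch`, `upsertSql`, and `writePartition` are hypothetical. The `VALUES(col)` form in `ON DUPLICATE KEY UPDATE` is accepted by MySQL 8.0 but has been deprecated since 8.0.20.

```scala
import java.sql.Connection

object UpsertSketch {
  // Assumes a UNIQUE or PRIMARY KEY on `word`; without one this behaves like a plain INSERT.
  val upsertSql: String =
    "INSERT INTO wordcount (word, `count`) VALUES (?, ?) " +
      "ON DUPLICATE KEY UPDATE `count` = VALUES(`count`)"

  // Write one partition as a single JDBC batch.
  // The caller opens and closes the connection.
  def writePartition(conn: Connection, iter: Iterator[(String, Int)]): Unit = {
    val stmt = conn.prepareStatement(upsertSql)
    try {
      iter.foreach { case (word, count) =>
        stmt.setString(1, word.trim)
        stmt.setInt(2, count)
        stmt.addBatch() // queue the row instead of executing it immediately
      }
      stmt.executeBatch() // send all queued rows together
    } finally {
      stmt.close()
    }
  }
}
```

Inside `foreachPartition`, the caller would open a connection with `DriverManager.getConnection`, pass it to `writePartition` together with the partition iterator, and close it in a `finally` block.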
Preparation:
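One thing the writer code needs before it can run is a `wordcount` table in the `sparklearning` database. The post does not show its definition, so the sketch below is a guess: the column names come from the `INSERT` statement, while the column types, the object name `CreateWordCountTable`, and the absence of a unique key (the plain `INSERT` writes the same word again in every batch) are assumptions.

```scala
import java.sql.{Connection, DriverManager}

object CreateWordCountTable {
  // Hypothetical schema: column names are taken from the INSERT statement, types are assumed.
  val createTableSql: String =
    """CREATE TABLE IF NOT EXISTS wordcount (
      |  word    VARCHAR(255) NOT NULL,
      |  `count` INT          NOT NULL
      |)""".stripMargin

  // Create the table on an already open connection.
  def createTable(conn: Connection): Unit = {
    val stmt = conn.createStatement()
    try {
      stmt.executeUpdate(createTableSql)
    } finally {
      stmt.close()
    }
  }

  def main(args: Array[String]): Unit = {
    // Connection settings copied from StreamingSaveMySQL8; adjust them to your environment.
    val conn = DriverManager.getConnection(
      "jdbc:mysql://192.168.137.110:3306/sparklearning", "lh", "Lh123456!")
    try {
      createTable(conn)
    } finally {
      conn.close()
    }
  }
}
```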
Start NetworkWordCountStatefultoMysql.scala
![[Pasted image 20230804214904.png]]
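Type some text into the nc session on the listening port, then check the results in the IDEA console and in MySQL.
In this experiment, a program was developed in IDEA on Spark Streaming 3.4.1 to monitor a socket stream and count the incoming strings, giving a real-time count of word occurrences. The experiment succeeded and was relatively simple.
Points for later improvement:
Developers are welcome to help improve the code. If you run into problems or have questions, please raise them for discussion. Thanks!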