Spark Streaming and External Storage

I. Writing a DStream to Files

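Spark Streaming provides several high-level output operations for writing a DStream to external files: saveAsObjectFiles, saveAsTextFiles, and saveAsHadoopFiles. They write the DStream to serialized object files, text files, and Hadoop files, respectively.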
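
As a rough illustration of what these operations leave on disk: each batch is written to its own directory, named "prefix-TIME_IN_MS[.suffix]" from the prefix passed to the call. The sketch below only reproduces that naming rule in plain Scala. BatchOutputNaming, batchDir, the /tmp/wordcount prefix, and the timestamp are hypothetical names and values chosen for illustration; they are not part of Spark.

```scala
// Sketch of the per-batch directory naming; no Spark dependency is needed.
object BatchOutputNaming {
  // Follows the "prefix-TIME_IN_MS[.suffix]" rule; the suffix is optional.
  def batchDir(prefix: String, batchTimeMs: Long, suffix: String = ""): String =
    if (suffix.isEmpty) s"$prefix-$batchTimeMs" else s"$prefix-$batchTimeMs.$suffix"

  def main(args: Array[String]): Unit = {
    val start = 1571300000000L // arbitrary example timestamp in milliseconds
    val intervalMs = 3000L     // assumes a 3-second batch interval
    // Three consecutive batches end up in three separate directories.
    (0 until 3).foreach { i =>
      println(batchDir("/tmp/wordcount", start + i * intervalMs))
    }
  }
}
```

Because every batch interval produces a new directory, a long-running job accumulates many small output directories under the same parent.
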
The simple word count below writes a DStream to text files.

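Key steps

1. Build a StreamingContext and set the Spark master URL (local[*] in this example).
2. Use textFileStream to read text files from the given path. Note that textFileStream only monitors the directory and reads files newly created in it, so copy the files to be counted into the input directory while the application is running.
3. Run a word count on the text.
4. Apply three output operations: print, saveAsTextFiles, and saveAsObjectFiles.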
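
Step 3 can be tried without a cluster. The sketch below applies the same split, pair, and sum logic to one batch of lines held in an ordinary Scala collection, with groupBy standing in for reduceByKey. WordCountSketch and countWords are hypothetical names, and the filter that drops empty tokens is an addition that the streaming job does not have.

```scala
// Plain-collection sketch of the word count applied to a single batch.
object WordCountSketch {
  def countWords(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(_.split(" "))   // split each line into words
      .filter(_.nonEmpty)      // assumption: ignore empty tokens from repeated spaces
      .map(word => (word, 1))  // pair each word with a count of 1
      .groupBy(_._1)           // group the pairs by word
      .map { case (word, pairs) => (word, pairs.map(_._2).sum) } // sum per word

  def main(args: Array[String]): Unit = {
    val batch = Seq("spark streaming spark", "streaming file")
    countWords(batch).foreach { case (word, n) => println(s"$word -> $n") }
  }
}
```

In the streaming job the same logic runs once per batch interval, so counts are per batch and are not accumulated across batches.
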

package doc

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

/**
  * @Author huangwei
  * @Date 19-10-17 
  * @Comments : Write a DStream to files
  **/
object DStreamSaveFile {

  def main(args: Array[String]): Unit = {

    val conf = new SparkConf()
      .setAppName("DStreamSaveFile")
      .setMaster("local[*]")
    val ssc = new StreamingContext(conf,Seconds(3))

    val input = "/home/huangwei/input"
    val output = "/home/huangwei"
    println("read file name:" + input)
    val textStream = ssc.textFileStream("file://"+input)
    val wcStream = textStream.flatMap{ line => line.split(" ")}
      .map{ word => (word,1)}
      .reduceByKey(_ + _)
    wcStream.print()
    // Save to the given directories, using a separate output prefix per format
    wcStream.saveAsTextFiles("file://"+output+"/saveAsTextFiles")
    wcStream.saveAsObjectFiles("file://"+output+"/saveAsObjectFiles")

    ssc.start()
    ssc.awaitTermination()
  }

}


II. Writing a DStream to MySQL

Use the C3P0 connection pool to build a general-purpose database connection class

package doc

import java.sql.Connection

import com.mchange.v2.c3p0.ComboPooledDataSource


/**
  * @Author huangwei
  * @Date 19-10-17 
  * @Comments
  **/
class MysqlPool extends Serializable {
  private val cpds:ComboPooledDataSource = new ComboPooledDataSource(true)
  private val conf = Conf.mysqlConfig
  try {
    // Configure the MySQL connection settings through c3p0
    cpds.setJdbcUrl(conf.get("url").getOrElse("jdbc:mysql://localhost:3306/test?useUnicode=true&characterEncoding=UTF-8"))
    cpds.setDriverClass("com.mysql.jdbc.Driver")
    cpds.setUser(conf.get("username").getOrElse("root"))
    cpds.setPassword(conf.get("password").getOrElse("Mysql_123"))
    cpds.setMaxPoolSize(200)
    cpds.setMinPoolSize(20)
    cpds.setAcquireIncrement(5)
    cpds.setMaxStatements(180)
  }catch {
    case e:Exception => e.printStackTrace()
  }
  // Get a connection from the pool
  def getConnection:Connection = {
    try {
      return cpds.getConnection()
    }catch {
      case ex:Exception => ex.printStackTrace()
        null
    }
  }
}
object MysqlManager {
  var mysqlManager:MysqlPool = _
  def getMysqlManager:MysqlPool = {
    synchronized {
      if (mysqlManager == null) {
        mysqlManager = new MysqlPool
      }
    }
    mysqlManager
  }

}
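
The point of MysqlManager is that each JVM (the driver and every executor) builds the pool once, on first use, and reuses it afterwards. A minimal sketch of the same idea with a lazy val, a different technique from the synchronized null check used above, is shown below. FakePool, PoolStats, FakePoolManager, and LazySingletonDemo are hypothetical stand-ins so the example runs without a database.

```scala
import java.util.concurrent.atomic.AtomicInteger

// Counts constructor calls so the sketch can show how many pools were built.
object PoolStats { val created = new AtomicInteger(0) }

// FakePool is a hypothetical stand-in for MysqlPool; it opens no connections.
class FakePool { PoolStats.created.incrementAndGet() }

object FakePoolManager {
  // A lazy val is initialized on first access and at most once,
  // even when several threads request it concurrently.
  lazy val pool: FakePool = new FakePool
}

object LazySingletonDemo {
  def main(args: Array[String]): Unit = {
    // Several threads ask for the pool at the same time.
    val threads = (1 to 8).map { _ =>
      new Thread(new Runnable { def run(): Unit = { FakePoolManager.pool; () } })
    }
    threads.foreach(_.start())
    threads.foreach(_.join())
    println(s"pools created: ${PoolStats.created.get}")
  }
}
```

Either approach keeps the pool out of the closure that Spark serializes: the tasks only reference the singleton object, and the pool itself is created on whichever JVM runs them.
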

The MySQL output operation

package doc

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

/**
  * @Author huangwei
  * @Date 19-10-17 
  * @Comments Write a DStream to MySQL
  **/
object DStreamMySQL {
  def main(args: Array[String]): Unit = {

    val conf = new SparkConf()
      .setAppName("DStreamMySQL")
      .setMaster("local[*]")
    val ssc = new StreamingContext(conf,Seconds(3))

    val input = "/home/huangwei/input"

    println("read file name:" + input)
    val textStream = ssc.textFileStream("file://"+input)
    val dsStream = textStream.map{line => line.split(",")}.map(f =>(f(0),f(1),f(2),f(3)))
    dsStream.print()
    dsStream.foreachRDD(rdd =>{
      if (!rdd.isEmpty()) {
        rdd.foreachPartition(partitionRecords => {
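          // Get a connection from the pool
          val conn = MysqlManager.getMysqlManager.getConnection
          // Parameterized insert: values are bound instead of concatenated into the SQL text
          val statement = conn.prepareStatement(
            "insert into person(name,gender,age,homeaddress) values(?,?,?,?)")
          try {
            conn.setAutoCommit(false)
            partitionRecords.foreach(record => {
              // Bind one record's fields to the insert statement
              statement.setString(1, record._1)
              statement.setString(2, record._2)
              statement.setInt(3, record._3.toInt)
              statement.setString(4, record._4)
              statement.addBatch()      // add the row to the batch
            })
            statement.executeBatch()    // execute the batch
            conn.commit()               // commit the transaction
          }catch {
            case e:Exception => e.printStackTrace()
          } finally {
            statement.close()           // close the statement
            conn.close()                // return the connection to the pool
          }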
          }
        })
      }
    })


    ssc.start()
    ssc.awaitTermination()
  }

}
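
The job above indexes f(0) to f(3) directly, so a line with fewer than four fields, or a non-numeric age, throws inside the task. One possible way to guard against that is sketched below. PersonLineParser and parse are hypothetical names, and trimming the fields is an assumption about the input format.

```scala
import scala.util.Try

// Sketch of defensive parsing for lines shaped like "name,gender,age,homeaddress".
object PersonLineParser {
  // Returns Some((name, gender, age, homeaddress)) for a well-formed line, None otherwise.
  def parse(line: String): Option[(String, String, Int, String)] =
    line.split(",").map(_.trim) match {
      case Array(name, gender, age, address) =>
        // A non-numeric age yields None instead of throwing.
        Try(age.toInt).toOption.map(a => (name, gender, a, address))
      case _ => None // wrong number of fields
    }

  def main(args: Array[String]): Unit = {
    val lines = Seq("Tom,M,30,NY", "broken line", "Ann,F,abc,LA")
    // flatMap keeps only the lines that parsed successfully.
    lines.flatMap(line => parse(line)).foreach(println)
  }
}
```

In the streaming job, a parser like this could take the place of the split-and-index step so that malformed lines are skipped rather than failing the batch.
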

Troubleshooting: Caused by: java.sql.SQLException: Incorrect string value: '\xE9\x9B\xB7\xE5\x86\x9B' for column …


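This is an encoding problem, caused by the charset and collation of the database table.
Fix: try changing the table's charset to utf8 and its collation to a utf8 collation such as utf8_unicode_ci. Here I dropped the table and recreated it, using utf8_general_ci: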

mysql> DROP TABLE person;
mysql> CREATE TABLE person ( name varchar(10), gender varchar(5), age int, homeaddress varchar(10)) charset utf8 collate utf8_general_ci;
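
To see why the error points to the charset, it can help to decode the bytes in the message. The sketch below shows that \xE9\x9B\xB7\xE5\x86\x9B is the UTF-8 encoding of the name 雷军 from the input data, and that a single-byte charset such as latin1 cannot represent it. That the original table used latin1 is an assumption; the post does not say which charset it had. IncorrectStringValueDemo and mysqlHex are hypothetical names.

```scala
import java.nio.charset.StandardCharsets

object IncorrectStringValueDemo {
  // Formats the UTF-8 bytes of a string the way the MySQL error message prints them.
  def mysqlHex(s: String): String =
    s.getBytes(StandardCharsets.UTF_8)
      .map(b => "\\x" + "%02X".format(b & 0xff))
      .mkString

  def main(args: Array[String]): Unit = {
    val name = "\u96f7\u519b" // the two characters of the name in the error message
    println(mysqlHex(name))
    // ISO-8859-1 (latin1) has no code points for these characters.
    println(StandardCharsets.ISO_8859_1.newEncoder().canEncode(name))
    // UTF-8 can encode them.
    println(StandardCharsets.UTF_8.newEncoder().canEncode(name))
  }
}
```
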

Run output

[Screenshot of the run output, not reproduced here]

Query result from the database

mysql> select * from person;
+-----------+--------+------+---------------+
| name      | gender | age  | homeaddress   |
+-----------+--------+------+---------------+
| 马云      | 男     |   55 | 浙江-杭州     |
| 马化腾    | 男     |   48 | 广东-深圳     |
| 李彦宏    | 男     |   51 | 山西-阳泉     |
| 刘强东    | 男     |   46 | 江苏-宿迁     |
| 雷军      | 男     |   50 | 湖北-仙桃     |
+-----------+--------+------+---------------+
5 rows in set (0.00 sec)
