Spark Streaming Real-Time Stream Processing Project 6: Spark Streaming in Action, Part 1

Spark Streaming Real-Time Stream Processing Project 1: The Distributed Log Collection Framework Flume

Spark Streaming Real-Time Stream Processing Project 2: The Distributed Message Queue Kafka

Spark Streaming Real-Time Stream Processing Project 3: Integrating Flume and Kafka for Real-Time Data Collection

Spark Streaming Real-Time Stream Processing Project 4: Setting Up the Hands-On Environment

Spark Streaming Real-Time Stream Processing Project 5: Getting Started with Spark Streaming

Spark Streaming Real-Time Stream Processing Project 6: Spark Streaming in Action, Part 1

Spark Streaming Real-Time Stream Processing Project 7: Spark Streaming in Action, Part 2

Spark Streaming Real-Time Stream Processing Project 8: Integrating Spark Streaming with Flume

Spark Streaming Real-Time Stream Processing Project 9: Integrating Spark Streaming with Kafka

Spark Streaming Real-Time Stream Processing Project 10: Developing a Log Generator and Outputting Logs with log4j

Spark Streaming Real-Time Stream Processing Project 11: Comprehensive Project

Source code

Case 1: Processing socket data with Spark Streaming

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

/**
  * @author YuZhansheng
  * @desc Process socket data with Spark Streaming (word count)
  * @create 2019-02-19 11:26
  */
object NetworkWordCount {

  def main(args: Array[String]): Unit = {

    val sparkConf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")

    // Creating a StreamingContext requires two arguments: a SparkConf and the batch interval
    val ssc = new StreamingContext(sparkConf,Seconds(5))

    val lines = ssc.socketTextStream("localhost",6789)

    val result = lines.flatMap(_.split(" ")).map((_,1)).reduceByKey(_+_)

    result.print()

    ssc.start()
    ssc.awaitTermination()
  }

}

Test: use netcat as the data source: nc -lk 6789

Type some words into the nc session; the console prints the word counts for each batch.

Case 2: Processing file system data (HDFS and the local file system) with Spark Streaming

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

/**
  * @author YuZhansheng
  * @desc Process file system data (HDFS and the local file system) with Spark Streaming
  * @create 2019-02-19 21:31
  */
object FileWordCount {

    def main(args: Array[String]): Unit = {

        val sparkConf = new SparkConf().setMaster("local").setAppName("FileWordCount")
        val ssc = new StreamingContext(sparkConf,Seconds(5))

        // Monitor the /root/DataSet directory for newly added files
        val lines = ssc.textFileStream("/root/DataSet")

        val result = lines.flatMap(_.split(" ")).map((_,1)).reduceByKey(_+_)

        result.print()

        ssc.start()
        ssc.awaitTermination()
    }
}

Copy or move text files into the /root/DataSet directory and watch the console output. Note that textFileStream only processes files that appear in the directory after the application starts, and files should be moved or renamed into the directory atomically rather than written in place.

Case 3: Word count with Spark Streaming, writing the results to MySQL

Side note: I forgot the MySQL root password and it took a while to reset it, so I am recording the steps here (see also this blog post).

Steps:

1. First, make sure the server is in a safe state, meaning nobody can connect to the MySQL database at will. While the root password is being reset, MySQL runs with no password protection at all, and any user could log in and modify its data. You can approximate a safe state by closing MySQL's external port and stopping Apache and all user processes. The safest option is to work on the server console directly and unplug the network cable.

2. Change the MySQL startup settings:
# vim /etc/my.cnf
Add the line skip-grant-tables to the [mysqld] section, for example:
[mysqld]
datadir=/var/lib/mysql
socket=/var/lib/mysql/mysql.sock
skip-grant-tables
Save and quit vim.

3. Restart mysqld:
# service mysqld restart 
Stopping MySQL: [ OK ] 
Starting MySQL: [ OK ]

4. Log in and change the MySQL root password:
# mysql 
Welcome to the MySQL monitor. Commands end with ; or \g. 
Your MySQL connection id is 3 to server version: 3.23.56 
Type 'help;' or '\h' for help. Type '\c' to clear the buffer. 
mysql> USE mysql ; 
Database changed 
mysql> UPDATE user SET Password = password ( 'new-password' ) WHERE User = 'root' ; 
Query OK, 0 rows affected (0.00 sec) 
Rows matched: 2 Changed: 0 Warnings: 0 
mysql> flush privileges ; 
Query OK, 0 rows affected (0.01 sec) 
mysql> quit

5. Revert the MySQL login settings:
# vim /etc/my.cnf
Remove the skip-grant-tables line that was added to the [mysqld] section.
Save and quit vim.

6. Restart mysqld again:
# service mysqld restart 
Stopping MySQL: [ OK ] 
Starting MySQL: [ OK ]

Preparation: install MySQL, start the service with service mysqld start (or service mysql start), log in with the mysql client, and create the database and table:

mysql> create database spark; 
Query OK, 1 row affected (0.00 sec)

mysql> use spark;
Database changed

mysql> create table wordcount(
    -> word varchar(50) default null,
    -> wordcount int(10) default null
    -> );

The project needs the MySQL JDBC driver, so add it to pom.xml first:



<dependency>
    <groupId>mysql</groupId>
    <artifactId>mysql-connector-java</artifactId>
    <version>5.1.47</version>
</dependency>


import java.sql.DriverManager

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

/**
  * @author YuZhansheng
  * @desc Word count with Spark Streaming, writing the results to MySQL
  * @create 2019-02-20 11:04
  */
object ForeachRDDApp {
    def main(args: Array[String]): Unit = {

        val sparkConf = new SparkConf().setAppName("ForeachRDDApp").setMaster("local[2]")

        val ssc = new StreamingContext(sparkConf,Seconds(5))

        val lines = ssc.socketTextStream("localhost",6789)

        val result = lines.flatMap(_.split(" ")).map((_,1)).reduceByKey(_+_)

        //result.print()  // this would only print the counts to the console

        //TODO write the results to MySQL
        result.foreachRDD(rdd => {
            val connection = createConnection()
            rdd.foreach{ record =>
                val sql = "insert into wordcount(word, wordcount) values('"+record._1+"',"+record._2 +")"
                connection.createStatement().execute(sql)
            }
        })

        ssc.start()
        ssc.awaitTermination()
    }

    // Obtain a MySQL connection
    def createConnection() = {
        Class.forName("com.mysql.jdbc.Driver")
        DriverManager.getConnection("jdbc:mysql://localhost:3306/spark","root","18739548870yu")
    }
}

Running the code above throws a serialization exception: the JDBC Connection is created on the driver and captured by the closure passed to rdd.foreach, which runs on the executors, and java.sql.Connection is not serializable:

19/02/20 11:27:18 ERROR JobScheduler: Error running job streaming job 1550633235000 ms.0
org.apache.spark.SparkException: Task not serializable
	at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:298)
	at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288)
	at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108)
	at org.apache.spark.SparkContext.clean(SparkContext.scala:2287)
	at org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:917)
	at org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:916)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
	at org.apache.spark.rdd.RDD.foreach(RDD.scala:916)
	at com.xidian.spark.ForeachRDDApp$$anonfun$main$1.apply(ForeachRDDApp.scala:30)
	at com.xidian.spark.ForeachRDDApp$$anonfun$main$1.apply(ForeachRDDApp.scala:28)

Fix: create the connection on the executor side, inside foreachPartition, so it is never serialized and one connection is shared by all records in a partition. Modified code:


import java.sql.DriverManager

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

/**
  * @author YuZhansheng
  * @desc Word count with Spark Streaming, writing the results to MySQL
  * @create 2019-02-20 11:04
  */
object ForeachRDDApp {
    def main(args: Array[String]): Unit = {

        val sparkConf = new SparkConf().setAppName("ForeachRDDApp").setMaster("local[2]")

        val ssc = new StreamingContext(sparkConf,Seconds(5))

        val lines = ssc.socketTextStream("localhost",6789)

        val result = lines.flatMap(_.split(" ")).map((_,1)).reduceByKey(_+_)

        //result.print()  // this would only print the counts to the console

        //TODO write the results to MySQL
//        result.foreachRDD(rdd => {
//            val connection = createConnection()
//            rdd.foreach{ record =>
//                val sql = "insert into wordcount(word, wordcount) values('"+record._1+"',"+record._2 +")"
//                connection.createStatement().execute(sql)
//            }
//        })  // throws the serialization exception shown above

        result.foreachRDD(rdd => {
            rdd.foreachPartition(partitionOfRecords => {
                val connection = createConnection()
                partitionOfRecords.foreach(record => {
                    val sql = "insert into wordcount(word, wordcount) values('"+record._1+"',"+record._2 +")"
                    connection.createStatement().execute(sql)
                })
                connection.close()
            })
        })

        ssc.start()
        ssc.awaitTermination()
    }

    // Obtain a MySQL connection
    def createConnection() = {
        Class.forName("com.mysql.jdbc.Driver")
        DriverManager.getConnection("jdbc:mysql://localhost:3306/spark","root","18739548870yu")
    }
}

Test: nc -lk 6789, then type:
a d ff g h

The MySQL table now contains:

mysql> select * from wordcount;
+------+-----------+
| word | wordcount |
+------+-----------+
| d    |         1 |
| h    |         1 |
| ff   |         1 |
| a    |         1 |
| g    |         1 |
+------+-----------+
5 rows in set (0.00 sec)

Remaining problems with this example:

1. Existing rows are never updated; every batch blindly inserts, so the same word can appear in the table many times.

Improvement: before inserting, check whether the word already exists; update it if it does, insert it otherwise (or use an upsert). In real production systems, however, HBase or Redis is usually used for this kind of storage. See the sketch after this list.

2. A connection is created for every partition of every RDD; switching to a connection pool would be more efficient, as shown in the sketch below.
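
The following is a minimal sketch that combines both improvements: one connection pool per executor and an upsert instead of a plain insert. It assumes the wordcount table has been given a unique key on word (for example: alter table wordcount add unique key uk_word(word)) and that the HikariCP pool library (com.zaxxer:HikariCP) has been added to pom.xml; the MySQLConnectionPool object and the pool settings are illustrative, not part of the original project.

import com.zaxxer.hikari.{HikariConfig, HikariDataSource}

// One pool per executor JVM: a Scala object is initialized lazily on first use
// inside each JVM and is never shipped with the closure, so nothing needs to be serialized.
object MySQLConnectionPool {
    private val dataSource = {
        val config = new HikariConfig()
        config.setJdbcUrl("jdbc:mysql://localhost:3306/spark")
        config.setUsername("root")
        config.setPassword("18739548870yu")
        config.setMaximumPoolSize(10)
        new HikariDataSource(config)
    }

    def getConnection = dataSource.getConnection
}

// Replacement for the foreachRDD block in ForeachRDDApp:
result.foreachRDD(rdd => {
    rdd.foreachPartition(partitionOfRecords => {
        val connection = MySQLConnectionPool.getConnection
        // Upsert: insert a new word, or add this batch's count to the existing row.
        val statement = connection.prepareStatement(
            "insert into wordcount(word, wordcount) values(?, ?) " +
            "on duplicate key update wordcount = wordcount + values(wordcount)")
        partitionOfRecords.foreach(record => {
            statement.setString(1, record._1)
            statement.setInt(2, record._2)
            statement.executeUpdate()
        })
        statement.close()
        connection.close()  // returns the connection to the pool rather than closing it
    })
})

Using a PreparedStatement here also removes the string concatenation from the original SQL, which would break on input containing quotes.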
