Learning Spark from Scratch: Spark Streaming Programming Practice (WordCount)

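WordCount Example

Requirements:

Use the netcat tool to continuously send data over port 9999, and use Spark Streaming to read the data from that port and count how many times each distinct word occurs.

Procedure

(1) Add the Spark Streaming dependency to pom.xml and wait for the project to refresh. The contents of pom.xml are as follows: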


<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>org.example</groupId>
    <artifactId>SparkWordCountShangguigu</artifactId>
    <version>1.0-SNAPSHOT</version>

    <dependencies>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.11</artifactId>
            <version>2.1.1</version>
        </dependency>

        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming_2.11</artifactId>
            <version>2.1.1</version>
        </dependency>
    </dependencies>

    <build>
        <plugins>
            <plugin>
                <groupId>net.alchim31.maven</groupId>
                <artifactId>scala-maven-plugin</artifactId>
                <version>3.4.6</version>
                <executions>
                    <execution>
                        <goals>
                            <goal>compile</goal>
                            <goal>testCompile</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>

</project>

(2) Create a new Scala file of type Object and write the following code:

package com.zchi

import org.apache.spark.streaming.dstream.{DStream, ReceiverInputDStream}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.SparkConf

object wordCountStreaming {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("StreamingWordCount").setMaster("local[*]")
    // 1. Create the Spark Streaming entry point, StreamingContext. The second argument is the batch interval; a SparkContext is created internally.
    val ssc = new StreamingContext(conf, Seconds(3))
    // 2. Create a DStream that reads lines of text from the socket
    val lines: ReceiverInputDStream[String] = ssc.socketTextStream("localhost", 9999)
    // 3. Split each line into individual words
    val words: DStream[String] = lines.flatMap(_.split("""\s+"""))
    // 4. Turn each word into a (word, 1) tuple
    val wordAndOne: DStream[(String, Int)] = words.map((_, 1))
    // 5. Count the occurrences of each word within the batch
    val count: DStream[(String, Int)] = wordAndOne.reduceByKey(_ + _)
    // 6. Print the first elements of each batch to the console
    count.print()
    // 7. Start receiving data and computing
    ssc.start()
    // 8. Block the main program until the computation ends (stopped manually or by an exception)
    ssc.awaitTermination()
  }
}

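Note that the hostname on line 13 (the first argument of ssc.socketTextStream) should be changed as needed to the host name of the data source. My experiment runs locally, so the hostname is localhost.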
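
The per-batch logic of the program above can be sketched on plain Scala collections, without Spark. This is only an illustration under stated assumptions: the object name `BatchWordCountSketch`, the method `countBatch`, and the sample input lines are hypothetical, and `groupBy` followed by a sum stands in for `reduceByKey`.

```scala
object BatchWordCountSketch {
  // Counts the words of ONE batch, mirroring flatMap -> map -> reduceByKey.
  def countBatch(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(_.split("""\s+"""))      // split each line on whitespace
      .map(word => (word, 1))           // pair every word with a count of 1
      .groupBy(_._1)                    // collect the pairs of each word
      .map { case (word, pairs) => word -> pairs.map(_._2).sum }

  def main(args: Array[String]): Unit = {
    // Hypothetical input: three batches, the second one empty.
    val batches = Seq(
      Seq("hello spark", "hello streaming"),
      Seq.empty[String],
      Seq("hello again")
    )
    // Each batch is counted on its own; nothing carries over between batches.
    batches.foreach(batch => println(countBatch(batch)))
  }
}
```

Because each batch is counted independently, a word typed in two different batches is reported with a count of 1 in each of them rather than with a running total.
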
(3) Start Hadoop

$ cd /usr/local/hadoop
$ sbin/start-all.sh

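(4) Import the log4j.properties configuration file

Because this is a streaming job, the console continuously prints INFO and WARN messages at run time. We do not need these messages, and they drown out the important output, so import the template below to adjust what the console prints. The file is named log4j.properties.template. Copy it, rename the copy to log4j.properties, and put it in the resources folder.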
Then press Ctrl+R in IDEA and replace every INFO and WARN in the file with ERROR, so that only error messages are printed.
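
The effect of that find-and-replace step can be sketched in Scala as a plain string substitution. This is a minimal sketch, not part of the original setup: the object name `Log4jLevelSketch` and the method `quiet` are hypothetical, and the word-boundary regex is an assumption (an editor's plain replace matches the same text in this template).

```scala
object Log4jLevelSketch {
  // Replaces the whole words INFO and WARN with ERROR, like the Ctrl+R step.
  def quiet(template: String): String =
    template.replaceAll("""\b(INFO|WARN)\b""", "ERROR")

  def main(args: Array[String]): Unit = {
    // Two lines taken from log4j.properties.template.
    val sample = Seq(
      "log4j.rootCategory=INFO, console",
      "log4j.logger.org.apache.spark.repl.Main=WARN"
    ).mkString("\n")
    println(quiet(sample))
  }
}
```

Lines that already use ERROR or FATAL are left unchanged by the substitution.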

Below is the original, unmodified log4j.properties.template file:

# Set everything to be logged to the console
log4j.rootCategory=INFO, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n

# Set the default spark-shell log level to WARN. When running the spark-shell, the
# log level for this class is used to overwrite the root logger's log level, so that
# the user can have different defaults for the shell and regular Spark apps.
log4j.logger.org.apache.spark.repl.Main=WARN

# Settings to quiet third party logs that are too verbose
log4j.logger.org.spark_project.jetty=WARN
log4j.logger.org.spark_project.jetty.util.component.AbstractLifeCycle=ERROR
log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=INFO
log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=INFO
log4j.logger.org.apache.parquet=ERROR
log4j.logger.parquet=ERROR

# SPARK-9183: Settings to avoid annoying messages when looking up nonexistent UDFs in SparkSQL with Hive support
log4j.logger.org.apache.hadoop.hive.metastore.RetryingHMSHandler=FATAL
log4j.logger.org.apache.hadoop.hive.ql.exec.FunctionRegistry=ERROR

(5) Start netcat in a terminal:

$ nc -lk 9999

If nothing is printed, netcat has started successfully.
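
If netcat is not available, a rough stand-in can be sketched in Scala with `java.net.ServerSocket`. The object name `LineServerSketch`, the method `serveOnce`, and the sample lines are hypothetical and chosen only for illustration. Unlike `nc -lk`, this sketch sends a fixed list of lines to the first client that connects and then closes the connection, instead of forwarding keyboard input.

```scala
import java.io.{OutputStreamWriter, PrintWriter}
import java.net.ServerSocket
import java.nio.charset.StandardCharsets

object LineServerSketch {
  // Waits for one client on `server`, sends each line, then closes the connection.
  def serveOnce(server: ServerSocket, lines: Seq[String]): Unit = {
    val client = server.accept()
    try {
      val writer = new OutputStreamWriter(client.getOutputStream(), StandardCharsets.UTF_8)
      val out = new PrintWriter(writer, true)
      lines.foreach(line => out.println(line))
      out.flush()
    } finally client.close()
  }

  def main(args: Array[String]): Unit = {
    // Listen on the same port that the streaming program connects to.
    val server = new ServerSocket(9999)
    try serveOnce(server, Seq("hello spark", "hello streaming"))
    finally server.close()
  }
}
```

For interactive experiments, netcat remains the simpler choice because it keeps the connection open and sends whatever is typed.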

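(6) Run the Scala file in IDEA, then type some text into the terminal where netcat is running. A result is printed every three seconds.
Streaming input in the terminal: [Figure 2: screenshot of the terminal input]
Console output: [Figure 3: screenshot of the console output], reproduced as text below: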

-------------------------------------------
Time: 1591494186000 ms
-------------------------------------------
(zhangchi,1)
(my,1)
(is,1)
(name,1)

-------------------------------------------
Time: 1591494189000 ms
-------------------------------------------

-------------------------------------------
Time: 1591494192000 ms
-------------------------------------------

-------------------------------------------
Time: 1591494195000 ms
-------------------------------------------

-------------------------------------------
Time: 1591494198000 ms
-------------------------------------------

-------------------------------------------
Time: 1591494201000 ms
-------------------------------------------

-------------------------------------------
Time: 1591494204000 ms
-------------------------------------------
(im,1)
(university,1)
(from,1)
(henan,1)
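
The timestamps in this output can be checked against the `Seconds(3)` batch interval with a small Scala sketch. The object name `BatchTimeSketch` and the method `gaps` are hypothetical; the timestamps are copied from the output above.

```scala
object BatchTimeSketch {
  // Differences between consecutive batch timestamps, in milliseconds.
  def gaps(times: Seq[Long]): Seq[Long] =
    times.zip(times.drop(1)).map { case (earlier, later) => later - earlier }

  def main(args: Array[String]): Unit = {
    // Batch times copied from the console output above.
    val times = Seq(
      1591494186000L, 1591494189000L, 1591494192000L, 1591494195000L,
      1591494198000L, 1591494201000L, 1591494204000L
    )
    // Every gap should equal the 3-second batch interval.
    println(gaps(times).forall(_ == 3000L))
  }
}
```

A batch header is printed for every interval, including the intervals in which nothing was typed, which is why several batches in the output are empty.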

This completes the experiment.

