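Use the netcat tool to continuously send data to port 9999, and use Spark Streaming to read the data from that port and count how many times each word occurs.
(1) Add the Spark Streaming dependency to pom.xml and wait for the project to refresh. The pom.xml file is as follows: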
<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>org.example</groupId>
    <artifactId>SparkWordCountShangguigu</artifactId>
    <version>1.0-SNAPSHOT</version>
    <dependencies>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.11</artifactId>
            <version>2.1.1</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming_2.11</artifactId>
            <version>2.1.1</version>
        </dependency>
    </dependencies>
    <build>
        <plugins>
            <plugin>
                <groupId>net.alchim31.maven</groupId>
                <artifactId>scala-maven-plugin</artifactId>
                <version>3.4.6</version>
                <executions>
                    <execution>
                        <goals>
                            <goal>compile</goal>
                            <goal>testCompile</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>
</project>
(2) Create a new Scala file of type Object and write the code:
package com.zchi

import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.{DStream, ReceiverInputDStream}
import org.apache.spark.streaming.{Seconds, StreamingContext}

object wordCountStreaming {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("StreamingWordCount").setMaster("local[*]")
    // 1. Create the Spark Streaming entry point, StreamingContext. The second argument is the
    //    batch interval; a SparkContext is created internally.
    val ssc = new StreamingContext(conf, Seconds(3))
    // 2. Create a DStream that reads lines of text from the socket
    val lines: ReceiverInputDStream[String] = ssc.socketTextStream("localhost", 9999)
    // 3. Split each line into individual words
    val words: DStream[String] = lines.flatMap(_.split("""\s+"""))
    // 4. Turn each word into a (word, 1) tuple
    val wordAndOne: DStream[(String, Int)] = words.map((_, 1))
    // 5. Count the occurrences of each word within the batch
    val count: DStream[(String, Int)] = wordAndOne.reduceByKey(_ + _)
    // 6. Print the result of each batch
    count.print()
    // 7. Start receiving data and computing
    ssc.start()
    // 8. Block the main program until the computation ends (manual exit or an exception)
    ssc.awaitTermination()
  }
}
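Note that the hostname argument of socketTextStream should be changed as needed to the host that the data comes from. My experiment runs locally, so the hostname is localhost.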
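To make steps 3 to 5 of the program concrete, here is a minimal sketch of the same transformation applied to a single batch, using plain Scala collections instead of a DStream. The names BatchWordCount and countWords are hypothetical and exist only for this illustration, and groupBy followed by a sum stands in for reduceByKey.

```scala
object BatchWordCount {
  // Count the words in one batch of lines: split, pair with 1, sum per word.
  def countWords(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(_.split("""\s+"""))      // same split as the streaming job
      .map(word => (word, 1))           // (word, 1) tuples
      .groupBy(_._1)                    // stand-in for reduceByKey: group by word
      .map { case (word, pairs) => (word, pairs.map(_._2).sum) }

  def main(args: Array[String]): Unit = {
    // One line typed into netcat becomes one element of the batch.
    countWords(Seq("my name is zhangchi")).foreach(println)
  }
}
```

In the streaming job this computation runs once per batch interval, so the counts start from zero in every batch rather than accumulating over the lifetime of the program.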
(3) Start Hadoop
cd /usr/local/hadoop
sbin/start-all.sh
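(4) Add a log4j.properties configuration file
Because this is a streaming job, the console continuously prints INFO and WARN messages while it runs. We do not need them, and they bury the important output, so we use the template below to change what the console prints. The template file is named log4j.properties.template: copy it, rename the copy to log4j.properties, and put it in the resources folder.
Then press Ctrl+R and replace every INFO and WARN in the file with ERROR, so that only error messages are printed.
The original, unmodified log4j.properties.template file is: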
# Set everything to be logged to the console
log4j.rootCategory=INFO, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
# Set the default spark-shell log level to WARN. When running the spark-shell, the
# log level for this class is used to overwrite the root logger's log level, so that
# the user can have different defaults for the shell and regular Spark apps.
log4j.logger.org.apache.spark.repl.Main=WARN
# Settings to quiet third party logs that are too verbose
log4j.logger.org.spark_project.jetty=WARN
log4j.logger.org.spark_project.jetty.util.component.AbstractLifeCycle=ERROR
log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=INFO
log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=INFO
log4j.logger.org.apache.parquet=ERROR
log4j.logger.parquet=ERROR
# SPARK-9183: Settings to avoid annoying messages when looking up nonexistent UDFs in SparkSQL with Hive support
log4j.logger.org.apache.hadoop.hive.metastore.RetryingHMSHandler=FATAL
log4j.logger.org.apache.hadoop.hive.ql.exec.FunctionRegistry=ERROR
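The find-and-replace step can also be written as a few lines of Scala, which makes it easy to see exactly which lines change. This is only a sketch: QuietLog4j and toErrorOnly are hypothetical names, and unlike a blanket replace-all in the editor it leaves comment lines untouched.

```scala
object QuietLog4j {
  // Rewrite the INFO and WARN levels to ERROR on every non-comment line.
  def toErrorOnly(template: String): String =
    template
      .split("\n", -1)
      .map { line =>
        if (line.trim.startsWith("#")) line                       // keep comments as they are
        else line.replaceAll("""\b(INFO|WARN)\b""", "ERROR")      // whole-word match only
      }
      .mkString("\n")

  def main(args: Array[String]): Unit = {
    // A few lines taken from the template above.
    val sample = Seq(
      "# Set everything to be logged to the console",
      "log4j.rootCategory=INFO, console",
      "log4j.logger.org.apache.spark.repl.Main=WARN",
      "log4j.logger.org.apache.parquet=ERROR"
    ).mkString("\n")
    println(toErrorOnly(sample))
  }
}
```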
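(5) Start netcat in a terminal (-l listens on the port, -k keeps listening after a client disconnects):
nc -lk 9999
If it prints nothing, it has started successfully.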
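On a machine without netcat, a few lines of Scala can play the same role for a quick test, since the Spark receiver only needs a TCP server that sends newline-terminated text. The sketch below is a simplified stand-in, not a replacement for nc: LineServer and serveOnce are hypothetical names, it serves a single client, and it sends a fixed list of lines instead of reading from the keyboard.

```scala
import java.io.PrintWriter
import java.net.ServerSocket

object LineServer {
  // Accept a single client and send it each line, terminated by a newline.
  def serveOnce(server: ServerSocket, lines: Seq[String]): Unit = {
    val client = server.accept()
    try {
      val out = new PrintWriter(client.getOutputStream, true) // autoflush on println
      lines.foreach(line => out.println(line))
    } finally client.close()
  }

  def main(args: Array[String]): Unit = {
    val server = new ServerSocket(9999) // same port as the nc command
    try serveOnce(server, Seq("my name is zhangchi"))
    finally server.close()
  }
}
```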
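Then run the Scala file in IDEA and type some text into the netcat terminal. A result is printed every three seconds, one per batch.
The lines typed into the terminal produce the following console output: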
-------------------------------------------
Time: 1591494186000 ms
-------------------------------------------
(zhangchi,1)
(my,1)
(is,1)
(name,1)
-------------------------------------------
Time: 1591494189000 ms
-------------------------------------------
-------------------------------------------
Time: 1591494192000 ms
-------------------------------------------
-------------------------------------------
Time: 1591494195000 ms
-------------------------------------------
-------------------------------------------
Time: 1591494198000 ms
-------------------------------------------
-------------------------------------------
Time: 1591494201000 ms
-------------------------------------------
-------------------------------------------
Time: 1591494204000 ms
-------------------------------------------
(im,1)
(university,1)
(from,1)
(henan,1)
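The Time values in the output are batch timestamps in milliseconds since the Unix epoch, and consecutive batches are 3000 ms apart, which matches Seconds(3) in the program. A small sketch to make them readable, using only java.time; BatchTimes and describe are hypothetical names:

```scala
import java.time.Instant

object BatchTimes {
  // Convert a batch timestamp in epoch milliseconds to an ISO-8601 UTC string.
  def describe(epochMillis: Long): String =
    Instant.ofEpochMilli(epochMillis).toString

  def main(args: Array[String]): Unit = {
    // The first three batch times from the output above.
    val times = Seq(1591494186000L, 1591494189000L, 1591494192000L)
    times.foreach(t => println(s"$t ms = ${describe(t)}"))
    // Gap between consecutive batches, in milliseconds
    println(times.zip(times.tail).map { case (a, b) => b - a })
  }
}
```

Batches with no lines between their dashes are intervals in which nothing was typed into netcat.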
(6) This completes the experiment.