1: Prepare the text file sparkStreamingWordFrep.txt with the following content:
this is a processing of the sparkStreaming data learn use I can process spark it big streming
data learn use I can process spark it big streming
to want I can data learn use I can process spark it big streming
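(The program in part 3 reads this file from the path D://jar/, so save it there, or adjust the txtFile path in the code.)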
2: Set up a Maven-managed project. The relevant sections of the pom.xml are:
<properties>
    <scala.version>2.11.8</scala.version>
</properties>

<repositories>
    <repository>
        <id>repos</id>
        <name>Repository</name>
        <url>http://maven.aliyun.com/nexus/content/groups/public</url>
    </repository>
    <repository>
        <id>scala-tools.org</id>
        <name>Scala-Tools Maven2 Repository</name>
        <url>http://scala-tools.org/repo-releases</url>
    </repository>
</repositories>

<pluginRepositories>
    <pluginRepository>
        <id>repos</id>
        <name>Repository</name>
        <url>http://maven.aliyun.com/nexus/content/groups/public</url>
    </pluginRepository>
    <pluginRepository>
        <id>scala-tools.org</id>
        <name>Scala-Tools Maven2 Repository</name>
        <url>http://scala-tools.org/repo-releases</url>
    </pluginRepository>
</pluginRepositories>

<dependencies>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.11</artifactId>
        <version>2.4.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-sql_2.11</artifactId>
        <version>2.4.0</version>
    </dependency>
    <dependency>
        <groupId>org.scala-lang</groupId>
        <artifactId>scala-library</artifactId>
        <version>${scala.version}</version>
    </dependency>
</dependencies>

<build>
    <sourceDirectory>src/main/scala</sourceDirectory>
    <testSourceDirectory>src/test/scala</testSourceDirectory>
    <plugins>
        <plugin>
            <groupId>org.scala-tools</groupId>
            <artifactId>maven-scala-plugin</artifactId>
            <version>2.15.2</version>
            <executions>
                <execution>
                    <id>scala-compile-first</id>
                    <goals>
                        <goal>compile</goal>
                    </goals>
                    <configuration>
                        <includes>
                            <include>**/*.scala</include>
                        </includes>
                    </configuration>
                </execution>
                <execution>
                    <id>scala-test-compile</id>
                    <goals>
                        <goal>testCompile</goal>
                    </goals>
                </execution>
            </executions>
        </plugin>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-compiler-plugin</artifactId>
            <version>3.1</version>
            <configuration>
                <source>1.8</source>
                <target>1.8</target>
            </configuration>
        </plugin>
        <plugin>
            <artifactId>maven-assembly-plugin</artifactId>
            <configuration>
                <appendAssemblyId>false</appendAssemblyId>
                <descriptorRefs>
                    <descriptorRef>jar-with-dependencies</descriptorRef>
                </descriptorRefs>
                <archive>
                    <manifest>
                        <mainClass>org.jy.data.yh.bigdata.drools.scala.sparkstreaming.SparkStreamingWordFreq</mainClass>
                    </manifest>
                </archive>
            </configuration>
            <executions>
                <execution>
                    <id>make-assembly</id>
                    <phase>package</phase>
                    <goals>
                        <goal>single</goal>
                    </goals>
                </execution>
            </executions>
        </plugin>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-jar-plugin</artifactId>
            <version>2.4</version>
            <configuration>
                <archive>
                    <manifest>
                        <addClasspath>true</addClasspath>
                        <mainClass>org.jy.data.yh.bigdata.drools.scala.sparkstreaming.SparkStreamingWordFreq</mainClass>
                    </manifest>
                </archive>
            </configuration>
        </plugin>
    </plugins>
</build>
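With this pom in place, running mvn clean package should produce a jar-with-dependencies whose manifest main class is SparkStreamingWordFreq (the jar's exact name depends on the project's artifactId, which is not shown here). In this walkthrough, however, the program is run directly from the IDE.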
3: The complete code is as follows
package org.jy.data.yh.bigdata.drools.scala.sparkstreaming

import org.apache.spark.{SparkConf, SparkContext}

/**
 * Word count over a text file with Spark: read the data from the file, split it
 * on spaces, map each word to a <word, 1> key/value pair, and finally reduce by
 * key to aggregate the number of occurrences of each word.
 *
 * flatMap splits the input text on spaces into a flat sequence of words; map then
 * turns each element into an (element, count) pair; finally a reduce (merge) is
 * performed with each word as the key, which yields the frequency of every word.
 * (Note: this step is executed across all the Workers in the cluster, i.e. each
 * Worker may compute only a small part of the result.)
 */
object SparkStreamingWordFreq {
  def main(args: Array[String]): Unit = {
    // Create the SparkConf object
    val sparkConf = new SparkConf()
    sparkConf.setAppName("SparkStreamingWordFreq") // Application name shown in the Spark Web UI
    sparkConf.setMaster("local[*]") // Run in local mode
    // Create the SparkContext object
    val sparkContext = new SparkContext(sparkConf)
    // The data source is a text file
    val txtFile = "D://jar/sparkStreamingWordFrep.txt"
    // Read the contents of the text file
    val txtData = sparkContext.textFile(txtFile)
    // Cache the text RDD
    txtData.cache()
    // Count the lines (this also materializes the cached RDD)
    txtData.count()
    // Split on spaces and count word frequencies
    val wcData = txtData.flatMap { line => line.split(" ") }
      .map(word => (word, 1))
      .reduceByKey(_ + _)
    // Collect the result to the driver and print it
    wcData.collect() // returns an array
      .foreach(e => println(e))
    sparkContext.stop()
  }
}
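To make the flatMap / map / reduceByKey pipeline concrete, here is a minimal plain-Scala sketch of the same counting logic, usable as a local cross-check of the Spark output. The object name WordFreqCheck is hypothetical and not part of the original project:

import scala.io.Source

object WordFreqCheck { // hypothetical cross-check, not part of the original project
  def main(args: Array[String]): Unit = {
    val source = Source.fromFile("D://jar/sparkStreamingWordFrep.txt") // same file as above
    try {
      source.getLines()                  // one element per line
        .flatMap(_.split(" "))           // split on single spaces, like the Spark job
        .toSeq                           // materialize the iterator
        .groupBy(identity)               // group equal words together
        .map { case (word, hits) => (word, hits.size) } // word -> frequency
        .foreach(println)
    } finally source.close()
  }
}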
4: Run the program in IntelliJ IDEA; the word-count output is as follows:
(learn,3)
(this,1)
(is,1)
(can,4)
(big,3)
(data,3)
(,1)
(want,1)
(it,3)
(spark,3)
(process,3)
(a,1)
(streming,3)
(processing,1)
(sparkStreaming,1)
(I,4)
(to,1)
(of,1)
(use,3)
(the,1)
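Note the (,1) entry: String.split(" ") produces an empty-string token wherever the input has a leading or doubled space, and that empty string is then counted like any other word. If that entry is unwanted, one option (a possible adjustment, not part of the original code) is to split on runs of whitespace and drop empty tokens:

    // Stricter tokenization for the wcData pipeline above
    val wcData = txtData.flatMap(_.split("\\s+")) // split on runs of whitespace
      .filter(_.nonEmpty)                         // drop empty tokens (e.g. from a leading space)
      .map(word => (word, 1))
      .reduceByKey(_ + _)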