Spark RDD Word Frequency Count on a Text File

1. Prepare the text file sparkStreamingWordFrep.txt with the following content:

this is a processing of the sparkStreaming data learn use I can process spark it big streming
data learn use I can process spark it big streming 
to want I can  data learn use I can process spark it big streming
 

2. Set up a Maven-managed project


The relevant sections of the pom.xml are:

    <properties>
        <scala.version>2.11.8</scala.version>
    </properties>

    <repositories>
        <repository>
            <id>repos</id>
            <name>Repository</name>
            <url>http://maven.aliyun.com/nexus/content/groups/public</url>
        </repository>
        <repository>
            <id>scala-tools.org</id>
            <name>Scala-Tools Maven2 Repository</name>
            <url>http://scala-tools.org/repo-releases</url>
        </repository>
    </repositories>

    <pluginRepositories>
        <pluginRepository>
            <id>repos</id>
            <name>Repository</name>
            <url>http://maven.aliyun.com/nexus/content/groups/public</url>
        </pluginRepository>
        <pluginRepository>
            <id>scala-tools.org</id>
            <name>Scala-Tools Maven2 Repository</name>
            <url>http://scala-tools.org/repo-releases</url>
        </pluginRepository>
    </pluginRepositories>

    <dependencies>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.11</artifactId>
            <version>2.4.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_2.11</artifactId>
            <version>2.4.0</version>
        </dependency>
        <dependency>
            <groupId>org.scala-lang</groupId>
            <artifactId>scala-library</artifactId>
            <version>${scala.version}</version>
        </dependency>
    </dependencies>

    <build>
        <sourceDirectory>src/main/scala</sourceDirectory>
        <testSourceDirectory>src/test/scala</testSourceDirectory>
        <plugins>
            <plugin>
                <groupId>org.scala-tools</groupId>
                <artifactId>maven-scala-plugin</artifactId>
                <version>2.15.2</version>
                <executions>
                    <execution>
                        <id>scala-compile-first</id>
                        <goals>
                            <goal>compile</goal>
                        </goals>
                        <configuration>
                            <includes>
                                <include>**/*.scala</include>
                            </includes>
                        </configuration>
                    </execution>
                    <execution>
                        <id>scala-test-compile</id>
                        <goals>
                            <goal>testCompile</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>

            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>3.1</version>
                <configuration>
                    <source>1.8</source>
                    <target>1.8</target>
                </configuration>
            </plugin>

            <plugin>
                <artifactId>maven-assembly-plugin</artifactId>
                <configuration>
                    <appendAssemblyId>false</appendAssemblyId>
                    <descriptorRefs>
                        <descriptorRef>jar-with-dependencies</descriptorRef>
                    </descriptorRefs>
                    <archive>
                        <manifest>
                            <mainClass>org.jy.data.yh.bigdata.drools.scala.sparkstreaming.SparkStreamingWordFreq</mainClass>
                        </manifest>
                    </archive>
                </configuration>
                <executions>
                    <execution>
                        <id>make-assembly</id>
                        <phase>package</phase>
                        <goals>
                            <goal>single</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-jar-plugin</artifactId>
                <version>2.4</version>
                <configuration>
                    <archive>
                        <manifest>
                            <addClasspath>true</addClasspath>
                            <mainClass>org.jy.data.yh.bigdata.drools.scala.sparkstreaming.SparkStreamingWordFreq</mainClass>
                        </manifest>
                    </archive>
                </configuration>
            </plugin>
        </plugins>
    </build>
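With maven-assembly-plugin bound to the package phase as above, running mvn clean package should produce a runnable jar-with-dependencies whose manifest main class is SparkStreamingWordFreq. For running inside the IDE, as in step 4 below, this packaging step is not required.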

3. The complete code:

package org.jy.data.yh.bigdata.drools.scala.sparkstreaming

import org.apache.spark.{SparkConf, SparkContext}

/**
 * Word count over a text file with Spark: read the file, split each line on spaces,
 * map every word to a <word, 1> key/value pair, and finally reduce by key to total
 * up the number of occurrences of each word.
 *
 * flatMap splits each input line on spaces and flattens the results into a single
 * collection of words; map then turns each word into a (word, 1) pair, and
 * reduceByKey merges the pairs per word, yielding the frequency of each word.
 * (Note: this step is distributed across all Workers in the cluster, i.e. each
 * Worker may compute only a small part of the result.)
 */
object SparkStreamingWordFreq {
  def main(args: Array[String]): Unit = {
    // Create the SparkConf object
    val sparkConf = new SparkConf()
    sparkConf.setAppName("SparkStreamingWordFreq") // Application name shown in the Spark Web UI
    sparkConf.setMaster("local[*]") // Run in local mode
    // Create the SparkContext object
    val sparkContext = new SparkContext(sparkConf)
    // The data source is a text file
    val txtFile = "D://jar/sparkStreamingWordFrep.txt"
    // Read the contents of the text file
    val txtData = sparkContext.textFile(txtFile)
    // Cache the RDD so it is not re-read from disk
    txtData.cache()
    // count() is an action: it triggers the read and materializes the cache
    txtData.count()
    // Split on spaces and count the frequency of each word
    val wcData = txtData.flatMap { line => line.split(" ") }
      .map(word => (word, 1))
      .reduceByKey(_ + _)
    // Collect the RDD to the driver as an array and print each pair
    wcData.collect()
      .foreach(e => println(e))
    sparkContext.stop()
  }
}
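The collected output is unordered, since it reflects how the pairs were partitioned. To print the words by descending frequency instead, a minimal variant (a sketch building on the wcData RDD above; sortBy is a standard RDD method) is:

      // Sort the (word, count) pairs by count, highest first, then print on the driver
      wcData.sortBy(_._2, ascending = false)
        .collect()
        .foreach(println)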

4. Running the program in IntelliJ IDEA prints the following counts:

(learn,3)
(this,1)
(is,1)
(can,4)
(big,3)
(data,3)
(,1)
(want,1)
(it,3)
(spark,3)
(process,3)
(a,1)
(streming,3)
(processing,1)
(sparkStreaming,1)
(I,4)
(to,1)
(of,1)
(use,3)
(the,1)
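Note the (,1) entry: the double space in the third input line makes split(" ") produce an empty token, which is then counted like any other word. If that is unwanted, a small tweak (a sketch, not part of the original program) is to drop blank strings after splitting:

      val wcData = txtData.flatMap { line => line.split(" ") }
        .filter(_.nonEmpty) // drop empty tokens produced by consecutive spaces
        .map(word => (word, 1))
        .reduceByKey(_ + _)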

 
