Learning Scala: A Small Word-Count Example

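Scala is the "advance guard" for Spark, the distributed big-data processing engine. Writing it feels a lot like working at a Linux command line, which makes it comfortable for anyone with an ops background. The test data is simple, a single English sentence (In the face of the Committee's threatened contempt vote). Judging by the sample output in the comments below, the line in the file starts with a double quote and ends with a period, which is why the aggregation step strips both. Change the path to wherever your own file lives and the program is ready to run.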
package com.scala.practice

import scala.io.Source

object WordCount {

  def main(args: Array[String]): Unit = {

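    /**
      * Read the file and collect its lines into a list
      * List("In the face of the Committee's threatened contempt vote.)
      */
    val source = Source.fromFile("/path/news.txt")
    // Close the file handle once all lines have been read
    val lines = try source.getLines().toList finally source.close()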

    /**
      * Split each sentence on spaces into words and flatten them into one list
      * List("In, the, face, of, the, Committee's, threatened, contempt, vote.)
      */
    val wordList = lines.flatMap(_.split(" "))

    /**
      * Pair every word with an initial count of 1
      * List(("In,1), (the,1), (face,1), (of,1), (the,1), (Committee's,1), (threatened,1), (contempt,1), (vote.,1))
      */
    val countList = wordList.map((_, 1))

    /**
      * Group the pairs by word, so identical words end up in the same list
      * Map(Committee's -> List((Committee's,1)), threatened -> List((threatened,1)), contempt -> List((contempt,1)),
      * "In -> List(("In,1)), vote. -> List((vote.,1)), face -> List((face,1)), of -> List((of,1)), the -> List((the,1), (the,1)))
      */
    val grouping = countList.groupBy(_._1)

    /**
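      * Aggregate within each group: sum the counts of identical words and strip stray punctuation with replace
      * Map(Committee's -> 1, threatened -> 1, In -> 1, contempt -> 1, face -> 1, vote -> 1, of -> 1, the -> 2)
      * Caveat: the cleanup runs after grouping, so keys that clean to the same word (e.g. vote. and vote) collide and only one count survives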
      */
    val aggregation = grouping.map(x => (x._1.replace(".", "").replace("\"", ""), x._2.map(_._2).sum))

    /**
      * Sort by word count in descending order
      * List((the,2), (Committee's,1), (threatened,1), (In,1), (contempt,1), (face,1), (vote,1), (of,1))
      */
    val sorting = aggregation.toList.sortBy(-_._2) // negate the count to sort from most to least frequent

    sorting.foreach(println)

    /**
      * Condensed one-liners: take each group's length, or sum its 1s; both give the same count here, so use whichever fits
      */
    lines.flatMap(_.split(" ")).map((_, 1)).groupBy(_._1).map(x => (x._1, x._2.length)).toList.sortBy(-_._2).foreach(println)
    lines.flatMap(_.split(" ")).map((_, 1)).groupBy(_._1).map(x => (x._1, x._2.map(_._2).sum)).toList.sortBy(-_._2).foreach(println)

  }

}
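The program above strips punctuation only after grouping, so tokens such as vote. and vote would be counted as different words first. As a minimal sketch of one way around that (not the original author's code), the variant below cleans each token before grouping. The object name WordCountNormalized, the helpers normalize and countWords, the regular expression, and the alphabetical tie-break are all assumptions made for illustration.

```scala
object WordCountNormalized {

  // Hypothetical helper: trim every character that is not a letter or digit
  // from both ends of a token, so "vote." and "\"In" become "vote" and "In".
  // Characters inside the token (the apostrophe in "Committee's") are kept.
  def normalize(token: String): String =
    token.replaceAll("^[^\\p{L}\\p{N}]+|[^\\p{L}\\p{N}]+$", "")

  // Count words across all lines, cleaning each token BEFORE grouping so that
  // punctuation variants of the same word land in the same group.
  def countWords(lines: Seq[String]): List[(String, Int)] =
    lines
      .flatMap(_.split("\\s+"))   // split on any run of whitespace
      .map(normalize)
      .filter(_.nonEmpty)         // drop tokens that were empty or pure punctuation
      .groupBy(identity)
      .map { case (word, occurrences) => (word, occurrences.size) }
      .toList
      .sortBy { case (word, count) => (-count, word) } // count descending, then word ascending

  def main(args: Array[String]): Unit = {
    // Same sentence as the test data, including the leading quote and trailing period
    val sample = List("\"In the face of the Committee's threatened contempt vote.")
    countWords(sample).foreach(println)
  }
}
```

Keeping the counting logic in a function that takes lines rather than a file path also makes it easy to try on in-memory data. The sketch keeps case as-is, like the original; lower-casing tokens inside normalize would be a further design choice.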

 
