Scala, the "advance sentry" of the Spark distributed big-data processing engine, feels a lot like working at the Linux command line, which is very comfortable for ops people. The test data is simple, just one English sentence (In the face of the Committee's threatened contempt vote); change the path to wherever your file lives and you can run the test program.
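If you would rather generate the test file from code than create it by hand, here is a minimal sketch (MakeNewsFile is a hypothetical helper, and /path/news.txt is the same placeholder path the program below uses; judging by the sample outputs in the comments, the line in news.txt starts with a double quote and ends with a period, which is why the program later strips both):

import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths}

object MakeNewsFile {
  def main(args: Array[String]): Unit = {
    // The leading double quote and trailing period are assumptions inferred
    // from the sample outputs in the WordCount comments below.
    val line = "\"In the face of the Committee's threatened contempt vote."
    Files.write(Paths.get("/path/news.txt"), line.getBytes(StandardCharsets.UTF_8))
  }
}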
package com.scala.practice

import scala.io.Source

object WordCount {

  def main(args: Array[String]): Unit = {
    /**
     * Read the file, take each line, and collect the lines into a list.
     * List("In the face of the Committee's threatened contempt vote.)
     */
    val lines = Source.fromFile("/path/news.txt").getLines().toList
    /**
     * Split each sentence on spaces into words, collected in a list.
     * List("In, the, face, of, the, Committee's, threatened, contempt, vote.)
     */
    val wordList = lines.flatMap(_.split(" "))
    /**
     * Pair each word with an initial count of 1.
     * List(("In,1), (the,1), (face,1), (of,1), (the,1), (Committee's,1), (threatened,1), (contempt,1), (vote.,1))
     */
    val countList = wordList.map((_, 1))
    /**
     * Group the pairs by word, so identical words land in the same list.
     * Map(Committee's -> List((Committee's,1)), threatened -> List((threatened,1)), contempt -> List((contempt,1)),
     * "In -> List(("In,1)), vote. -> List((vote.,1)), face -> List((face,1)), of -> List((of,1)), the -> List((the,1), (the,1)))
     */
    val grouping = countList.groupBy(_._1)
    /**
     * Aggregate within each group: sum the counts of identical words and strip
     * the stray punctuation with replace.
     * Map(Committee's -> 1, threatened -> 1, In -> 1, contempt -> 1, face -> 1, vote -> 1, of -> 1, the -> 2)
     */
    val aggregation = grouping.map(x => (x._1.replace(".", "").replace("\"", ""), x._2.map(_._2).sum))
    /**
     * Sort by word count in descending order.
     * List((the,2), (Committee's,1), (threatened,1), (In,1), (contempt,1), (face,1), (vote,1), (of,1))
     */
    val sorting = aggregation.toList.sortBy(-_._2) // sort the counts in descending order
    sorting.foreach(println)
    /**
     * Condensed one-liners: length works for counting, sum for adding; pick
     * whichever fits the situation.
     */
    lines.flatMap(_.split(" ")).map((_, 1)).groupBy(_._1).map(x => (x._1, x._2.length)).toList.sortBy(-_._2).foreach(println)
    lines.flatMap(_.split(" ")).map((_, 1)).groupBy(_._1).map(x => (x._1, x._2.map(_._2).sum)).toList.sortBy(-_._2).foreach(println)
  }
}
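On Scala 2.13 or later, the group-then-aggregate steps collapse into a single groupMapReduce call, and scala.util.Using closes the Source automatically (the examples above never close it). A sketch under those assumptions, keeping the same placeholder path; WordCountConcise is a name of my own choosing:

package com.scala.practice

import scala.io.Source
import scala.util.{Failure, Success, Using}

object WordCountConcise {

  def main(args: Array[String]): Unit = {
    // Using(...) closes the Source when the block finishes, success or failure.
    Using(Source.fromFile("/path/news.txt")) { source =>
      source.getLines()
        .flatMap(_.split(" "))
        .map(_.replace(".", "").replace("\"", "")) // strip the stray period and quote, as above
        .toList
        .groupMapReduce(identity)(_ => 1)(_ + _)   // key = word, map each occurrence to 1, reduce = sum
    } match {
      case Success(counts) => counts.toList.sortBy(-_._2).foreach(println)
      case Failure(e)      => e.printStackTrace()
    }
  }
}

groupMapReduce(identity)(_ => 1)(_ + _) produces the same counts as the groupBy-then-sum pipeline above, but it builds the map in a single pass instead of materializing the intermediate lists of pairs.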