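Requirement: count how many times each word occurs in a collection, then print the three most frequent words.
The collection has the following form: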
val stringList: List[String] = List( "hello world", "beautiful city", "I a chinese", "you is a good man", "I is a kind woman", "thanks for you", "forgive you", "hello world,I am student in a beautiful city", "whatever I will keep happy" )
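1. Split each line on spaces to get its individual words. At this point each line yields an array of strings.
2. Flatten the string arrays into a single list of words.
3. Group identical words together.
4. Count the words in each group, then sort by count and take the top three.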
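Step 1 splits on a single space only, so a line such as "hello world,I am student in a beautiful city" produces the token "world,I" rather than "world" and "I". The following is a minimal sketch of one way to split on commas as well; the `tokenize` helper and its regular expression are assumptions made for illustration and are not part of the original solution.

```scala
object TokenizeSketch {
  // Hypothetical helper: split on runs of whitespace and/or commas,
  // then drop any empty token left by a leading delimiter.
  def tokenize(line: String): List[String] =
    line.split("[\\s,]+").toList.filter(_.nonEmpty)

  def main(args: Array[String]): Unit = {
    val line = "hello world,I am student"
    println(line.split(" ").toList) // List(hello, world,I, am, student)
    println(tokenize(line))         // List(hello, world, I, am, student)
  }
}
```

Whether punctuation should separate words depends on the data. The code that follows keeps the plain space split.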
// Simple version (word count): count how many times each word occurs in the collection
val stringList: List[String] = List(
"hello world",
"beautiful city",
"I a chinese",
"you is a good man",
"I is a kind woman",
"thanks for you",
"forgive you",
"hello world,I am student in a beautiful city",
"whatever I will keep happy"
)
// 1. Split each line into words on spaces
val stringSplit: List[Array[String]] = stringList.map((strings) => {
strings.split(" ")
})
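println(stringSplit.map(_.toList)) // convert each Array to a List so the words are printed rather than the array's identity hash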
// 2. Flatten into a single list of words
val stringListFlatten: List[String] = stringSplit.flatten
println(stringListFlatten)
// 3. Group identical words together with groupBy
val stringGroupBy: Map[String, List[String]] = stringListFlatten.groupBy((word) => word)
println(stringGroupBy)
// 4. Count the elements in each group and return the result as a Map
val stringCountList: Map[String, Int] = stringGroupBy.map((kv) => (kv._1, kv._2.size))
// 5. Sort by count in descending order and take the first three
val toList = stringCountList.toList
val resultList = toList.sortWith((x, y) => {
x._2 > y._2
}).take(3)
println(resultList)
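The five steps above can also be chained into a single expression. The following is a minimal sketch under that assumption; the `topWords` helper and its parameter names are hypothetical, and `flatMap` is used here to combine the split and flatten steps.

```scala
object TopWordsSketch {
  // Hypothetical helper: returns the n most frequent words as (word, count) pairs.
  def topWords(lines: List[String], n: Int): List[(String, Int)] =
    lines
      .flatMap(_.split(" "))                                   // steps 1 and 2: split, then flatten
      .groupBy(identity)                                       // step 3: word -> all its occurrences
      .map { case (word, occurrences) => (word, occurrences.size) } // step 4: count
      .toList
      .sortBy { case (_, count) => -count }                    // step 5: descending by count
      .take(n)

  def main(args: Array[String]): Unit = {
    val lines = List("hello world", "hello scala", "hello spark scala")
    topWords(lines, 2).foreach(println) // (hello,3) then (scala,2)
  }
}
```

When several words share the same count, their relative order depends on the iteration order of the intermediate Map, so the entries returned at a tie boundary are not guaranteed.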
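The complex version differs from the simple one in the format of its collection, shown below: the number is how many times the string is repeated. Two solutions are given here.
val tupleList: List[(String, Int)] = List((("Hello Scala Spark World"), 7), (("Hello Scala"), 3), (("Hello china"), 5))
The first solution converts the tuples into the simple-version format shown below by repeating each string the given number of times; the remaining steps are the same as in the simple version: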
val tupleList: List[String] = List("hello world", "beautiful city")
// Method 1 (not general-purpose)
val tupleList: List[(String, Int)] = List((("Hello Scala Spark World"), 7), (("Hello Scala"), 3), (("Hello china"), 5))
tupleList.map((elem) => (elem._1 + " ") * elem._2)
.flatMap(_.split(" "))
.groupBy(word => word)
.map((kv) => (kv._1, kv._2.length))
.toList.sortWith(_._2 > _._2)
.take(3)
.foreach(println)
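The source does not say why method 1 is labelled not general-purpose. One plausible reason is that it materializes every repetition of every string, so the amount of intermediate data grows with the counts rather than with the number of distinct sentences. The sketch below only illustrates that difference in size; both helper names are hypothetical.

```scala
object ExpansionSizeSketch {
  val tupleList: List[(String, Int)] =
    List(("Hello Scala Spark World", 7), ("Hello Scala", 3), ("Hello china", 5))

  // Word tokens produced when each sentence is repeated `count` times (method 1).
  def expandedTokenCount(data: List[(String, Int)]): Long =
    data.map { case (sentence, count) => sentence.split(" ").length.toLong * count }.sum

  // (word, count) pairs produced when the count is carried next to each word instead.
  def pairedTokenCount(data: List[(String, Int)]): Int =
    data.map { case (sentence, _) => sentence.split(" ").length }.sum

  def main(args: Array[String]): Unit = {
    println(expandedTokenCount(tupleList)) // 4*7 + 2*3 + 2*5 = 44
    println(pairedTokenCount(tupleList))   // 4 + 2 + 2 = 8
  }
}
```

With counts in the millions the repeated strings would become very large, while the number of pairs stays the same.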
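The second method pairs every word in a tuple with that tuple's count, then sums the counts of identical words.
// Method 2: pair each word with its tuple's count, then add up the values that share the same key
// ("Hello Scala Spark World", 7)
// => ("Hello", 7), ("Scala", 7), ("Spark", 7), ("World", 7)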
val tupleList: List[(String, Int)] = List((("Hello Scala Spark World"), 7),
(("Hello Scala"), 3), (("Hello china"), 5))
val wordToCountList: List[(String, Int)] = tupleList.flatMap(t => {
val strings: Array[String] = t._1.split(" ")
strings.map(word => (word, t._2))
})
println(wordToCountList)
// Group the pairs by word
val wordGroupBy: Map[String, List[(String, Int)]] = wordToCountList.groupBy(_._1)
println(wordGroupBy)
// Collect the counts for each word into a list, e.g. "Hello" -> List(7, 3, 5)
val wordToCountMap = wordGroupBy.map(t => {
(t._1, t._2.map(t1 => t1._2))
})
val wordToTotalCountMap: Map[String, Int] = wordToCountMap.map(t => (t._1, t._2.sum))
println(wordToTotalCountMap)
wordToTotalCountMap
.toList
.sortWith(_._2 > _._2)
.take(3)
.foreach(println)
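The group, collect and sum steps of method 2 can also be folded into a single pass that accumulates totals in a Map. The following is a minimal sketch of that alternative, not the original author's code; `weightedCounts` is a hypothetical helper name.

```scala
object WeightedWordCountSketch {
  // Hypothetical helper: sum the tuple counts per word in one foldLeft pass.
  def weightedCounts(data: List[(String, Int)]): Map[String, Int] =
    data
      // Pair every word with the count of the tuple it came from.
      .flatMap { case (sentence, count) => sentence.split(" ").map(word => (word, count)) }
      // Add each pair's count to the running total for its word.
      .foldLeft(Map.empty[String, Int]) { case (totals, (word, count)) =>
        totals.updated(word, totals.getOrElse(word, 0) + count)
      }

  def main(args: Array[String]): Unit = {
    val tupleList: List[(String, Int)] =
      List(("Hello Scala Spark World", 7), ("Hello Scala", 3), ("Hello china", 5))
    weightedCounts(tupleList)
      .toList
      .sortBy { case (_, total) => -total } // descending by total
      .take(3)
      .foreach(println)
  }
}
```

For this input the totals are Hello 15, Scala 10, Spark 7, World 7 and china 5, so the third entry printed is either Spark or World depending on Map iteration order.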