1、WordCount test: run the following inside spark-shell
val text_file = sc.textFile("hdfs://hadoop1:8020/ai/README.txt")
val counts = text_file.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://hadoop1:8020/ai/wordcount")    Note: wordcount is the directory that will hold the result
Shorthand form of WordCount: scala> val wordCount = text_file.flatMap(_.split(' ')).map((_, 1)).reduceByKey(_ + _)
wordCount.saveAsTextFile("hdfs://hadoop1:8020/ai/wordcount")    Note: the target directory must not already exist
View the result: scala> wordCount.collect
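For reference, the same job can also be packaged as a standalone Spark application instead of being typed into spark-shell. The following is only a minimal sketch: the application name and the output path wordcount_app are assumptions, not part of the original notes, and must be adapted to the actual cluster.

import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("WordCount")   // assumed application name
    val sc = new SparkContext(conf)
    val counts = sc.textFile("hdfs://hadoop1:8020/ai/README.txt")
      .flatMap(_.split(" "))      // split each line into words
      .map((_, 1))                // pair every word with a count of 1
      .reduceByKey(_ + _)         // sum the counts per word
    counts.saveAsTextFile("hdfs://hadoop1:8020/ai/wordcount_app")   // assumed output directory; it must not exist yet
    sc.stop()
  }
}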
Count the total number of lines: val rows = sc.textFile("/ai/README.txt").count
Count the total number of words (wordSum): val wordSum = sc.textFile("hdfs://hadoop1:8020/ai/README.txt").map(_.split(" ").size).reduce(_ + _)
Number of words in each line: val rowSum = sc.textFile("/ai/README.txt").map(_.split(" ").size)
Largest number of words in a single line: val maxRow = sc.textFile("/ai/README.txt").map(_.split(" ").size).reduce((a, b) => if (a > b) a else b)
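Since these three statistics all read the same file, they can also be computed from one cached RDD. This is a minimal sketch under the same /ai/README.txt path assumption; cache() keeps the per-line word counts in memory so the file is only scanned once:

// compute line count, total words and the longest line from one cached RDD
val wordsPerLine = sc.textFile("/ai/README.txt").map(_.split(" ").size).cache()
val rows    = wordsPerLine.count()                                  // number of lines
val wordSum = wordsPerLine.reduce(_ + _)                            // total number of words
val maxRow  = wordsPerLine.reduce((a, b) => if (a > b) a else b)    // most words in a single line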
Sort by word count and save the result to HDFS: scala> val wordCount = sc.textFile("/ai/README.txt").flatMap(_.split(' ')).map((_, 1)).reduceByKey(_ + _).map(x => (x._2, x._1)).sortByKey(false).map(x => (x._2, x._1)).saveAsTextFile("/ai/wordSorted")
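The two map calls around sortByKey only swap the key and value so the RDD can be sorted by count. On Spark 1.0 and later the same result can be written more directly with sortBy; the sketch below assumes the same input path, and the output directory /ai/wordSorted2 is made up for illustration (it must not already exist):

sc.textFile("/ai/README.txt")
  .flatMap(_.split(" "))
  .map((_, 1))
  .reduceByKey(_ + _)
  .sortBy(_._2, false)                  // sort by count, descending
  .saveAsTextFile("/ai/wordSorted2")    // assumed output directory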
Merge the generated part files into a single local file: hadoop fs -getmerge /ai/wordSorted /home/hadoop/wordCount.txt
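The merged file can then be inspected locally, for example with head /home/hadoop/wordCount.txt.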
The output looks like this:
[hadoop@hadoop2 ~]$ hadoop fs -ls /ai/wordcount
Found 2 items
-rw-r--r-- 3 hadoop supergroup 0 2015-05-11 20:55 /ai/wordcount/_SUCCESS    Note: this marker file indicates that the job succeeded
-rw-r--r-- 3 hadoop supergroup 1574 2015-05-11 20:55 /ai/wordcount/part-00000    Note: this file holds the actual result
2、RDD operations: common examples
1) Find the largest number of words in a single line: sc.textFile("/ai/README.md").map(_.split(" ").size).reduce((a, b) => if (a > b) a else b)
Another way is to use the Java Math class: import java.lang.Math, then sc.textFile("/ai/README.md").map(_.split(" ").size).reduce((a, b) => Math.max(a, b))
2) Count the number of lines in the HDFS file /ai/README.txt: scala> val count = sc.textFile("/ai/README.txt").count, which is equivalent to val count = sc.textFile("hdfs:///ai/README.txt").count and val count = sc.textFile("hdfs://hadoop1:8020/ai/README.txt").count
3) Show the lines that contain "hadoop": sc.textFile("/ai/README.txt").filter(_.contains("hadoop")).collect
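If only the number of matching lines is needed, collect can be replaced with count. A short sketch, assuming the same file (the extra "spark" filter is added purely for illustration):

val hadoopLines = sc.textFile("/ai/README.txt").filter(_.contains("hadoop")).count()   // lines containing "hadoop"
val sparkLines  = sc.textFile("/ai/README.txt").filter(_.contains("spark")).count()    // lines containing "spark"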