Spark RDD Programming Examples

1. Filtering barrage (danmaku) lines that contain an exclamation mark

Because the exclamation mark may be either the halfwidth (!) or the fullwidth (!) character, the filter needs an ||.

// Keep barrage lines containing an exclamation mark, halfwidth or fullwidth
val lines = sc.textFile("file:///root/Desktop/barrage.json")

val lines_after = lines.filter(line => line.contains("!") || line.contains("!"))

// Strip whitespace and the surrounding extra characters
// (the leading quote and the trailing quote-plus-comma)
val result = lines_after.map(x => x.trim.substring(1, x.trim.length - 2))

// Save to a file
result.saveAsTextFile("file:///root/Desktop/result/result1")
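The filter-and-trim logic above can be sketched with plain Scala collections, which share map/filter with the RDD API. The sample lines below are made up for illustration; real input would come from barrage.json:

```scala
// Hypothetical barrage lines in the JSON file's quoted, comma-terminated form
val lines = List("\"hello!\",", "\"nice play\",", "\"哇!666\",")

// Keep lines containing either a halfwidth or a fullwidth exclamation mark
val lines_after = lines.filter(line => line.contains("!") || line.contains("!"))

// Drop the leading quote and the trailing quote-plus-comma,
// mirroring substring(1, length - 2)
val result = lines_after.map(x => x.trim.substring(1, x.trim.length - 2))

result.foreach(println) // prints: hello!  then  哇!666
```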

2. File sorting

// HashPartitioner must be imported before it can be used
import org.apache.spark.HashPartitioner

val readline = sc.textFile("file:///disk4/bigdata")

val result = readline.filter(x => x.trim().length > 0)
  .map(x => (x.toInt, ""))
  .partitionBy(new HashPartitioner(1)) // merge into one partition so the output is a single, globally sorted file
  .sortByKey()
  .map(x => x._1)

result.collect.foreach(println)

result.saveAsTextFile("file:///disk4/bigdata2")
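The sorting pipeline can be mirrored locally with a plain Scala list (partitioning aside, since local collections have no partitions). The input lines here are hypothetical:

```scala
// Hypothetical file contents: one number per line, plus some blank lines
val readline = List("33", "", "7", " ", "102", "5")

// Mirror the RDD pipeline: drop blank lines, parse to Int, sort by the key
val result = readline
  .filter(x => x.trim().length > 0)
  .map(x => (x.toInt, ""))
  .sortBy(_._1) // plays the role of sortByKey() in the RDD version
  .map(x => x._1)

println(result) // prints: List(5, 7, 33, 102)
```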

3. Finding the maximum value

val readline = sc.textFile("hdfs://node1:9000/bigdata")

// take(1) is an action, so result is a local Array[Int], not an RDD
val result = readline.map(x => (x.toInt, x)).sortByKey(false).take(1).map(x => x._2.toInt)

for (x <- result) { println(x) }

// Parallelize the local array back into an RDD so it can be saved
val rdd = sc.parallelize(result)

rdd.saveAsTextFile("file:///disk/ssss")
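The descending-sort-then-take-one trick can be sketched with a local list as well; `sortBy` with a reversed ordering stands in for `sortByKey(false)`, and the input values are invented for the example:

```scala
// Hypothetical input lines; the real data comes from HDFS
val readline = List("12", "89", "5", "47")

// Sort descending, then keep the first (largest) element, as take(1) does
val result = readline
  .map(x => (x.toInt, x))
  .sortBy(_._1)(Ordering[Int].reverse) // stands in for sortByKey(false)
  .take(1)
  .map(x => x._2.toInt)

for (x <- result) { println(x) } // prints: 89
```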

Think about why example 2 needs to merge the data into a single partition while example 3 does not. (Hint: saveAsTextFile writes one output file per partition, whereas take(1) already collects its result back to the driver.)

 

4. Processing the values of key-value pairs

1) mapValues

val text = sc.parallelize(Array(("a",1),("b",2)))
val result = text.mapValues(x => x+1)

2) map

val text = sc.parallelize(Array(("a",1),("b",2)))
val result = text.map(x => (x._1,x._2+1))
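The two approaches produce the same pairs, which a local sketch with a Scala Map can confirm (on newer Scala versions mapValues returns a lazy view, hence the .toMap):

```scala
val text = Map("a" -> 1, "b" -> 2)

// mapValues keeps the keys and transforms only the values
val viaMapValues = text.mapValues(x => x + 1).toMap

// map receives the whole (key, value) pair and must rebuild it
val viaMap = text.map(x => (x._1, x._2 + 1))

println(viaMapValues == viaMap) // prints: true
```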

 
