Grades from a university computer science department, in the following data format:
Tom,DataBase,80
Tom,Algorithm,50
Tom,DataStructure,60
Jim,DataBase,90
Jim,Algorithm,60
Jim,DataStructure,80
……
Using the given experiment data, write code in spark-shell to compute the following:
(1) How many students are there in the department in total?
val lines = sc.textFile("/test/Data1.txt")        // load the file
val par = lines.map(row => row.split(",")(0))     // take the first field (the student name)
val distinct_par = par.distinct()                 // deduplicate names
distinct_par.count                                // number of students
(2) What is Tom's average score across all his courses?
val lines = sc.textFile("/test/Data1.txt")                   // load the file
val pare = lines.filter(row => row.split(",")(0) == "Tom")   // keep only Tom's records
pare.foreach(println)                                        // print the matched records
/* Tom,DataBase,26
   Tom,Algorithm,12
   Tom,OperatingSystem,16
   Tom,Python,40
   Tom,Software,60 */
pare.map(row => (row.split(",")(0), row.split(",")(2).toInt))
    .mapValues(x => (x, 1))                                  // pair each score with a count of 1
    .reduceByKey((x, y) => (x._1 + y._1, x._2 + y._2))       // sum the scores and the counts
    .mapValues(x => x._1 / x._2)                             // integer division truncates the average
    .collect()
// res13: Array[(String, Int)] = Array((Tom,30))
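The sum-and-count trick above can be sanity-checked locally without Spark. A minimal sketch on an in-memory list (the names and scores below are taken from the sample data at the top, not from the experiment file):

```scala
// Simulate the mapValues(x => (x, 1)).reduceByKey(...) averaging pattern
// on a plain Scala collection.
object AvgDemo extends App {
  val records = List(("Tom", 80), ("Tom", 50), ("Tom", 60),
                     ("Jim", 90), ("Jim", 60), ("Jim", 80))

  val averages = records
    .map { case (name, score) => (name, (score, 1)) }   // pair each score with a count of 1
    .groupBy(_._1)                                      // stands in for the shuffle by key
    .map { case (name, xs) =>
      val (total, count) = xs.map(_._2)
        .reduce((a, b) => (a._1 + b._1, a._2 + b._2))   // sum scores and counts, as reduceByKey does
      (name, total / count)                             // integer division, as in the Spark answer
    }

  println(averages)   // Tom: 190 / 3 = 63, Jim: 230 / 3 = 76
}
```

Because both components are `Int`, the division truncates; map over `toDouble` scores instead if fractional averages are wanted.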
(4) How many courses does each student take?
val lines = sc.textFile("/test/Data1.txt")
val pare = lines.map(row => (row.split(",")(0), row.split(",")(1)))   // (student, course)
pare.mapValues(x => (x, 1))
    .reduceByKey((x, y) => (" ", x._2 + y._2))   // keep only the running count
    .mapValues(x => x._2)                        // (student, course count)
    .foreach(println)
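The same per-key counting can be checked on a plain Scala collection. A small sketch (the rows below are made up from the sample data, not the experiment file):

```scala
// Count records per student, mirroring the reduceByKey counting above.
object CountDemo extends App {
  val rows = List(("Tom", "DataBase"), ("Tom", "Algorithm"), ("Tom", "DataStructure"),
                  ("Jim", "DataBase"), ("Jim", "Algorithm"))

  // Group by student name and take the group size as the course count.
  val courseCounts = rows.groupBy(_._1).map { case (name, xs) => (name, xs.size) }

  courseCounts.foreach(println)   // Tom takes 3 courses, Jim takes 2
}
```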
(5) How many students in the department take the DataBase course?
val pare = lines.filter(row => row.split(",")(1) == "DataBase")   // reuses `lines` from above
pare.count
(6) What is the average score for each course?
val par = lines.map(row => (row.split(",")(1), row.split(",")(2).toInt))   // (course, score)
par.mapValues(x => (x, 1))
   .reduceByKey((x, y) => (x._1 + y._1, x._2 + y._2))   // sum scores and counts per course
   .mapValues(x => x._1 / x._2)                         // integer division truncates
   .collect()
(7) Use an accumulator to count how many students take the DataBase course.
val pare = lines.filter(row => row.split(",")(1) == "DataBase")
                .map(row => (row.split(",")(1), 1))    // emit (course, 1) per enrollment
val accum = sc.longAccumulator("My Accumulator")       // driver-side counter
pare.values.foreach(x => accum.add(x))                 // each executor adds 1 per record
accum.value                                            // total enrollments