(11) Spark Core: Computing the Traffic for Each Domain

Contents of the raw file baidu.log (tab-separated; the seventh field is the domain, the eighth the traffic):

hello   world   spark   hadoop  hive    223.104.18.110  v1.go2yd.com    17168
world   spark   hello   mysql   sqoop   113.101.75.194  v2.go2yd.com    17222
spark   hello   mysql   hive    world   27.17.127.135   v2.go2yd.com    1556
......
......

Requirement 1: compute the traffic for each domain.
Steps:
1) Extract the domain and traffic fields from each log record.
2) Group by domain and sum the traffic.

import org.apache.spark.{SparkConf, SparkContext}

object test {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setAppName("test").setMaster("local[2]")
    val sc = new SparkContext(sparkConf)
    val lines = sc.textFile("file:///E:/BigDataSoftware/data/baidu.log")

    val splits = lines.map(x => {
      val tmp = x.split("\t")
      // Take the domain (7th field) and traffic (8th field); convert the traffic
      // to Long, otherwise "+" would concatenate strings instead of adding numbers.
      (tmp(6), tmp(7).toLong)
    })
    val results = splits.reduceByKey(_ + _)
    results.take(10).foreach(println)

    sc.stop()
  }
}
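
A note on the design choice: step 2 says "group and sum", yet the code uses reduceByKey rather than groupByKey. reduceByKey pre-aggregates within each partition (a map-side combine) before shuffling, so far less data crosses the network. A minimal sketch of the contrast, reusing the splits RDD defined above:

// Less efficient: groupByKey shuffles every (domain, traffic) pair
// and only then sums on the reduce side.
val viaGroup = splits.groupByKey().mapValues(_.sum)

// Preferred: reduceByKey sums within each partition first, shuffling
// only one partial sum per domain per partition.
val viaReduce = splits.reduceByKey(_ + _)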

If the traffic field contains dirty data such as a string, toLong throws a NumberFormatException inside the task and the whole job fails, so a single dirty record aborts the entire computation. The error looks like this:

java.lang.NumberFormatException: For input string: "10701str"
    at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
    at java.lang.Long.parseLong(Long.java:589)
    at java.lang.Long.parseLong(Long.java:631)
    at scala.collection.immutable.StringLike$class.toLong(StringLike.scala:276)
    at scala.collection.immutable.StringOps.toLong(StringOps.scala:29)
    at www.ruozedata.test$$anonfun$2.apply(test.scala:20)
.......

Fix: guard the conversion with a try/catch so dirty records do not kill the job:

import org.apache.spark.{SparkConf, SparkContext}

object test {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setAppName("test").setMaster("local[2]")
    val sc = new SparkContext(sparkConf)
    val lines = sc.textFile("file:///E:/BigDataSoftware/data/baidu.log")

    val splits = lines.map(x => {
      val tmp = x.split("\t")
      var traffic = 0L
      try {
        traffic = tmp(7).toLong
      } catch {
        // Records whose traffic field cannot be parsed are counted as 0.
        case e: Exception => traffic = 0L
      }
      (tmp(6), traffic)
    })
    val results = splits.reduceByKey(_ + _)
    results.take(10).foreach(println)

    sc.stop()
  }
}
Output:
------------------------------------------------
(v2.go2yd.com,38034)
(v4.go2yd.com,31969)
(v3.go2yd.com,44598)
(v1.go2yd.com,77995)
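
As an aside, the var plus try/catch pattern can be written more idiomatically with scala.util.Try; a minimal sketch that behaves the same as the map above:

import scala.util.Try

val splits = lines.map { x =>
  val tmp = x.split("\t")
  // getOrElse(0L) plays the role of the catch branch: dirty values become 0L.
  (tmp(6), Try(tmp(7).toLong).getOrElse(0L))
}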

There is still a problem with the results above, however: one record whose traffic field is padded with whitespace fails toLong, falls into the catch branch, and is counted as 0, so its traffic is lost. Adding a trim before the conversion fixes it:

import org.apache.spark.{SparkConf, SparkContext}

object test {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setAppName("test").setMaster("local[2]")
    val sc = new SparkContext(sparkConf)
    val lines = sc.textFile("file:///E:/BigDataSoftware/data/baidu.log")

    val splits = lines.map(x => {
      val tmp = x.split("\t")
      var traffic = 0L
      try {
        // trim strips the surrounding whitespace so padded values still parse.
        traffic = tmp(7).trim.toLong
      } catch {
        case e: Exception => traffic = 0L
      }
      (tmp(6), traffic)
    })
    val results = splits.reduceByKey(_ + _)
    results.take(10).foreach(println)

    sc.stop()
  }
}
Output:
------------------------------------------------
(v2.go2yd.com,38034)
(v4.go2yd.com,31969)
(v3.go2yd.com,51784)
(v1.go2yd.com,77995)

The previously missing traffic is now included: v3.go2yd.com rises from 44598 to 51784.
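
Zeroing out dirty records keeps every line in the computation but silently folds bad rows into the totals as 0. If dropping unparseable records entirely is acceptable, here is a sketch using flatMap with Try.toOption (the name cleanSplits is my own):

import scala.util.Try

// Try(...).toOption yields None for records whose traffic field cannot be
// parsed; flatMap discards those Nones, dropping the dirty records.
val cleanSplits = lines.flatMap { x =>
  val tmp = x.split("\t")
  Try((tmp(6), tmp(7).trim.toLong)).toOption
}
val results = cleanSplits.reduceByKey(_ + _)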
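
One last caveat: results.take(10) returns an arbitrary 10 elements, not the 10 highest-traffic domains. To rank domains by traffic, sort first; a minimal sketch:

// Sort by the traffic value in descending order, then take the top 10.
results.sortBy(_._2, ascending = false).take(10).foreach(println)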
