Cleaning log data formats with Spark

Requirement: the original log format

183.136.128.154 - - [30/Jul/2016:10:56:24 +0800] "GET http://static.tx.wmpyol.com/play/play.html HTTP/1.1" 200 651 "-" "Go-http-client/1.1" Hit "C/200" Static "max-age=60" 0.115 59.49.85.145

The required log format

183.136.128.154 - - [30/Jul/2016:10:56:48 +0800] "GET /play/play.html HTTP/1.1" 200 651 "-" "Go-http-client/1.1" http://static.tx.wmpyol.com V1

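Splitting the URL

  // Splits an absolute URL such as http://host/path into its base (scheme + host) and the remaining path.
  // Returns the base URL when isbase == 1, otherwise the path.
  def subOfdata(url: String, isbase: Int): String = {
    val parts = url.split("/")                 // "http:", "", "host", ...
    val baseUrl = parts(0) + "//" + parts(2)   // e.g. http://static.tx.wmpyol.com
    val path = url.substring(baseUrl.length)   // e.g. /play/play.html
    if (isbase == 1) baseUrl else path
  }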
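As a quick check, the function can be run on the URL from the sample log line. The snippet below is a minimal sketch that repeats the same logic so it runs on its own (for example in the Scala REPL or spark-shell); the variable names `url`, `base` and `path` are only for illustration.

```scala
// Same logic as subOfdata above, repeated so the snippet is self-contained.
def subOfdata(url: String, isbase: Int): String = {
  val parts = url.split("/")
  val baseUrl = parts(0) + "//" + parts(2)
  if (isbase == 1) baseUrl else url.substring(baseUrl.length)
}

// URL taken from the sample log line.
val url = "http://static.tx.wmpyol.com/play/play.html"

val base = subOfdata(url, 1) // scheme + host: http://static.tx.wmpyol.com
val path = subOfdata(url, 2) // remaining path: /play/play.html

println(base)
println(path)
```

Note that this approach assumes an absolute URL with a scheme; a relative path such as /play/play.html would not split into the expected parts.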


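// Load the raw log file (replace the placeholder with the actual file URL).
val data = sc.textFile("file URL")

// Split each line on spaces; field 6 is the full request URL.
// Assumes every line has at least 12 space-separated fields.
data.map(_.split(" "))
  .map { x =>
    val head = x.slice(0, 6).mkString(" ")   // IP, "-", "-", timestamp, HTTP method
    val tail = x.slice(7, 12).mkString(" ")  // protocol, status, size, referer, user agent
    // Replace the full URL with its path, then append the base URL and the tag "V1".
    s"$head ${subOfdata(x(6), 2)} $tail ${subOfdata(x(6), 1)} V1"
  }
  .saveAsTextFile("/usr/local/spark/spark/work/jjj")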
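The per-line transformation can also be written as a plain function, which makes it possible to try it on the sample line without a cluster. This is a minimal sketch, not the original author's code: the names `splitUrl` and `cleanLine` are hypothetical, and returning `None` for lines that are too short or have no absolute URL is an assumption about how bad input should be handled.

```scala
// Hypothetical helper: split an absolute URL into (base URL, path).
// Returns None when the input does not look like scheme://host/...
def splitUrl(url: String): Option[(String, String)] = {
  val parts = url.split("/")
  if (parts.length < 3 || !parts(0).endsWith(":") || parts(1).nonEmpty) None
  else {
    val baseUrl = parts(0) + "//" + parts(2)
    Some((baseUrl, url.substring(baseUrl.length)))
  }
}

// Hypothetical helper: rewrite one raw log line into the required format.
def cleanLine(line: String): Option[String] = {
  val x = line.split(" ")
  if (x.length < 12) None // assumption: skip lines with too few fields
  else splitUrl(x(6)).map { case (baseUrl, path) =>
    val head = x.slice(0, 6).mkString(" ")  // IP, "-", "-", timestamp, HTTP method
    val tail = x.slice(7, 12).mkString(" ") // protocol, status, size, referer, user agent
    s"$head $path $tail $baseUrl V1"
  }
}

// The original sample line from the top of the post.
val sample = """183.136.128.154 - - [30/Jul/2016:10:56:24 +0800] "GET http://static.tx.wmpyol.com/play/play.html HTTP/1.1" 200 651 "-" "Go-http-client/1.1" Hit "C/200" Static "max-age=60" 0.115 59.49.85.145"""

val cleaned = cleanLine(sample)
println(cleaned.getOrElse("line skipped"))

// In Spark this could be applied as: data.flatMap(line => cleanLine(line))
```

Like the original, this splits on single spaces, so it only works when the user agent itself contains no spaces (as with "Go-http-client/1.1"); a user agent such as a browser string would shift the field positions.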
