Spark business development - data cleaning

  • Project repository: https://gitee.com/cch-bigdata/spark-process.git

Input data

order_number,order_date,purchaser,quantity,product_id,remark
10001,2016-01-16,1001,1,102,机q器w记e录r
10003,2016-01-17,1002,2,105,人工记录
10002,2016-01-19,1002,3,106,人工补录
10004,2016-02-21,1003,4,107,自然交易
10001,2016-01-16,1001,1,102,机器记录

Output data

+------------+-------------------+---------+--------+----------+--------+
|order_number|         order_date|purchaser|quantity|product_id|  remark|
+------------+-------------------+---------+--------+----------+--------+
|       10001|2016-01-16 00:00:00|     1001|       1|       102|机器记录|
|       10003|2016-01-17 00:00:00|     1002|       2|       105|人工记录|
|       10002|2016-01-19 00:00:00|     1002|       3|       106|人工补录|
|       10004|2016-02-21 00:00:00|     1003|       4|       107|自然交易|
|       10001|2016-01-16 00:00:00|     1001|       1|       102|机器记录|
+------------+-------------------+---------+--------+----------+--------+
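In the output, order_date prints as a timestamp, which suggests the CSV loader infers the schema. The loadCsv helper and the spark session come from the AbstractTransform base class, which is not listed in this post; the following is only a minimal sketch of what it might look like, assuming a local session with header and inferSchema options:

package com.cch.bigdata.spark.process

import org.apache.spark.sql.{DataFrame, SparkSession}

abstract class AbstractTransform {

  //the subclass supplies the application name shown in the Spark UI
  def getAppName(): String

  //a local session is assumed here; the real base class may configure it differently
  lazy val spark: SparkSession = SparkSession.builder()
    .appName(getAppName())
    .master("local[*]")
    .getOrCreate()

  //read a CSV with a header row; with inferSchema the order_date column is
  //parsed as a timestamp, which is why it prints as "2016-01-16 00:00:00"
  def loadCsv(path: String, spark: SparkSession): DataFrame = {
    spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv(path)
  }

  def process(): Unit
}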

Program code

package com.cch.bigdata.spark.process.clean

import com.cch.bigdata.spark.process.AbstractTransform
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

class Cleaner extends AbstractTransform{

  case class CleanColumn(name:String, strategy:String, symbol:String)

  //columns that need cleaning
  private val columns = Array("remark")

  //cleaning rule for each column ("5" = strip letters)
  private val clean_rules = Array("5")

  //punctuation characters, only needed when the rule is "3"
  private val symbols = Array[String]()

  override def process(): Unit = {
    if(columns.isEmpty){
      throw new RuntimeException("The list of columns to clean must not be empty!")
    }

    if(clean_rules.isEmpty){
      throw new RuntimeException("The cleaning rules must not be empty!")
    }

    //load the dataset
    var df: DataFrame = loadCsv("src/main/resources/csv/orders.csv",spark)

    var index = 0
    columns.foreach(c=>{
      //the symbol is optional; fall back to an empty string when it is not configured
      val symbol = if(index < symbols.length) symbols(index) else ""

      val cleanColumn: CleanColumn = CleanColumn(c,clean_rules(index),symbol)

      if(cleanColumn.name.isEmpty){
        throw new RuntimeException("The name of the column to clean is not configured!")
      }

      if(cleanColumn.strategy==null){
        throw new RuntimeException("The cleaning strategy must not be empty!")
      }

      if(cleanColumn.strategy=="3" && cleanColumn.symbol.isEmpty){
        throw new RuntimeException("The strategy is punctuation removal, but no punctuation characters are configured!")
      }

      cleanColumn.strategy match {
        case "1" =>
          //remove all whitespace
          df = df.withColumn(cleanColumn.name,regexp_replace(col(cleanColumn.name),"\\s+",""))

        case "2" =>
          //trim leading and trailing whitespace
          df = df.withColumn(cleanColumn.name,trim(col(cleanColumn.name)))

        case "3" =>
          //remove the configured punctuation characters
          val symbolRegexp = "[\\\\"+cleanColumn.symbol+"]"
          df = df.withColumn(cleanColumn.name,regexp_replace(col(cleanColumn.name),symbolRegexp,""))

        case "4" =>
          //remove digits
          df = df.withColumn(cleanColumn.name,regexp_replace(col(cleanColumn.name),"\\d+",""))

        case "5" =>
          //remove letters (\w also matches digits and underscores; use [a-zA-Z]+ for letters only)
          df = df.withColumn(cleanColumn.name,regexp_replace(col(cleanColumn.name),"\\w+",""))

        case "6" =>
          //convert to lower case
          df = df.withColumn(cleanColumn.name,lower(col(cleanColumn.name)))

        case "7" =>
          //convert to upper case
          df = df.withColumn(cleanColumn.name,upper(col(cleanColumn.name)))

        case "8" =>
          //initcap capitalizes the first letter of each word and lower-cases the rest
          df = df.withColumn(cleanColumn.name, initcap(col(cleanColumn.name)))
      }
      index+=1
    })

    df.show()
  }

  override def getAppName(): String = "Data cleaning"
}


package com.cch.bigdata.spark.process

import com.cch.bigdata.spark.process.clean.Cleaner

object CleanerTest {

  def main(args: Array[String]): Unit = {
    new Cleaner().process()
  }
}
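For rule "5", regexp_replace with "\\w+" is what strips the interleaved Latin letters out of 机q器w记e录r. Note that \w also matches digits and underscores, so a letters-only rule would use [a-zA-Z]+ instead. Below is a small standalone check of that behaviour; the object name and the second sample value are illustrative only:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object CleanRegexCheck {

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("clean-regex-check")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val df = Seq("机q器w记e录r", "abc123备注_x").toDF("remark")

    df.select(
      col("remark"),
      regexp_replace(col("remark"), "\\w+", "").as("strip_word_chars"),  //letters, digits and underscores
      regexp_replace(col("remark"), "[a-zA-Z]+", "").as("strip_letters") //Latin letters only
    ).show(false)
    //机q器w记e录r  -> 机器记录 in both columns
    //abc123备注_x -> 备注 with "\\w+", but 123备注_ with "[a-zA-Z]+"

    spark.stop()
  }
}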


Parameter explanation

  • columns: names of the columns to clean, as an array of strings
  • strategy: the cleaning rule for each column (the clean_rules array), as an array of strings
    • 1 removes all whitespace
    • 2 trims leading and trailing whitespace
    • 3 removes punctuation; in this case the corresponding symbol entry must hold the characters to strip
    • 4 removes digits
    • 5 removes letters
    • 6 converts to lower case
    • 7 converts to upper case
    • 8 capitalizes the first letter of each word
  • symbol: required when the strategy is 3; it holds the punctuation characters to remove, and its index must match the corresponding entry in columns (see the configuration sketch after this list)
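For example, to strip punctuation from remark and trim another column in the same run, the three arrays at the top of Cleaner would be configured position by position. This is a hypothetical configuration; customer_name is a made-up column that is not part of the sample data:

  //each column's rule and symbol sit at the same index as the column name
  private val columns     = Array("remark", "customer_name")   //customer_name is hypothetical
  private val clean_rules = Array("3",      "2")                //"3" = strip punctuation, "2" = trim
  private val symbols     = Array(",。",    "")                 //only the "3" entry needs a symbol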
