Spark Application Development - Data Cleaning
- Project repository: https://gitee.com/cch-bigdata/spark-process.git
Input data
order_number,order_date,purchaser,quantity,product_id,remark
10001,2016-01-16,1001,1,102,机q器w记e录r
10003,2016-01-17,1002,2,105,人工记录
10002,2016-01-19,1002,3,106,人工补录
10004,2016-02-21,1003,4,107,自然交易
10001,2016-01-16,1001,1,102,机器记录
Output data

Running the cleaner with clean_rules = Array("5") (remove letters) strips the stray Latin letters q, w, e, r out of the first remark, turning 机q器w记e录r into 机器记录:
+------------+-------------------+---------+--------+----------+--------+
|order_number| order_date|purchaser|quantity|product_id| remark|
+------------+-------------------+---------+--------+----------+--------+
| 10001|2016-01-16 00:00:00| 1001| 1| 102|机器记录|
| 10003|2016-01-17 00:00:00| 1002| 2| 105|人工记录|
| 10002|2016-01-19 00:00:00| 1002| 3| 106|人工补录|
| 10004|2016-02-21 00:00:00| 1003| 4| 107|自然交易|
| 10001|2016-01-16 00:00:00| 1001| 1| 102|机器记录|
+------------+-------------------+---------+--------+----------+--------+
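The loadCsv helper and the spark session come from AbstractTransform, which is not reproduced in this article (it lives in the repository linked above). A minimal sketch of what it could look like, assuming it simply wraps spark.read with a header row and schema inference, which would also explain why order_date comes back as a timestamp in the output above:

package com.cch.bigdata.spark.process

import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical sketch of the base class; see the repository for the real implementation.
abstract class AbstractTransform {

  def getAppName(): String

  def process(): Unit

  // Assumption: a local session is enough for the test driver shown below
  lazy val spark: SparkSession = SparkSession.builder()
    .appName(getAppName())
    .master("local[*]")
    .getOrCreate()

  // Reads a CSV with a header row and lets Spark infer column types;
  // with inferSchema, a value like 2016-01-16 is parsed into a timestamp.
  def loadCsv(path: String, spark: SparkSession): DataFrame = {
    spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv(path)
  }
}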
Program code
package com.cch.bigdata.spark.process.clean

import com.cch.bigdata.spark.process.AbstractTransform
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

class Cleaner extends AbstractTransform {

  // name: the column to clean; strategy: the rule code; symbol: punctuation to strip (strategy "3" only)
  case class CleanColumn(name: String, strategy: String, symbol: String)

  private val columns = Array("remark")    // columns to clean
  private val clean_rules = Array("5")     // one strategy code per column
  private val symbols = Array[String]()    // punctuation per column, only needed for strategy "3"

  override def process(): Unit = {
    if (columns.isEmpty) {
      throw new RuntimeException("The columns to clean must be configured!")
    }
    if (clean_rules.isEmpty) {
      throw new RuntimeException("The cleaning rules must be configured!")
    }

    var df: DataFrame = loadCsv("src/main/resources/csv/orders.csv", spark)

    columns.zipWithIndex.foreach { case (c, index) =>
      // Use the configured symbol for this column, or an empty string when none is set
      val symbol = if (index < symbols.length) symbols(index) else ""

      val cleanColumn = CleanColumn(c, clean_rules(index), symbol)
      if (cleanColumn.name.isEmpty) {
        throw new RuntimeException("The name of the column to clean is not configured!")
      }
      if (cleanColumn.strategy == null) {
        throw new RuntimeException("The cleaning strategy must be configured!")
      }
      if (cleanColumn.strategy == "3" && cleanColumn.symbol.isEmpty) {
        throw new RuntimeException("The strategy is punctuation removal, but no symbol is configured!")
      }

      cleanColumn.strategy match {
        case "1" => // remove all whitespace
          df = df.withColumn(cleanColumn.name, regexp_replace(col(cleanColumn.name), "\\s+", ""))
        case "2" => // trim leading and trailing whitespace
          df = df.withColumn(cleanColumn.name, trim(col(cleanColumn.name)))
        case "3" => // remove the configured punctuation, wrapped in a regex character class
          val symbolRegexp = "[\\\\" + cleanColumn.symbol + "]"
          df = df.withColumn(cleanColumn.name, regexp_replace(col(cleanColumn.name), symbolRegexp, ""))
        case "4" => // remove digits
          df = df.withColumn(cleanColumn.name, regexp_replace(col(cleanColumn.name), "\\d+", ""))
        case "5" => // remove letters
          df = df.withColumn(cleanColumn.name, regexp_replace(col(cleanColumn.name), "[a-zA-Z]+", ""))
        case "6" => // convert to lower case
          df = df.withColumn(cleanColumn.name, lower(col(cleanColumn.name)))
        case "7" => // convert to upper case
          df = df.withColumn(cleanColumn.name, upper(col(cleanColumn.name)))
        case "8" => // capitalize the first letter of each word
          df = df.withColumn(cleanColumn.name, initcap(col(cleanColumn.name)))
        case other =>
          throw new RuntimeException(s"Unknown cleaning strategy: $other")
      }
    }

    df.show()
  }

  override def getAppName(): String = "Data Cleaning"
}
package com.cch.bigdata.spark.process

import com.cch.bigdata.spark.process.clean.Cleaner

object CleanerTest {
  def main(args: Array[String]): Unit = {
    new Cleaner().process()
  }
}
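Each rule is an ordinary Java regular expression applied through regexp_replace, so its effect can be checked without starting Spark at all. A small standalone sketch (the sample strings are made up for illustration):

object CleanRuleDemo {
  def main(args: Array[String]): Unit = {
    // strategy 5: remove letters
    println("机q器w记e录r".replaceAll("[a-zA-Z]+", ""))  // 机器记录
    // strategy 1: remove all whitespace
    println(" a b  c ".replaceAll("\\s+", ""))            // abc
    // strategy 4: remove digits
    println("abc123".replaceAll("\\d+", ""))              // abc
    // strategy 3 with symbol ",": a character class matching a backslash or a comma
    println("a,b,c".replaceAll("[\\\\,]", ""))            // abc
  }
}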
Parameter explanation

- columns: names of the columns to clean, as a string array
- clean_rules: the cleaning strategy for each column, as a string array of rule codes:
  - 1 removes all whitespace
  - 2 trims leading and trailing whitespace
  - 3 removes punctuation; this strategy requires a matching symbols entry holding the characters to strip
  - 4 removes digits
  - 5 removes letters
  - 6 converts to lower case
  - 7 converts to upper case
  - 8 capitalizes the first letter of each word
- symbols: when a column's strategy is 3, this array must hold the punctuation to remove for that column; its indexes must line up with columns (see the sketch below)
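For example, to strip commas from the remark column with strategy 3, the three arrays in Cleaner would be configured like this (illustrative values):

private val columns = Array("remark")
private val clean_rules = Array("3")  // punctuation removal
private val symbols = Array(",")      // index 0 pairs with columns(0)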