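I ran into inexplicable errors when testing under different development environments, so here are the versions of an environment that is known to work: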
IntelliJ IDEA 2019.1.1 (Ultimate Edition)
JRE: 1.8.0_202-release-1483-b44 x86_64
JVM: OpenJDK 64-Bit Server VM by JetBrains s.r.o
macOS 10.14.4
Project Structure -> Libraries: spark-2.3.3-bin-hadoop2.7
Global Libraries: scala-2.11.11
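Suppose a table has two columns, name and sex, and a number of rows in which the same name can appear more than once. The first goal is to append a unique ID suffix to every name, producing a new column name_ID, so that each occurrence of a repeated name still has a value that identifies it uniquely. The second goal is to add a column rowNum that numbers the rows sharing the same name (1, 2, 3, ...), so the largest rowNum for a name tells you how many times that name occurs.
The program implements the suffix with an ordinary Scala function that is registered and called as a UDF. For the counter it uses the window function row_number() over (partition by name order by name), which gives the rows of each name an auto-incrementing number. Because the ordering key is the same as the partition key, the order of rows within one name is not defined; use order by sex instead if the numbering should follow the sex column.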
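To make the meaning of the two new columns concrete before involving Spark, here is a minimal plain-Scala sketch on an in-memory list. The object RowNumberSketch and the helpers appendIds and rowNumbers are hypothetical names for illustration only, not part of the program; rowNumbers counts occurrences in input order, which is just one of the orderings the window function could produce.

```scala
object RowNumberSketch {
  // Hypothetical helper: append a running ID (start + 1, start + 2, ...) to each name, in input order.
  def appendIds(names: Seq[String], start: Int = 1000): Seq[String] =
    names.zipWithIndex.map { case (n, i) => s"${n}_${start + i + 1}" }

  // Hypothetical helper: emulate row_number() over (partition by name).
  // Each row gets its 1-based position among the rows that share its name.
  def rowNumbers(names: Seq[String]): Seq[Int] = {
    val seen = scala.collection.mutable.Map.empty[String, Int].withDefaultValue(0)
    names.map { n =>
      seen(n) += 1 // count this occurrence of the name
      seen(n)
    }
  }

  def main(args: Array[String]): Unit = {
    val names = Seq("xuan", "lee", "lee", "jack", "lee")
    val ids = appendIds(names)
    val nums = rowNumbers(names)
    // Print name, name_ID and rowNum side by side
    names.indices.foreach(i => println(s"${names(i)}\t${ids(i)}\t${nums(i)}"))
  }
}
```

The Spark program does the same two things, but lets the SQL engine handle the grouping and numbering.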
import org.apache.spark.sql.types.{StringType, StructField, StructType}
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.{SparkConf, SparkContext}

object UDF {
  // Shared counter used to build the suffix. It only yields unique IDs because
  // everything here runs in a single JVM with one task thread (local mode).
  var AID = 1000

  // Called by the UDF to append a unique ID suffix to a name
  def appendID(s: String): String = {
    AID += 1
    s + "_" + AID
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local").setAppName("UDF")
    val ss = new SparkContext(conf)
    val spark = SparkSession.builder().config(conf).getOrCreate()

    // Mock test data with two fields: name and sex
    val testData = Array(
      ("xuan", "male"),
      ("lee", "male"),
      ("lee", "male"),
      ("lee", "Female"),
      ("lee", "Female"),
      ("jack", "Female"),
      ("john", "Female"),
      ("marry", "Female"),
      ("lee", "male"))

    // Parallelize the test data into an RDD of Row
    val testDataRDD = ss.parallelize(testData, 2).map(r => Row(r._1, r._2))

    // Schema that matches the test data
    val testSchema = StructType(Array(
      StructField("name", StringType, true),
      StructField("sex", StringType, true)))

    // Build a DataFrame from the RDD and the schema
    val testDF = spark.createDataFrame(testDataRDD, testSchema)

    // Expose the DataFrame as a temporary view
    testDF.createOrReplaceTempView("testTableView")

    // Register the UDF that appends the unique ID suffix
    spark.udf.register("appendID", (name: String) => appendID(name))

    // Run the SQL query, calling the UDF and the window function
    val resDF = spark.sql("select name,appendID(name) as name_ID,sex,row_number() over (partition by name order by name) as rowNum from testTableView")

    // Show the result
    resDF.show()
  }
}
+-----+----------+------+------+
| name|   name_ID|   sex|rowNum|
+-----+----------+------+------+
| jack| jack_1006|Female|     1|
| john| john_1007|Female|     1|
|marry|marry_1008|Female|     1|
|  lee|  lee_1002|  male|     1|
|  lee|  lee_1003|  male|     2|
|  lee|  lee_1004|Female|     3|
|  lee|  lee_1005|Female|     4|
|  lee|  lee_1009|  male|     5|
| xuan| xuan_1001|  male|     1|
+-----+----------+------+------+
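The counter AID in the program is a plain var on a singleton object, so the suffixes are only unique while everything runs in one JVM; on a cluster each executor would have its own copy. As a hedged alternative, not the method used above, the sketch below builds the same two columns with the DataFrame API, swapping the counter for Spark's built-in monotonically_increasing_id(). The object UDFAlternative and the helper addColumns are hypothetical names, and the generated IDs are unique but not consecutive, so the suffixes will not look like 1001, 1002, ...

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, concat_ws, monotonically_increasing_id, row_number}

object UDFAlternative {
  // Hypothetical helper: adds name_ID and rowNum to a DataFrame with columns name and sex.
  def addColumns(df: DataFrame): DataFrame = {
    // Number the rows of each name, ordered by sex; rows with the same
    // name and sex are still numbered in an unspecified order.
    val byName = Window.partitionBy("name").orderBy("sex")
    df
      // Assumption: any unique suffix is acceptable, not only 1001, 1002, ...
      .withColumn("name_ID",
        concat_ws("_", col("name"), monotonically_increasing_id().cast("string")))
      .withColumn("rowNum", row_number().over(byName))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local").appName("UDFAlternative").getOrCreate()
    import spark.implicits._

    // A small subset of the test data from the program above
    val df = Seq(("xuan", "male"), ("lee", "male"), ("lee", "Female")).toDF("name", "sex")
    addColumns(df).show()
    spark.stop()
  }
}
```

This keeps ID generation inside Spark, so it does not depend on state shared between tasks.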