Handling Array<struct> Columns in a Spark UDF


Without further ado, here is the code.

Data schema:

root
 |-- id: string (nullable = true)
 |-- offset: long (nullable = true)
 |-- tags: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- tagId: integer (nullable = true)
 |    |    |-- weight: double (nullable = true)

 

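// Imports needed by the code below
import scala.collection.mutable
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.udf
import spark.implicits._ // enables the $"col" syntax; assumes spark is the active SparkSession

// Define the function behind the UDF, used to add a new column.
// Each element of the array<struct> column arrives as a Row.
// newAdds and newMinus are collections of Int tag IDs defined elsewhere (not shown here).
val computeScoreFun: mutable.WrappedArray[Row] => Double = { tags =>
  // Build a tagId -> weight lookup from the struct elements
  val scoreMap = tags.map { row =>
    val tagId = row.getInt(row.fieldIndex("tagId"))
    val weight = row.getDouble(row.fieldIndex("weight"))
    tagId -> weight
  }.toMap
  // sum returns 0.0 for an empty collection, whereas reduce(_ + _) would throw
  val addScore = newAdds.map(id => scoreMap.getOrElse(id, 0.0)).sum
  val minusScore = newMinus.map(id => scoreMap.getOrElse(id, 0.0)).sum
  addScore - minusScore
}

val scoreUDF = udf(computeScoreFun)
// Add the new score column
spark.read.parquet(input).withColumn("score", scoreUDF($"tags"))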
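
The scoring logic inside the UDF can be checked without a Spark session. Below is a minimal, Spark-free sketch of it. It assumes newAdds and newMinus are lists of Int tag IDs, since the post does not show their definitions. The names ScoreSketch and computeScore, the sample tags, and the ID lists are all hypothetical. It sums with sum rather than reduce(_ + _), a choice made here so that an empty ID list yields 0.0 instead of an exception.

```scala
// Spark-free sketch of the scoring logic; every name and value here is illustrative.
object ScoreSketch {
  // tags: (tagId, weight) pairs standing in for the array<struct> column.
  // adds / minus: hypothetical stand-ins for newAdds / newMinus.
  def computeScore(tags: Seq[(Int, Double)], adds: Seq[Int], minus: Seq[Int]): Double = {
    val scoreMap = tags.toMap
    // IDs missing from the map contribute 0.0; sum on an empty list is 0.0
    val addScore = adds.map(id => scoreMap.getOrElse(id, 0.0)).sum
    val minusScore = minus.map(id => scoreMap.getOrElse(id, 0.0)).sum
    addScore - minusScore
  }

  def main(args: Array[String]): Unit = {
    val tags = Seq(1 -> 0.5, 2 -> 0.25, 3 -> 1.0)
    // weights of tags 1 and 3, minus the weight of tag 2
    println(computeScore(tags, adds = Seq(1, 3), minus = Seq(2)))
    // empty add list and an unknown tag ID in the minus list
    println(computeScore(tags, adds = Seq.empty, minus = Seq(9)))
  }
}
```

With the sample data, the first call evaluates 0.5 + 1.0 - 0.25 and the second evaluates 0.0 - 0.0. The same function body can then be wrapped in a UDF once the Row elements are converted to (tagId, weight) pairs.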
