Spark SQL User-Defined Functions (UDFs)

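Preface

The built-in Spark SQL operators and HiveQL cover most problems, but some are hard to express with the existing API alone. That is where user-defined functions (UDFs) come in.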

Preparing the dataset

1,tom,23
2,jack,24
3,lily,18
4,lucy,19
5,rose,16
6,james,23
7,kobe,24
8,white,18
9,black,20

Code

import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.SparkSession

import scala.util.Random

object Test {
  def main(args: Array[String]): Unit = {
    Logger.getLogger("org").setLevel(Level.ERROR)
    val spark = SparkSession.builder()
      .appName(name = this.getClass.getSimpleName)
      .master(master = "local[*]")
      .getOrCreate()

    import spark.sql
    import spark.implicits._
    val df_user = spark.read.textFile("./data/user")
      .map(_.split(","))
      .map(x => (x(0), x(1), x(2)))
      .toDF("id", "name", "age")
      .cache()

    //    spark.udf.register[String, String]("addPrefix", field => randomPrefixUDF(field))
    //    spark.udf.register[String, String]("rmPrefix", field => removePrefixUDF(field))

    spark.udf.register("addPrefix", (field: String) => randomPrefixUDF(field))
    spark.udf.register("rmPrefix", (field: String) => removePrefixUDF(field))

    df_user.createTempView(viewName = "view")
    val df_prefix = sql(sqlText = "select addPrefix(name) as pre_name  from view")
    df_prefix.show()
    df_prefix.createTempView(viewName = "pre_view")
    sql(sqlText = "select rmPrefix(pre_name) as name from pre_view").show()
    spark.stop()
  }

  /**
    * Prepends a random digit (0-9) and "_" to the value of a DataFrame field.
    *
    * @param field the field value
    * @return the value with the prefix added
    */
  def randomPrefixUDF(field: String): String = {
    val random = new Random()
    val prefix = random.nextInt(10)
    s"${prefix}_$field"
  }

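  /**
    * Removes the random "_" prefix that randomPrefixUDF added to a DataFrame field.
    *
    * @param field the prefixed field value
    * @return the value with the prefix removed
    */
  def removePrefixUDF(field: String): String = {
    // Split on the first "_" only, so values that contain "_" themselves stay intact.
    field.split("_", 2)(1)
  }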
}

Result

+--------+
|pre_name|
+--------+
|   5_tom|
|  3_jack|
|  3_lily|
|  0_lucy|
|  4_rose|
| 8_james|
|  9_kobe|
| 4_white|
| 8_black|
+--------+

+-----+
| name|
+-----+
|  tom|
| jack|
| lily|
| lucy|
| rose|
|james|
| kobe|
|white|
|black|
+-----+
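
The two helper functions can also be exercised without a Spark session, which is handy for a quick sanity check. The following is a minimal sketch in plain Scala; the object name `PrefixHelpers`, the optional `random` parameter, and the sample value `mary_ann` are illustrative assumptions, not part of the original program. It splits on the first underscore only, so a value that itself contains "_" survives the round trip.

```scala
import scala.util.Random

object PrefixHelpers {
  // Same idea as randomPrefixUDF: prepend a random digit 0-9 and an underscore.
  // The Random instance is a parameter (an assumption of this sketch) so callers can seed it.
  def addPrefix(field: String, random: Random = new Random()): String =
    s"${random.nextInt(10)}_$field"

  // Split on the first underscore only, so "3_mary_ann" yields "mary_ann".
  def rmPrefix(field: String): String = field.split("_", 2)(1)

  def main(args: Array[String]): Unit = {
    // "mary_ann" is a hypothetical value, added to show the underscore edge case.
    val names = Seq("tom", "jack", "mary_ann")
    val prefixed = names.map(addPrefix(_))
    val restored = prefixed.map(rmPrefix)
    println(prefixed.mkString(", "))
    println(restored.mkString(", "))
  }
}
```

With a plain `split("_")(1)`, the value `mary_ann` would come back as `mary`, which is why the limit argument is used here.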

Things to watch out for

If you register the UDFs like this:

spark.udf.register("addPrefix", field => randomPrefixUDF(field))
spark.udf.register("rmPrefix", field => removePrefixUDF(field))

compilation fails immediately:

Error:(29, 37) missing parameter type
    spark.udf.register("addPrefix", field => randomPrefixUDF(field))
    
Error:(30, 36) missing parameter type
    spark.udf.register("rmPrefix", field => removePrefixUDF(field))

So the parameter type has to be spelled out, either like this:

spark.udf.register("addPrefix", (field: String) => randomPrefixUDF(field))
spark.udf.register("rmPrefix", (field: String) => removePrefixUDF(field))

Or like this:

spark.udf.register[String, String]("addPrefix", field => randomPrefixUDF(field))
spark.udf.register[String, String]("rmPrefix", field => removePrefixUDF(field))
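
The reason for the error can be reproduced without Spark. In the sketch below, `register` is a hypothetical stand-in whose shape loosely mirrors the one-argument `spark.udf.register` (return type first, then argument type); it is not Spark's actual implementation. Because the argument type is itself a type parameter, the Scala 2 compiler has nothing to infer the lambda's parameter type from unless you annotate the lambda or pass the type arguments explicitly.

```scala
object RegisterInferenceDemo {
  // Hypothetical stand-in for spark.udf.register: R is the return type, A the argument type.
  // It simply hands the function back so the result can be called.
  def register[R, A](name: String, f: A => R): A => R = f

  // Fixed prefix instead of a random one, to keep the demo deterministic.
  def addPrefix(field: String): String = "0_" + field

  // Does not compile ("missing parameter type"): nothing tells the compiler what A is.
  // val bad = register("addPrefix", field => addPrefix(field))

  // Fix 1: annotate the lambda parameter, so A is known to be String.
  val byAnnotation: String => String =
    register("addPrefix", (field: String) => addPrefix(field))

  // Fix 2: pass the type arguments explicitly, so the lambda parameter type follows from A.
  val byTypeArgs: String => String =
    register[String, String]("addPrefix", field => addPrefix(field))

  def main(args: Array[String]): Unit = {
    println(byAnnotation("tom"))
    println(byTypeArgs("jack"))
  }
}
```

Both fixes give the compiler the same missing piece of information; which one to use is mostly a matter of taste.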

Afterword

UDFs like the two written here often come in handy when tuning Spark code.
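
The post does not say which tuning scenario it has in mind. One common use of a random prefix, assumed here, is key salting: a hot key is spread over several salted keys, aggregated partially, and then merged after the prefix is removed. The sketch below imitates that two-stage count with plain Scala collections rather than Spark; `SaltedCountDemo`, `saltedCount`, `buckets` and `seed` are all illustrative names.

```scala
import scala.util.Random

object SaltedCountDemo {
  // Prepend a random bucket number and "_" to the key (the "salt").
  def addPrefix(key: String, rnd: Random, buckets: Int): String =
    s"${rnd.nextInt(buckets)}_$key"

  // Remove the salt again; split on the first "_" only.
  def rmPrefix(salted: String): String = salted.split("_", 2)(1)

  def saltedCount(keys: Seq[String], buckets: Int = 10, seed: Long = 42L): Map[String, Int] = {
    val rnd = new Random(seed)
    // Stage 1: count per salted key, so one hot key is split into up to `buckets` groups.
    val partial: Map[String, Int] =
      keys.map(k => addPrefix(k, rnd, buckets))
        .groupBy(identity)
        .map { case (salted, group) => salted -> group.size }
    // Stage 2: strip the salt and sum the partial counts per original key.
    partial.toSeq
      .map { case (salted, count) => rmPrefix(salted) -> count }
      .groupBy(_._1)
      .map { case (key, parts) => key -> parts.map(_._2).sum }
  }

  def main(args: Array[String]): Unit = {
    println(saltedCount(Seq("tom", "tom", "tom", "jack")))
  }
}
```

The final counts do not depend on which salts were drawn; the salt only changes how the work is grouped in the first stage.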
