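Most problems can be solved with the built-in Spark SQL operators or with Hive SQL, but some are hard to express with the existing API alone. That is when a UDF (user-defined function) is needed.
The example below reads the file ./data/user, which holds one id,name,age record per line: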
1,tom,23
2,jack,24
3,lily,18
4,lucy,19
5,rose,16
6,james,23
7,kobe,24
8,white,18
9,black,20
import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.SparkSession
import scala.util.Random

object Test {
  def main(args: Array[String]): Unit = {
    Logger.getLogger("org").setLevel(Level.ERROR)
    val spark = SparkSession.builder()
      .appName(name = this.getClass.getSimpleName)
      .master(master = "local[*]")
      .getOrCreate()
    import spark.sql
    import spark.implicits._

    // Parse each "id,name,age" line into a three-column DataFrame.
    val df_user = spark.read.textFile("./data/user")
      .map(_.split(","))
      .map(x => (x(0), x(1), x(2)))
      .toDF("id", "name", "age")
      .cache()

    // Register both UDFs under the names used in the SQL below.
    // spark.udf.register[String, String]("addPrefix", field => randomPrefixUDF(field))
    // spark.udf.register[String, String]("rmPrefix", field => removePrefixUDF(field))
    spark.udf.register("addPrefix", (field: String) => randomPrefixUDF(field))
    spark.udf.register("rmPrefix", (field: String) => removePrefixUDF(field))

    df_user.createTempView(viewName = "view")
    val df_prefix = sql(sqlText = "select addPrefix(name) as pre_name from view")
    df_prefix.show()
    df_prefix.createTempView(viewName = "pre_view")
    sql(sqlText = "select rmPrefix(pre_name) as name from pre_view").show()
    spark.stop()
  }

  /**
   * Prepends a random prefix, a digit from 0 to 9 followed by "_", to a field value of the DataFrame.
   *
   * @param field the field value
   * @return the value with the prefix added
   */
  def randomPrefixUDF(field: String): String = {
    val random = new Random()
    val prefix = random.nextInt(10)
    s"${prefix}_$field"
  }

  /**
   * Removes the random "_" prefix that randomPrefixUDF added to a field value.
   *
   * @param field the prefixed field value
   * @return the value with the prefix removed
   */
  def removePrefixUDF(field: String): String = {
    // Split only at the first "_" so that underscores inside the value survive.
    field.split("_", 2)(1)
  }
}
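Two inputs that the sample data does not exercise are a name that itself contains "_" and a null value, which a UDF taking a String can receive when the column is null. The following is a minimal sketch, not the author's code, of tolerant variants of the two helpers. The object name PrefixHelpers and the choice to pass null and unprefixed values through unchanged are assumptions made for illustration.

```scala
import scala.util.Random

object PrefixHelpers {
  // Prepend a random digit 0-9 and "_" to the value; a null input stays null.
  def addPrefix(value: String, random: Random = new Random()): String =
    if (value == null) null else s"${random.nextInt(10)}_$value"

  // Drop everything up to and including the first "_".
  // The limit 2 keeps any later underscores in the value intact.
  def removePrefix(value: String): String =
    if (value == null) null
    else value.split("_", 2) match {
      case Array(_, rest) => rest
      case _              => value // no "_" present: return the input unchanged
    }
}
```

Either helper can be registered in the same way as in the program above, for example spark.udf.register("rmPrefix", (field: String) => PrefixHelpers.removePrefix(field)).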
+--------+
|pre_name|
+--------+
|   5_tom|
|  3_jack|
|  3_lily|
|  0_lucy|
|  4_rose|
| 8_james|
|  9_kobe|
| 4_white|
| 8_black|
+--------+

+-----+
| name|
+-----+
|  tom|
| jack|
| lily|
| lucy|
| rose|
|james|
| kobe|
|white|
|black|
+-----+
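The prefixes in the first table change from run to run, because the UDF draws a random number on every call. Spark treats a Scala UDF as deterministic unless told otherwise, so a random UDF is usually flagged with asNondeterministic() before it is registered. The following is a sketch of that under stated assumptions: the object name NondeterministicPrefix and the helper registerAll are hypothetical, and the prefix logic is restated inline rather than taken from the program above.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.UserDefinedFunction
import org.apache.spark.sql.functions.udf
import scala.util.Random

object NondeterministicPrefix {
  // Same idea as randomPrefixUDF, wrapped as a UserDefinedFunction and flagged
  // as nondeterministic because its result changes from call to call.
  val addPrefix: UserDefinedFunction =
    udf((field: String) => s"${Random.nextInt(10)}_$field").asNondeterministic()

  // Registers the flagged UDF under the SQL name "addPrefix".
  def registerAll(spark: SparkSession): Unit = {
    spark.udf.register("addPrefix", addPrefix)
  }
}
```

After NondeterministicPrefix.registerAll(spark), the SQL text that calls addPrefix(name) stays the same; only the registration differs.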
If the functions are registered with a lambda whose parameter has no type annotation, like this:
spark.udf.register("addPrefix", field => randomPrefixUDF(field))
spark.udf.register("rmPrefix", field => removePrefixUDF(field))
compilation fails immediately, because the compiler cannot infer the type of the lambda parameter:
Error:(29, 37) missing parameter type
spark.udf.register("addPrefix", field => randomPrefixUDF(field))
Error:(30, 36) missing parameter type
spark.udf.register("rmPrefix", field => removePrefixUDF(field))
So the parameter type has to be written out on the lambda:
spark.udf.register("addPrefix", (field: String) => randomPrefixUDF(field))
spark.udf.register("rmPrefix", (field: String) => removePrefixUDF(field))
Or the type parameters can be passed to register explicitly (return type first, then argument type):
spark.udf.register[String, String]("addPrefix", field => randomPrefixUDF(field))
spark.udf.register[String, String]("rmPrefix", field => removePrefixUDF(field))
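The error can be reproduced without Spark. The sketch below uses a hypothetical stand-in named register that, like the call register[String, String](...) above, is generic in both the return type and the argument type. It is a simplification: the real Spark method has many overloads and also requires type tags, which are left out here.

```scala
object RegisterInference {
  // Hypothetical stand-in for a generic register method: the return type RT
  // and the argument type A1 are both type parameters.
  def register[RT, A1](name: String, func: A1 => RT): A1 => RT = func

  // Does not compile: nothing in the call tells the compiler what A1 is,
  // so the type of `field` is unknown ("missing parameter type").
  // val broken = register("upper", field => field.toUpperCase)

  // Fix 1: annotate the lambda parameter; A1 and RT are inferred from it.
  val typedLambda: String => String =
    register("upper", (field: String) => field.toUpperCase)

  // Fix 2: pass the type arguments explicitly, return type first, then argument type.
  val typedCall: String => Int =
    register[Int, String]("length", field => field.length)
}
```

Using Int and String in the second fix makes the order of the type arguments visible; with [String, String] the order cannot be seen.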
UDFs like the pair written here are often used when tuning Spark code.
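The text does not say which tuning scenario it has in mind. One common candidate, assumed here, is data skew: a hot key is salted with a random prefix so that its rows spread over several groups, a partial aggregate is computed per salted key, and the prefix is then removed before the final aggregate. The sketch below shows that two-stage shape on plain Scala collections so that it runs without a cluster. The names SaltedCountSketch and countByKey are hypothetical, and in Spark the two groupBy steps would be two aggregations over the salted and the unsalted column.

```scala
import scala.util.Random

object SaltedCountSketch {
  // Salt a key with a random digit 0-9, mirroring the addPrefix UDF.
  def addPrefix(key: String, random: Random): String = s"${random.nextInt(10)}_$key"

  // Strip the salt again, mirroring the rmPrefix UDF.
  def removePrefix(salted: String): String = salted.split("_", 2)(1)

  def countByKey(keys: Seq[String], random: Random = new Random()): Map[String, Int] = {
    // Stage 1: count per salted key; a hot key is spread over up to 10 groups.
    val partial: Map[String, Int] =
      keys.map(addPrefix(_, random)).groupBy(identity).map { case (k, vs) => k -> vs.size }
    // Stage 2: strip the salt and add up the partial counts per original key.
    partial.toSeq
      .groupBy { case (salted, _) => removePrefix(salted) }
      .map { case (key, parts) => key -> parts.map(_._2).sum }
  }
}
```

Calling SaltedCountSketch.countByKey on the age column of the sample data gives the same counts as a direct count per age, whatever prefixes are drawn.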