Compared with writing a UDF in Java, writing one in Python is much simpler, and from a data-processing point of view Python is also the better fit. So how do you write a UDF in Python?
(1) Upload the Python script to the server
(2) Add the Python file
(3) Call it in the query: TRANSFORM (data) USING "python udf_test.py" as (name,address)
Compared with writing the UDF in Java, this skips the packaging and create-temporary-function steps. The stdin/stdout contract the script has to follow is sketched right below.
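Under the hood, TRANSFORM writes each input row to the script's standard input with the columns joined by tab characters, and reads tab-separated columns back from the script's standard output. A minimal pass-through sketch of that contract (not one of the scripts used below):
#!/usr/bin/python
# coding=utf-8
import sys

# Echo every row back unchanged: one row per line on stdin, columns separated
# by '\t', and the same layout expected on stdout.
for line in sys.stdin:
    columns = line.strip('\n').split('\t')
    print('\t'.join(columns))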
Write the Python code:
#!/usr/bin/python
# coding=utf-8
import sys

for line in sys.stdin:
    # Read a row from standard input, strip surrounding whitespace, then split on ','
    name, address = line.strip().split(',')
    # Emit the output fields separated by '\t'
    print('\t'.join([name, address]))
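The script can be checked locally before registering it. A hedged sketch (assuming Python 3 on the local machine and udf_test.py in the current directory):
import subprocess

# Pipe one sample row through the script, the same way TRANSFORM would,
# and print what comes back (expected: "xiaohong\tbeijing").
result = subprocess.run(
    ['python', 'udf_test.py'],
    input='xiaohong,beijing\n',
    capture_output=True,
    text=True,
)
print(result.stdout)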
Note: a script can take multiple input fields and emit multiple output fields; here the single input field is split into two output fields.
Add the Python file:
spark-sql> add file /home/hadoop/python/udf_test.py;
Time taken: 0.395 seconds
Use the Python file in a query:
spark-sql> with test1 as
> (select 'xiaohong,beijing' as data
> union all
> select 'xiaolan,shanghai' as data)
> select TRANSFORM (data) USING "python udf_test.py" as (name,address) from test1;
xiaohong beijing
xiaolan shanghai
Time taken: 3.548 seconds, Fetched 2 row(s)
Note: the output columns are all of string type; to get typed columns, declare the types in the AS clause as shown below.
spark-sql> with test1 as
> (select 'xiaohong,10' as data
> union all
> select 'xiaolan,12' as data)
> select TRANSFORM (data) USING "python udf_test.py" as (name string,age int) from test1;
xiaohong 10
xiaolan 12
Time taken: 4.838 seconds, Fetched 2 row(s)
A second example: the script below reads three tab-separated bonus fields and scales each one:
#!/usr/bin/python
# coding=utf-8
import sys

# Generator: split each input line on the given separator
def read_input(file, separator):
    for line in file:
        try:
            yield line.strip().split(separator)
        except Exception:
            # Report bad lines on stderr so they do not pollute the output rows
            sys.stderr.write("error line\n")

def main(separator='\t'):
    dataline = read_input(sys.stdin, separator)
    # dataline = read_input(open('D:\\data\\test5.txt').readlines(), separator)
    for bonus1, bonus2, bonus3 in dataline:
        trans_bonus1 = float(bonus1) * 0.6
        trans_bonus2 = float(bonus2) * 0.8
        trans_bonus3 = float(bonus3) * 0.9
        print('%s\t%s\t%s' % (trans_bonus1, trans_bonus2, trans_bonus3))

if __name__ == '__main__':
    main()
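As before, the script can be exercised locally first. A hedged sketch (assuming udf_test2.py is in the current directory; the sample input joins the three columns with tabs, exactly as TRANSFORM sends them):
import subprocess

# Pipe one tab-separated row through the script and print the three
# scaled values it writes back.
result = subprocess.run(
    ['python', 'udf_test2.py'],
    input='10.22\t15.32\t18.36\n',
    capture_output=True,
    text=True,
)
print(result.stdout)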
spark-sql> add file /home/hadoop/python/udf_test2.py;
Time taken: 0.367 seconds
spark-sql> with test1 as
> (select '10.22' as bonus1,'15.32' as bonus2,'18.36' as bonus3
> union all
> select '8.21' as bonus1,'9.36' as bonus2,'7.56' as bonus3)
> select TRANSFORM (bonus1,bonus2,bonus3) USING "python udf_test2.py" as (trans_bonus1,trans_bonus2,trans_bonus3) from test1;
6.132 12.256 16.524
4.926 7.488 6.804
Time taken: 6.259 seconds, Fetched 2 row(s)
Note: the input columns passed to TRANSFORM here must be string type, otherwise an error is thrown; if the source columns are numeric, one workaround is to cast them to string in the query before the TRANSFORM clause.