How to Write a Hive UDF in Python

Compared with writing a UDF in Java, writing one in Python is much simpler, and from a data-processing standpoint Python is often the better fit. So how do you write a UDF in Python?

Usage:

(1) Upload the Python script to the server
(2) Add the Python file
(3) Call the function: TRANSFORM (data) USING "python udf_test.py" as (name, address)
Compared with writing the UDF in Java, this skips the packaging and temporary-function creation steps.
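
Before the examples, it helps to know the contract a TRANSFORM script follows: each input row arrives on the script's standard input as one tab-separated line, and each tab-separated line printed to standard output becomes one output row. Here is a minimal sketch of that contract (the file name udf_echo.py is made up for illustration):

#!/usr/bin/python
# coding=utf-8
# Minimal sketch of the TRANSFORM stdin/stdout contract (hypothetical udf_echo.py):
# each input row is one '\t'-separated line on stdin; each '\t'-separated line
# printed to stdout becomes one output row.
import sys

for line in sys.stdin:
    fields = line.strip('\n').split('\t')  # input columns arrive as strings
    print('\t'.join(fields))               # echo them back unchanged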

Example 1:

Write the Python code:

#!/usr/bin/python
# coding=utf-8

import sys


for line in sys.stdin:
    # read a row from stdin, strip surrounding whitespace, then split it on ','
    name, address = line.strip().split(',')
    # print the output row, fields joined by '\t'
    print('\t'.join([name, address]))

Note: a script can take multiple input fields and emit multiple output fields; here a single input field is split into two output fields.
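
Also worth noting: the two-variable unpacking above assumes every row contains exactly one comma. A slightly more defensive sketch (my own variant, not part of the original) skips malformed rows instead of failing the whole job:

#!/usr/bin/python
# coding=utf-8
# Defensive variant (assumption: malformed rows should be skipped, not fatal).
import sys

for line in sys.stdin:
    parts = line.strip().split(',', 1)  # split on the first comma only
    if len(parts) != 2:
        # report to stderr so bad rows don't pollute the tab-separated output
        sys.stderr.write('skipping malformed line: %s\n' % line)
        continue
    print('\t'.join(parts))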

Add the Python file:

spark-sql> add file /home/hadoop/python/udf_test.py;
Time taken: 0.395 seconds

Call the Python file in a query:

spark-sql> with test1 as
         > (select 'xiaohong,beijing' as data
         > union all
         > select 'xiaolan,shanghai' as data)
         > select TRANSFORM (data) USING "python udf_test.py" as (name,address) from test1;
xiaohong        beijing
xiaolan         shanghai
Time taken: 3.548 seconds, Fetched 2 row(s)

Note: everything comes back as string type here; if you want other types, declare them in the as clause, as shown below.

spark-sql> with test1 as
         > (select 'xiaohong,10' as data
         > union all
         > select 'xiaolan,12' as data)
         > select TRANSFORM (data) USING "python udf_test.py" as (name string,age int) from test1;
xiaohong        10
xiaolan         12
Time taken: 4.838 seconds, Fetched 2 row(s)

Example 2:

Write the Python code:

#!/usr/bin/python
# coding=utf-8
import sys

# generator: parse each input line into a list of fields
def read_input(file, separator):
    for line in file:
        try:
            yield line.strip().split(separator)
        except Exception:
            # write errors to stderr; printing to stdout would corrupt the output rows
            sys.stderr.write('error line: %s\n' % line)


def main(separator='\t'):
    dataline = read_input(sys.stdin, separator)
    # for local testing: dataline = read_input(open('D:\\data\\test5.txt').readlines(), separator)

    for bonus1, bonus2, bonus3 in dataline:
        trans_bonus1 = float(bonus1) * 0.6
        trans_bonus2 = float(bonus2) * 0.8
        trans_bonus3 = float(bonus3) * 0.9
        print('%s\t%s\t%s' % (trans_bonus1, trans_bonus2, trans_bonus3))


if __name__ == '__main__':
    main()
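
Before running it on the cluster, you can sanity-check the logic locally, in the spirit of the commented-out line in main(). A small sketch (assuming the script above is saved as udf_test2.py in the current directory):

#!/usr/bin/python
# coding=utf-8
# Local smoke test (hypothetical): feed sample tab-separated lines through the
# same generator without a cluster; importing is safe because main() is guarded
# by the __name__ check.
import io

from udf_test2 import read_input

sample = io.StringIO(u'10.22\t15.32\t18.36\n8.21\t9.36\t7.56\n')
for bonus1, bonus2, bonus3 in read_input(sample, '\t'):
    print('%s\t%s\t%s' % (float(bonus1) * 0.6, float(bonus2) * 0.8, float(bonus3) * 0.9))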

Add the Python file and run the query:

spark-sql> add file /home/hadoop/python/udf_test2.py;
Time taken: 0.367 seconds
spark-sql> with test1 as
         > (select '10.22' as bonus1,'15.32' as bonus2,'18.36' as bonus3
         > union all
         > select '8.21' as bonus1,'9.36' as bonus2,'7.56' as bonus3)
         > select TRANSFORM (bonus1,bonus2,bonus3) USING "python udf_test2.py" as (trans_bonus1,trans_bonus2,trans_bonus3) from test1;
6.132           12.256          16.524
4.926           7.488           6.804
Time taken: 6.259 seconds, Fetched 2 row(s)

Note: the input columns must be of string type; otherwise the query throws an error.
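
Relatedly, the script itself will crash on any value that float() cannot parse. A hedged sketch of a more tolerant conversion (my own variant, not part of the original example):

#!/usr/bin/python
# coding=utf-8
# Hypothetical variant that tolerates non-numeric input instead of aborting
# the whole TRANSFORM job.
import sys

def to_float(value):
    # assumption: falling back to 0.0 for unparseable values is acceptable here
    try:
        return float(value)
    except ValueError:
        sys.stderr.write('not a number: %s\n' % value)
        return 0.0

for line in sys.stdin:
    bonus1, bonus2, bonus3 = line.strip().split('\t')
    print('%s\t%s\t%s' % (to_float(bonus1) * 0.6,
                          to_float(bonus2) * 0.8,
                          to_float(bonus3) * 0.9))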
