Implementing Hive custom functions in Python

Case 1
File 1: test.py
# -*- coding: utf-8 -*-
import sys

# Echo every line read from standard input, minus its trailing newline.
for line in sys.stdin:
    print(line.strip('\n'))




File 2: input.log
hello, world!
python udf
This is a test file
How to use sys.stdin




Execution result:
[hotel@username udftest]$ vim input.log
[hotel@username udftest]$ cat input.log | python test.py
hello, world!
python udf
This is a test file
How to use sys.stdin




Case 2
File 1: test.py
# -*- coding: utf-8 -*-
import sys

for line in sys.stdin:
    line = line.strip()  # drop the trailing newline so the last field stays clean
    print(line)          # echo the original record
    v1, v2, v3, v4 = line.split(',')
    # add 100 to the third field and re-emit the record, tab-separated
    print('\t'.join([v1, v2, str(float(v3) + 100), v4]))


File 2: input.log
10,11,12,13
1,2,3,4
100,101,102,103
001,001,003,004


Execution result:
[hotel@username udftest]$ cat input.log | python test.py
10,11,12,13
10 11 112.0 13
1,2,3,4
1 2 103.0 4
100,101,102,103
100 101 202.0 103
001,001,003,004
001 001 103.0 004
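Real input is rarely this clean. A slightly more defensive variant (a sketch, not part of the original cases; the four-field, comma-delimited layout is just this example's assumption) skips blank and malformed records instead of crashing the whole stream:

# -*- coding: utf-8 -*-
import sys

for line in sys.stdin:
    line = line.strip()
    if not line:
        # ignore blank lines
        continue
    fields = line.split(',')
    if len(fields) != 4:
        # report and skip records with an unexpected field count
        sys.stderr.write('skipping malformed record: %s\n' % line)
        continue
    v1, v2, v3, v4 = fields
    try:
        v3 = str(float(v3) + 100)
    except ValueError:
        sys.stderr.write('skipping non-numeric record: %s\n' % line)
        continue
    print('\t'.join([v1, v2, v3, v4]))

Writing diagnostics to stderr matters once the script runs inside Hive: everything on stdout is treated as output rows, while stderr ends up in the task logs.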




Case 3: a Hive custom function
part1: test1.py
# -*- coding: utf-8 -*-
import sys

# Hive's TRANSFORM streams each row to stdin as tab-separated fields
# and reads tab-separated result rows back from stdout.
for line in sys.stdin:
    line = line.strip()
    v1, v2, v3, v4 = line.split('\t')
    print('\t'.join([v1, v2, str(float(v3) + 100), v4]))
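One detail worth knowing before wiring test1.py into Hive: NULL columns are serialized to the TRANSFORM script as the literal string '\N' (Hive's default null marker), so float(v3) would raise a ValueError on such rows. A sketch of a loop that passes NULLs through unchanged (the '\N' convention is Hive's default; the rest is illustrative):

# -*- coding: utf-8 -*-
import sys

HIVE_NULL = '\\N'  # Hive's default on-the-wire encoding for NULL

for line in sys.stdin:
    v1, v2, v3, v4 = line.strip('\n').split('\t')
    if v3 == HIVE_NULL:
        # pass the row through untouched rather than crash on float('\N')
        print('\t'.join([v1, v2, v3, v4]))
    else:
        print('\t'.join([v1, v2, str(float(v3) + 100), v4]))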


part2: createtable.sql

Table creation statement:
-------- test table
USE databasename;
CREATE TABLE ym_test_table(
     hotelid  int comment 'hotel ID',
     orderdate   string comment 'order date',
     name  string comment 'guest name',
     quantity int  comment 'room nights',
     price double comment 'room price'
     )
 COMMENT 'ym test function 1'
 STORED AS ORC;
 
select * from databasename.ym_test_table;
hotelid  orderdate  name  quantity  price
1001     2016/1/1   张三  2         500
1002     2016/2/1   李四  5         300
1002     2016/1/10  王五  3         800
1003     2016/2/10  赵六  1         500


part3: query results


hive> add file test1.py;
Added resource: test1.py
hive> select transform(orderdate, name, quantity,price) using 'python test1.py' as (orderdate, name, quantity,price)
    > from databasename.ym_test_table; 
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1475054520611_10447211, Tracking URL = http://SVR8498HW1288.hadoop.sh2.ctripcorp.com:8088/proxy/application_1475054520611_10447211/
Kill Command = /usr/lib/hadoop/bin/hadoop job  -kill job_1475054520611_10447211
Hadoop job information for Stage-1: number of mappers: 4; number of reducers: 0
2016-11-21 16:24:48,050 Stage-1 map = 0%,  reduce = 0%
2016-11-21 16:24:57,480 Stage-1 map = 25%,  reduce = 0%, Cumulative CPU 2.27 sec
2016-11-21 16:24:58,531 Stage-1 map = 50%,  reduce = 0%, Cumulative CPU 4.89 sec
2016-11-21 16:25:02,705 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 12.67 sec
MapReduce Total cumulative CPU time: 12 seconds 670 msec
Ended Job = job_1475054520611_10447211
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 4   Cumulative CPU: 12.67 sec   HDFS Read: 3679 HDFS Write: 120 SUCCESS
Total MapReduce CPU Time Spent: 12 seconds 670 msec
OK
orderdate name quantity price
2016-02-10 赵六 101.0 500.0
2016-02-01 李四 105.0 300.0
2016-01-10 王五 103.0 800.0
2016-01-01 张三 102.0 500.0
Time taken: 34.082 seconds, Fetched: 4 row(s)


The AS clause also lets the output columns be renamed; the same query below returns the transformed quantity column as quantity_add:

hive> select transform(orderdate, name, quantity,price) using 'python test1.py' as (orderdate, name, quantity_add,price)
    > from databasename.ym_test_table; 
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1475054520611_10448031, Tracking URL = http://SVR8498HW1288.hadoop.sh2.ctripcorp.com:8088/proxy/application_1475054520611_10448031/
Kill Command = /usr/lib/hadoop/bin/hadoop job  -kill job_1475054520611_10448031
Hadoop job information for Stage-1: number of mappers: 4; number of reducers: 0
2016-11-21 16:33:36,364 Stage-1 map = 0%,  reduce = 0%
2016-11-21 16:33:45,853 Stage-1 map = 25%,  reduce = 0%, Cumulative CPU 2.72 sec
2016-11-21 16:33:47,973 Stage-1 map = 75%,  reduce = 0%, Cumulative CPU 8.37 sec
2016-11-21 16:33:49,030 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 11.5 sec
MapReduce Total cumulative CPU time: 11 seconds 500 msec
Ended Job = job_1475054520611_10448031
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 4   Cumulative CPU: 11.5 sec   HDFS Read: 3679 HDFS Write: 120 SUCCESS
Total MapReduce CPU Time Spent: 11 seconds 500 msec
OK
orderdate name quantity_add price
2016-02-10 赵六 101.0 500.0
2016-02-01 李四 105.0 300.0
2016-01-10 王五 103.0 800.0
2016-01-01 张三 102.0 500.0
Time taken: 36.112 seconds, Fetched: 4 row(s)




Appendix: other templates
1. Calling out to Python, shell, and other languages
The SQL below uses weekday_mapper.py to process the data:
CREATE TABLE u_data_new (
  userid INT,
  movieid INT,
  rating INT,
  weekday INT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t';
 
add FILE weekday_mapper.py;
 
INSERT OVERWRITE TABLE u_data_new
SELECT
  TRANSFORM (userid, movieid, rating, unixtime)
  USING 'python weekday_mapper.py'
  AS (userid, movieid, rating, weekday)
FROM u_data;
where weekday_mapper.py contains:
import sys
import datetime

# Replace the unix timestamp in the fourth column with its ISO weekday (1-7).
for line in sys.stdin:
    line = line.strip()
    userid, movieid, rating, unixtime = line.split('\t')
    weekday = datetime.datetime.fromtimestamp(float(unixtime)).isoweekday()
    print('\t'.join([userid, movieid, rating, str(weekday)]))
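As with the earlier cases, the mapper can be smoke-tested locally before registering it with add FILE, by piping a sample record through it (the record below is invented; the last field is a unix timestamp):

echo -e '1\t1193\t5\t978300760' | python weekday_mapper.py

This prints the same row with the timestamp replaced by its ISO weekday (1 = Monday ... 7 = Sunday).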
 
The next example instead uses the shell's cat command as the transform:
FROM invites a INSERT OVERWRITE TABLE events SELECT TRANSFORM(a.foo, a.bar) AS (oof, rab) USING '/bin/cat' WHERE a.ds > '2008-08-09';