Case 1
File 1: test.py
# -*- coding: utf-8 -*-
import sys

# Echo every stdin line with only its trailing newline stripped
for line in sys.stdin:
    print line.strip('\n')
File 2: input.log
hello, world!
python udf
This is a test file
How to use sys.stdin
Execution result:
[hotel@username udftest]$ vim input.log
[hotel@username udftest]$ cat input.log | python test.py
hello, world!
python udf
This is a test file
How to use sys.stdin
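All the scripts in this post use Python 2 print statements. On a machine where python is Python 3, the same case would look like the following (a minimal sketch, assuming Python 3; rstrip('\n') likewise removes only the trailing newline):

# -*- coding: utf-8 -*-
import sys

# Python 3 version of case 1: print() is a function
for line in sys.stdin:
    print(line.rstrip('\n'))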
Case 2
File 1: test.py
# -*- coding: utf-8 -*-
import sys

# For each comma-separated line: echo the original line, then print the
# fields tab-separated with 100 added to the third field.
# Note: the line must be stripped before splitting, otherwise the last
# field keeps the trailing newline.
for line in sys.stdin:
    line = line.strip()
    print line
    v1, v2, v3, v4 = line.split(',')
    print '\t'.join([v1, v2, str(float(v3) + 100), v4])
File 2: input.log
10,11,12,13
1,2,3,4
100,101,102,103
001,001,003,004
Execution result:
[hotel@username udftest]$ cat input.log | python test.py
10,11,12,13
10      11      112.0   13
1,2,3,4
1       2       103.0   4
100,101,102,103
100     101     202.0   103
001,001,003,004
001     001     103.0   004
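The script above assumes every line has exactly four comma-separated fields and a numeric third field; a malformed line would raise a ValueError and kill the whole process. A slightly more defensive variant (a sketch, not part of the original example) could skip bad lines instead:

# -*- coding: utf-8 -*-
import sys

for line in sys.stdin:
    fields = line.strip().split(',')
    if len(fields) != 4:
        continue  # skip lines that do not have exactly 4 fields
    v1, v2, v3, v4 = fields
    try:
        v3 = str(float(v3) + 100)
    except ValueError:
        continue  # skip lines whose third field is not numeric
    print '\t'.join([v1, v2, v3, v4])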
Case 3: a Hive custom function via TRANSFORM
Part 1: test1.py
# -*- coding: utf-8 -*-
import sys

# Hive TRANSFORM streams each row to stdin as tab-separated fields;
# add 100 to the third field (quantity) and emit the row tab-separated
for line in sys.stdin:
    line = line.strip()
    v1, v2, v3, v4 = line.split('\t')
    print '\t'.join([v1, v2, str(float(v3) + 100), v4])
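Before registering the script in Hive, it can be smoke-tested locally by piping a tab-separated row into it (a hypothetical sample row, not from the original session):

printf '2016/1/1\t张三\t2\t500\n' | python test1.py

which should print 2016/1/1, 张三, 102.0, 500 separated by tabs.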
Part 2: createtable.sql
Table DDL:
-------- test table
USE databasename;
CREATE TABLE ym_test_table(
    hotelid   int    comment 'hotel ID',
    orderdate string comment 'order date',
    name      string comment 'guest name',
    quantity  int    comment 'room nights',
    price     double comment 'room price'
)
COMMENT 'ym test table 1'
STORED AS ORC;
select * from databasename.ym_test_table;
hotelid    orderdate    name    quantity    price
1001       2016/1/1     张三    2           500
1002       2016/2/1     李四    5           300
1002       2016/1/10    王五    3           800
1003       2016/2/10    赵六    1           500
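The post does not show how these rows were loaded; on Hive 0.14 or later they could be inserted with a plain INSERT ... VALUES (a hypothetical load step):

INSERT INTO TABLE databasename.ym_test_table VALUES
    (1001, '2016/1/1',  '张三', 2, 500),
    (1002, '2016/2/1',  '李四', 5, 300),
    (1002, '2016/1/10', '王五', 3, 800),
    (1003, '2016/2/10', '赵六', 1, 500);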
Part 3: Results
hive> add file test1.py;
Added resource: test1.py
hive> select transform(orderdate, name, quantity,price) using 'python test1.py' as (orderdate, name, quantity,price)
> from databasename.ym_test_table;
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1475054520611_10447211, Tracking URL = http://SVR8498HW1288.hadoop.sh2.ctripcorp.com:8088/proxy/application_1475054520611_10447211/
Kill Command = /usr/lib/hadoop/bin/hadoop job -kill job_1475054520611_10447211
Hadoop job information for Stage-1: number of mappers: 4; number of reducers: 0
2016-11-21 16:24:48,050 Stage-1 map = 0%, reduce = 0%
2016-11-21 16:24:57,480 Stage-1 map = 25%, reduce = 0%, Cumulative CPU 2.27 sec
2016-11-21 16:24:58,531 Stage-1 map = 50%, reduce = 0%, Cumulative CPU 4.89 sec
2016-11-21 16:25:02,705 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 12.67 sec
MapReduce Total cumulative CPU time: 12 seconds 670 msec
Ended Job = job_1475054520611_10447211
MapReduce Jobs Launched:
Stage-Stage-1: Map: 4 Cumulative CPU: 12.67 sec HDFS Read: 3679 HDFS Write: 120 SUCCESS
Total MapReduce CPU Time Spent: 12 seconds 670 msec
OK
orderdate     name    quantity    price
2016-02-10    赵六    101.0       500.0
2016-02-01    李四    105.0       300.0
2016-01-10    王五    103.0       800.0
2016-01-01    张三    102.0       500.0
Time taken: 34.082 seconds, Fetched: 4 row(s)
The same query can be rerun with the output column renamed in the AS clause (quantity becomes quantity_add):
hive> select transform(orderdate, name, quantity,price) using 'python test1.py' as (orderdate, name, quantity_add,price)
> from databasename.ym_test_table;
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1475054520611_10448031, Tracking URL = http://SVR8498HW1288.hadoop.sh2.ctripcorp.com:8088/proxy/application_1475054520611_10448031/
Kill Command = /usr/lib/hadoop/bin/hadoop job -kill job_1475054520611_10448031
Hadoop job information for Stage-1: number of mappers: 4; number of reducers: 0
2016-11-21 16:33:36,364 Stage-1 map = 0%, reduce = 0%
2016-11-21 16:33:45,853 Stage-1 map = 25%, reduce = 0%, Cumulative CPU 2.72 sec
2016-11-21 16:33:47,973 Stage-1 map = 75%, reduce = 0%, Cumulative CPU 8.37 sec
2016-11-21 16:33:49,030 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 11.5 sec
MapReduce Total cumulative CPU time: 11 seconds 500 msec
Ended Job = job_1475054520611_10448031
MapReduce Jobs Launched:
Stage-Stage-1: Map: 4 Cumulative CPU: 11.5 sec HDFS Read: 3679 HDFS Write: 120 SUCCESS
Total MapReduce CPU Time Spent: 11 seconds 500 msec
OK
orderdate     name    quantity_add    price
2016-02-10    赵六    101.0           500.0
2016-02-01    李四    105.0           300.0
2016-01-10    王五    103.0           800.0
2016-01-01    张三    102.0           500.0
Time taken: 36.112 seconds, Fetched: 4 row(s)
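One detail worth knowing: by default Hive serializes TRANSFORM input columns as tab-separated text and writes NULL as the literal string \N, so a script that does arithmetic should guard against it. A sketch of a NULL-tolerant variant of test1.py (not part of the original example):

# -*- coding: utf-8 -*-
import sys

# Pass NULL quantities through unchanged instead of crashing on float('\N')
for line in sys.stdin:
    v1, v2, v3, v4 = line.strip().split('\t')
    if v3 == '\\N':  # Hive's default NULL marker in TRANSFORM streams
        print '\t'.join([v1, v2, v3, v4])
    else:
        print '\t'.join([v1, v2, str(float(v3) + 100), v4])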
Addendum: other templates
1. Calling Python, shell, and other languages
For example, the SQL below uses weekday_mapper.py to process the data:
CREATE TABLE u_data_new (
    userid INT,
    movieid INT,
    rating INT,
    weekday INT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t';
add FILE weekday_mapper.py;
INSERT OVERWRITE TABLE u_data_new
SELECT
TRANSFORM (userid, movieid, rating, unixtime)
USING 'python weekday_mapper.py'
AS (userid, movieid, rating, weekday)
FROM u_data;
where weekday_mapper.py contains:
import sys
import datetime

# Convert the unix-timestamp column into an ISO weekday number (1 = Monday)
for line in sys.stdin:
    line = line.strip()
    userid, movieid, rating, unixtime = line.split('\t')
    weekday = datetime.datetime.fromtimestamp(float(unixtime)).isoweekday()
    print '\t'.join([userid, movieid, rating, str(weekday)])
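As with the earlier cases, the mapper can be tried locally before wiring it into Hive, for example with a made-up sample row (the resulting weekday depends on the machine's timezone, since fromtimestamp converts to local time):

printf '196\t242\t3\t881250949\n' | python weekday_mapper.py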
The next example instead uses the shell's cat command (an identity transform) to process the data:
FROM invites a
INSERT OVERWRITE TABLE events
SELECT TRANSFORM(a.foo, a.bar) AS (oof, rab) USING '/bin/cat'
WHERE a.ds > '2008-08-09';