Calling scripts from Hive: using scripts (such as Python and shell) for map and reduce in Hive statements

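To run map and reduce steps with a script (such as Python or shell) inside a Hive statement, use the TRANSFORM clause (or the MAP and REDUCE keywords) together with a script file registered through add file.

The as keyword before an alias is optional; a space is enough, for example: table app_stats t1, app_data t2;

A small example first: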

add file ${python_script_path}/lanch_interval_count.py;

drop table temp_lanch_interval2;

create table temp_lanch_interval2 as
select reportdate, appid, channelname, app_version, deviceid, ts, sameday
from (
    from (
        from (
            select fl.reportdate, fl.appid, 1 as app_version, fn.channelname, fl.deviceid, fl.linux_time
            from (
                select reportdate, appid, app_version, deviceid, linux_time
                from factloglanch
                where dt >= ? and dt <= ?
            ) fl
            left outer join factnewuser_nodimid fn
                on (fl.deviceid = fn.deviceid and fl.appid = fn.appid)
        ) a
        map reportdate, appid, channelname, app_version, deviceid, linux_time
        using '/bin/cat'
        as reportdate, appid, channelname, app_version, deviceid, linux_time
        cluster by appid, channelname, deviceid
    ) b
    reduce reportdate, appid, channelname, app_version, deviceid, linux_time
    using 'lanch_interval_count.py'
    as reportdate, appid, app_version, channelname, deviceid, ts, sameday
) c;
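The reducer script lanch_interval_count.py itself is not shown in this post. Purely as an illustration, a reduce-side script for a query of this shape might look like the sketch below. It rests on assumptions the post does not confirm: that ts is the gap between two consecutive launches of one device, that linux_time is an integer epoch timestamp, and that sameday is 1 when both launches share the same reportdate. It also relies on cluster by delivering all rows with the same (appid, channelname, deviceid) to one reducer as a contiguous run. The function name reduce_intervals and the sample rows are made up for the demo.

```python
import io
import itertools


def group_key(row):
    # Same columns as in "cluster by appid, channelname, deviceid".
    return (row[1], row[2], row[4])


def reduce_intervals(lines):
    """Yield output lines for input already clustered by (appid, channelname, deviceid)."""
    # Input column order follows the REDUCE clause:
    # reportdate, appid, channelname, app_version, deviceid, linux_time
    rows = (line.rstrip("\n").split("\t") for line in lines)
    for _, group in itertools.groupby(rows, key=group_key):
        # cluster by does not order rows by time, so sort inside each group.
        launches = sorted(group, key=lambda r: int(r[5]))
        for prev, cur in zip(launches, launches[1:]):
            reportdate, appid, channelname, app_version, deviceid, _ = cur
            ts = int(cur[5]) - int(prev[5])          # assumed meaning of ts
            sameday = 1 if prev[0] == cur[0] else 0  # assumed meaning of sameday
            # Output column order follows the AS clause of the REDUCE step.
            yield "\t".join([reportdate, appid, app_version, channelname,
                             deviceid, str(ts), str(sameday)])


# Demo on made-up rows; a real script would pass sys.stdin instead.
sample = io.StringIO(
    "2014-01-01\tapp1\tchA\t1\tdev1\t1000\n"
    "2014-01-01\tapp1\tchA\t1\tdev1\t1600\n"
    "2014-01-02\tapp1\tchA\t1\tdev1\t90000\n"
)
for out in reduce_intervals(sample):
    print(out)
```

Inside Hive the same function would be driven by sys.stdin, with each yielded line printed to standard output.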

TRANSFORM in Hive: using scripts to perform Map/Reduce

hive> select * from test;

OK

1 3

2 2

3 1

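Suppose we want the MD5 hash of every column. Older Hive versions have no built-in UDF for this, so a short Python script does the job:

#!/home/tops/bin/python
# Written for Python 3; the interpreter path above is specific to the author's environment.
import sys
import hashlib

for line in sys.stdin:
    # Hive passes each row as one line with columns separated by tabs.
    arr = line.rstrip("\n").split("\t")
    md5_arr = []
    for a in arr:
        # hashlib.md5 needs bytes, so encode each column value first.
        md5_arr.append(hashlib.md5(a.encode("utf-8")).hexdigest())
    print("\t".join(md5_arr))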

To use a script (for example Python or shell) in Hive, first register it:

add file /xxxx/test.py

Then call it from the query with the TRANSFORM syntax:

SELECT TRANSFORM (col1, col2) USING './test.py' AS (new1, new2) FROM test;
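To check the script's logic without a cluster, the TRANSFORM contract can be imitated locally: each input row reaches the script as one line with the columns joined by tabs, and each output line is split on tabs into the AS columns. The sketch below is only a local approximation of that contract; md5_row and simulate_transform are hypothetical helper names, not part of Hive.

```python
import hashlib


def md5_row(line):
    """Map one tab-separated input line to a line of per-column MD5 hex digests."""
    cols = line.rstrip("\n").split("\t")
    return "\t".join(hashlib.md5(c.encode("utf-8")).hexdigest() for c in cols)


def simulate_transform(rows, func):
    """Serialize rows as tab-joined strings, apply func, split the result on tabs."""
    for row in rows:
        line = "\t".join(str(v) for v in row) + "\n"
        yield tuple(func(line).split("\t"))


# The three rows of the test table shown earlier.
for new1, new2 in simulate_transform([(1, 3), (2, 2), (3, 1)], md5_row):
    print(new1, new2)
```

Keeping the per-line logic in a plain function such as md5_row makes it easy to unit-test; the script that Hive runs then only needs to loop over sys.stdin and print the result for each line.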

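Here AS names the output columns, in order. If the AS clause is omitted, Hive treats everything before the first tab of each output line as key and everything after it as value.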
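The behaviour without AS can be pictured as a split on the first tab only. The sketch below mimics that rule in plain Python; split_key_value is a hypothetical helper for illustration, not a Hive API, and returning None for a line without any tab is an assumption about how Hive fills the missing value column.

```python
def split_key_value(line):
    """Split a script output line as when AS is omitted:
    the text before the first tab is the key, the remainder is the value."""
    key, sep, value = line.rstrip("\n").partition("\t")
    # Assumption: with no tab at all, the value column comes back as NULL.
    return key, (value if sep else None)


print(split_key_value("a\tb\tc\n"))
print(split_key_value("only_key\n"))
```

Note that the value keeps any further tabs, which differs from writing AS (key, value) explicitly, where value would hold only the second tab-separated field.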

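Note: by default, TRANSFORM uses \t as the delimiter in both directions. Columns are passed to the script separated by \t, and the script's output is split on \t. (A ROW FORMAT clause on the TRANSFORM can override this default.) This can cause trouble in combination with INSERT [OVERWRITE] TABLE when the target table's delimiter is not \t but something else, such as ';'.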
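Because the script sees nothing but tab-separated text, it helps to keep parsing and formatting in one place. The sketch below assumes Hive's default serialization for TRANSFORM, in which every column arrives as a string and NULL arrives as the literal two-character text \N; parse_row and format_row are hypothetical helper names.

```python
HIVE_NULL = "\\N"  # the two characters backslash and N


def parse_row(line):
    """Split one input line on tabs and turn the NULL marker back into None."""
    cols = line.rstrip("\n").split("\t")
    return [None if c == HIVE_NULL else c for c in cols]


def format_row(values):
    """Join values with tabs, writing None as the NULL marker.

    Tabs and newlines inside a value are replaced by spaces so that
    they cannot be mistaken for column or row delimiters.
    """
    cells = []
    for v in values:
        if v is None:
            cells.append(HIVE_NULL)
        else:
            cells.append(str(v).replace("\t", " ").replace("\n", " "))
    return "\t".join(cells)


row = parse_row("1\t\\N\thello\n")
print(row)
print(format_row(row + ["a\tb"]))
```

Replacing embedded tabs with spaces is one possible policy; dropping or escaping them would work as well, as long as the output never contains a stray delimiter.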

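Using the MAP and REDUCE keywords directly:

The syntax is SELECT MAP (…) USING 'xx.py'.

MAP and REDUCE are nothing more than aliases for TRANSFORM; Hive does not guarantee that the script is actually invoked in a map or a reduce phase. Here is what the official documentation says: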

Formally, MAP ... and REDUCE ... are syntactic transformations of SELECT TRANSFORM ( ... ). In other words, they serve as comments or notes to the reader of the query. BEWARE: Use of these keywords may be dangerous as (e.g.) typing "REDUCE" does not force a reduce phase to occur and typing "MAP" does not force a new map phase!

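Mixing the MAP and REDUCE keywords can therefore be misleading, so it is better to use TRANSFORM throughout.

If the command is not a script file but a system command such as awk or sed, it can be used directly (no add file needed), for example: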

map reportdate, appid, channelname, app_version, deviceid, linux_time
using '/bin/cat'
as reportdate, appid, channelname, app_version, deviceid, linux_time
cluster by appid, channelname, deviceid

If the table contains complex types such as MAP or ARRAY, for example:

CREATE TABLE features
(
    id BIGINT,
    norm_features MAP
);

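then working with them through TRANSFORM means making the script's output match the corresponding format. In Python that means printing the row in that format, with each complex type written using its own delimiters, for example the key/value separator of the MAP type.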

SELECT TRANSFORM(stuff)
USING 'script'
AS (thing1 INT, thing2 MAP)
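For the MAP column, a script could emit its entries as in the sketch below. It assumes Hive's default delimiters for nested data in text rows, namely \002 (Ctrl-B) between map entries and \003 (Ctrl-C) between a key and its value, with \t still separating the top-level columns. If the query overrides the row format, the delimiters have to match that instead. format_map and format_row are hypothetical helper names, and the sample values are made up.

```python
ENTRY_SEP = "\x02"  # assumed default delimiter between map entries
KV_SEP = "\x03"     # assumed default delimiter between a key and its value


def format_map(d):
    """Serialize a dict as key<KV_SEP>value pairs joined by ENTRY_SEP."""
    return ENTRY_SEP.join("{}{}{}".format(k, KV_SEP, v) for k, v in d.items())


def format_row(thing1, thing2):
    """Build one output line: thing1, a tab, then the serialized map."""
    return "{}\t{}".format(thing1, format_map(thing2))


line = format_row(7, {"a": 0.5, "b": 1.5})
print(repr(line))  # repr makes the control characters visible
```

An ARRAY column would follow the same idea, with its elements joined by the entry delimiter alone.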
