HIVE函数(json_tuple)与(parse_url_tuple)

1.json_tuple函数

json_tnple函数用来处理json数据
使用方法:
json_tuple(jsonStr, p1, p2, …, pn) - like get_json_object, but it takes multiple names and return a tuple. All the input parameters and output column types are string.

创建一张表用来存储json数据
hive (ruozedata_d7)> create table  IF NOT EXISTS rating_json(json string);
OK
Time taken: 0.023 seconds
hive (ruozedata_d7)> load data local inpath '/home/hadoop/rating.json' overwrite into table rating_json;
hive (ruozedata_d7)> select * from rating_json limit 10;
OK
rating_json.json
{"movie":"1193","rate":"5","time":"978300760","userid":"1"}
{"movie":"661","rate":"3","time":"978302109","userid":"1"}
{"movie":"914","rate":"3","time":"978301968","userid":"1"}
{"movie":"3408","rate":"4","time":"978300275","userid":"1"}
{"movie":"2355","rate":"5","time":"978824291","userid":"1"}
{"movie":"1197","rate":"3","time":"978302268","userid":"1"}
{"movie":"1287","rate":"5","time":"978302039","userid":"1"}
{"movie":"2804","rate":"5","time":"978300719","userid":"1"}
{"movie":"594","rate":"4","time":"978302268","userid":"1"}
{"movie":"919","rate":"4","time":"978301368","userid":"1"}
Time taken: 0.056 seconds, Fetched: 10 row(s)
可以看到通过json_tuple函数,将json分割成了一个个字段
hive (ruozedata_d7)> select 
                   > json_tuple(json,'movie','rate','time','userid') as (movie_id,rate,time,user_id) 
                   > from rating_json limit 10;
OK
movie_id	rate	time	user_id
1193	5	978300760	1
661	3	978302109	1
914	3	978301968	1
3408	4	978300275	1
2355	5	978824291	1
1197	3	978302268	1
1287	5	978302039	1
2804	5	978300719	1
594	4	978302268	1
919	4	978301368	1
Time taken: 0.676 seconds, Fetched: 10 row(s)
然后我们通过json_tnple创建一张大宽表,你后续需要的所有的字段我全部给你准备完毕
hive (ruozedata_d7)> create table rating_width as
                   > select
                   > movie_id,rate,time,user_id,
                   > year(from_unixtime(cast(time as bigint))) as year,
                   > month(from_unixtime(cast(time as bigint))) as month,
                   > day(from_unixtime(cast(time as bigint))) as day,
                   > hour(from_unixtime(cast(time as bigint))) as hour,
                   > minute(from_unixtime(cast(time as bigint))) as minute,
                   > from_unixtime(cast(time as bigint)) as ts
                   > from
                   > (
                   > select 
                   > json_tuple(json,'movie','rate','time','userid') as (movie_id,rate,time,user_id) 
                   > from rating_json 
                   > ) tmp
                   > ;
可以看到我们创建好之后的字段
hive (ruozedata_d7)> select * from rating_width limit 10;
OK
rating_width.movie_id	rating_width.rate	rating_width.time	rating_width.user_id	rating_width.year	rating_width.month	rating_width.day	rating_width.hourrating_width.minute	rating_width.ts
1193	5	978300760	1	2001	1	1	6	12	2001-01-01 06:12:40
661	3	978302109	1	2001	1	1	6	35	2001-01-01 06:35:09
914	3	978301968	1	2001	1	1	6	32	2001-01-01 06:32:48
3408	4	978300275	1	2001	1	1	6	4	2001-01-01 06:04:35
2355	5	978824291	1	2001	1	7	7	38	2001-01-07 07:38:11
1197	3	978302268	1	2001	1	1	6	37	2001-01-01 06:37:48
1287	5	978302039	1	2001	1	1	6	33	2001-01-01 06:33:59
2804	5	978300719	1	2001	1	1	6	11	2001-01-01 06:11:59
594	4	978302268	1	2001	1	1	6	37	2001-01-01 06:37:48
919	4	978301368	1	2001	1	1	6	22	2001-01-01 06:22:48
Time taken: 0.068 seconds, Fetched: 10 row(s)

2.parse_url_tuple函数

使用方法:
hive (ruozedata_d7)> desc function extended parse_url_tuple;
OK
tab_name
parse_url_tuple(url, partname1, partname2, …, partnameN) - extracts N (N>=1) parts from a URL.
It takes a URL and one or multiple partnames, and returns a tuple. All the input parameters and output column types are string.
Partname: HOST, PATH, QUERY, REF, PROTOCOL, AUTHORITY, FILE, USERINFO, QUERY:
Note: Partnames are case-sensitive, and should not contain unnecessary white spaces.
Example:

SELECT b.* FROM src LATERAL VIEW parse_url_tuple(fullurl, ‘HOST’, ‘PATH’, ‘QUERY’, ‘QUERY:id’) b as host, path, query, query_id LIMIT 1;
SELECT parse_url_tuple(a.fullurl, ‘HOST’, ‘PATH’, ‘QUERY’, ‘REF’, ‘PROTOCOL’, ‘FILE’, ‘AUTHORITY’, ‘USERINFO’, ‘QUERY:k1’) as (ho, pa, qu, re, pr, fi, au, us, qk1) from src a;
Time taken: 0.006 seconds, Fetched: 7 row(s)

hive (default)> select parse_url_tuple("http://www.ruozedata.com/d7/xxx.html?cookieid=1234567&a=b&c=d","HOST","PATH","QUERY","QUERY:cookieid") as (host,path,query ,cookieid) from dual;
OK
host	path	query	cookieid
www.ruozedata.com	/d7/xxx.html	cookieid=1234567&a=b&c=d	1234567

可以看出,将对应的URL解析成我们想要的字段

你可能感兴趣的:(hive)