The json_tuple function is used to process JSON data.
Usage:
json_tuple(jsonStr, p1, p2, ..., pn) - like get_json_object, but it takes multiple names and returns a tuple. All the input parameters and output column types are string.
Create a table to store the JSON data:
hive (ruozedata_d7)> create table IF NOT EXISTS rating_json(json string);
OK
Time taken: 0.023 seconds
hive (ruozedata_d7)> load data local inpath '/home/hadoop/rating.json' overwrite into table rating_json;
hive (ruozedata_d7)> select * from rating_json limit 10;
OK
rating_json.json
{"movie":"1193","rate":"5","time":"978300760","userid":"1"}
{"movie":"661","rate":"3","time":"978302109","userid":"1"}
{"movie":"914","rate":"3","time":"978301968","userid":"1"}
{"movie":"3408","rate":"4","time":"978300275","userid":"1"}
{"movie":"2355","rate":"5","time":"978824291","userid":"1"}
{"movie":"1197","rate":"3","time":"978302268","userid":"1"}
{"movie":"1287","rate":"5","time":"978302039","userid":"1"}
{"movie":"2804","rate":"5","time":"978300719","userid":"1"}
{"movie":"594","rate":"4","time":"978302268","userid":"1"}
{"movie":"919","rate":"4","time":"978301368","userid":"1"}
Time taken: 0.056 seconds, Fetched: 10 row(s)
Using json_tuple, we can split each JSON string into individual fields:
hive (ruozedata_d7)> select
> json_tuple(json,'movie','rate','time','userid') as (movie_id,rate,time,user_id)
> from rating_json limit 10;
OK
movie_id rate time user_id
1193 5 978300760 1
661 3 978302109 1
914 3 978301968 1
3408 4 978300275 1
2355 5 978824291 1
1197 3 978302268 1
1287 5 978302039 1
2804 5 978300719 1
594 4 978302268 1
919 4 978301368 1
Time taken: 0.676 seconds, Fetched: 10 row(s)
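To make the semantics concrete, here is a rough Python sketch of what json_tuple does for one row: it pulls several top-level keys out of a JSON string and returns them as a tuple of strings (the function name and behavior for missing keys are modeled on the Hive UDTF, not taken from its source).

```python
import json

def json_tuple(json_str, *names):
    """Rough Python analogue of Hive's json_tuple: extract several
    top-level keys from a JSON string. All outputs are strings;
    a missing key yields None (NULL in Hive)."""
    obj = json.loads(json_str)
    return tuple(str(obj[n]) if n in obj else None for n in names)

row = '{"movie":"1193","rate":"5","time":"978300760","userid":"1"}'
print(json_tuple(row, "movie", "rate", "time", "userid"))
# ('1193', '5', '978300760', '1')
```

Unlike repeated get_json_object calls, json_tuple parses the JSON string once per row, which is why it is preferred when extracting several fields.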
Next, use json_tuple to build a wide table that materializes every field needed downstream:
hive (ruozedata_d7)> create table rating_width as
> select
> movie_id,rate,time,user_id,
> year(from_unixtime(cast(time as bigint))) as year,
> month(from_unixtime(cast(time as bigint))) as month,
> day(from_unixtime(cast(time as bigint))) as day,
> hour(from_unixtime(cast(time as bigint))) as hour,
> minute(from_unixtime(cast(time as bigint))) as minute,
> from_unixtime(cast(time as bigint)) as ts
> from
> (
> select
> json_tuple(json,'movie','rate','time','userid') as (movie_id,rate,time,user_id)
> from rating_json
> ) tmp
> ;
Verify the columns of the new table:
hive (ruozedata_d7)> select * from rating_width limit 10;
OK
rating_width.movie_id rating_width.rate rating_width.time rating_width.user_id rating_width.year rating_width.month rating_width.day rating_width.hour rating_width.minute rating_width.ts
1193 5 978300760 1 2001 1 1 6 12 2001-01-01 06:12:40
661 3 978302109 1 2001 1 1 6 35 2001-01-01 06:35:09
914 3 978301968 1 2001 1 1 6 32 2001-01-01 06:32:48
3408 4 978300275 1 2001 1 1 6 4 2001-01-01 06:04:35
2355 5 978824291 1 2001 1 7 7 38 2001-01-07 07:38:11
1197 3 978302268 1 2001 1 1 6 37 2001-01-01 06:37:48
1287 5 978302039 1 2001 1 1 6 33 2001-01-01 06:33:59
2804 5 978300719 1 2001 1 1 6 11 2001-01-01 06:11:59
594 4 978302268 1 2001 1 1 6 37 2001-01-01 06:37:48
919 4 978301368 1 2001 1 1 6 22 2001-01-01 06:22:48
Time taken: 0.068 seconds, Fetched: 10 row(s)
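The date parts in the wide table come from from_unixtime, which converts a Unix timestamp using the Hive session timezone. The sample output (978300760 becoming 2001-01-01 06:12:40) implies a UTC+8 session zone; a Python sketch of the same derivation under that assumption:

```python
from datetime import datetime, timezone, timedelta

# Assumption: the Hive session timezone is UTC+8 (CST), inferred from
# the sample output above; from_unixtime follows the session/server zone.
CST = timezone(timedelta(hours=8))

def derive_parts(unix_ts):
    """Mirror the CTAS expressions: year/month/day/hour/minute plus a
    formatted timestamp, all derived from a Unix-epoch-seconds string."""
    dt = datetime.fromtimestamp(int(unix_ts), tz=CST)
    return (dt.year, dt.month, dt.day, dt.hour, dt.minute,
            dt.strftime("%Y-%m-%d %H:%M:%S"))

print(derive_parts("978300760"))
# (2001, 1, 1, 6, 12, '2001-01-01 06:12:40')
```

Note that `time` comes out of json_tuple as a string, which is why the HiveQL casts it to bigint before calling from_unixtime.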
The parse_url_tuple function extracts parts from a URL. Usage:
hive (ruozedata_d7)> desc function extended parse_url_tuple;
OK
tab_name
parse_url_tuple(url, partname1, partname2, ..., partnameN) - extracts N (N>=1) parts from a URL.
It takes a URL and one or multiple partnames, and returns a tuple. All the input parameters and output column types are string.
Partname: HOST, PATH, QUERY, REF, PROTOCOL, AUTHORITY, FILE, USERINFO, QUERY:&lt;KEY_NAME&gt;
Note: Partnames are case-sensitive, and should not contain unnecessary white spaces.
Example:
SELECT b.* FROM src LATERAL VIEW parse_url_tuple(fullurl, 'HOST', 'PATH', 'QUERY', 'QUERY:id') b as host, path, query, query_id LIMIT 1;
SELECT parse_url_tuple(a.fullurl, 'HOST', 'PATH', 'QUERY', 'REF', 'PROTOCOL', 'FILE', 'AUTHORITY', 'USERINFO', 'QUERY:k1') as (ho, pa, qu, re, pr, fi, au, us, qk1) from src a;
Time taken: 0.006 seconds, Fetched: 7 row(s)
hive (default)> select parse_url_tuple("http://www.ruozedata.com/d7/xxx.html?cookieid=1234567&a=b&c=d","HOST","PATH","QUERY","QUERY:cookieid") as (host, path, query, cookieid) from dual;
OK
host path query cookieid
www.ruozedata.com /d7/xxx.html cookieid=1234567&a=b&c=d 1234567
As shown, parse_url_tuple splits the URL into exactly the fields we want.
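For reference, the same extraction can be sketched in Python with the standard-library urllib.parse module, which splits a URL into the same HOST/PATH/QUERY parts and lets us pull a single query parameter the way QUERY:cookieid does:

```python
from urllib.parse import urlparse, parse_qs

url = "http://www.ruozedata.com/d7/xxx.html?cookieid=1234567&a=b&c=d"

# urlparse splits the URL into components analogous to HOST/PATH/QUERY.
parts = urlparse(url)
host, path, query = parts.netloc, parts.path, parts.query

# parse_qs maps each parameter to a list of values; take the first,
# mirroring QUERY:cookieid.
cookieid = parse_qs(query).get("cookieid", [None])[0]

print(host, path, query, cookieid)
# www.ruozedata.com /d7/xxx.html cookieid=1234567&a=b&c=d 1234567
```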