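This article walks through two small demos of Hive's built-in parsing functions: json_tuple for JSON data and parse_url_tuple for URLs.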
1. The json_tuple function: a small demo that parses JSON data
[hadoop@hadoop001 jsonData]$ more rating.json
{"movie":"1193","rate":"5","time":"978300760","userid":"1"}
{"movie":"661","rate":"3","time":"978302109","userid":"1"}
{"movie":"914","rate":"3","time":"978301968","userid":"1"}
....... (many more rows) .....
hive (hwzhdb)> create table parsejson(
> jsondata string
> );
OK
Time taken: 0.146 seconds
hive (hwzhdb)> load data local inpath '/home/hadoop/data/jsonData/rating.json' overwrite into table parsejson;
Loading data to table hwzhdb.parsejson
Table hwzhdb.parsejson stats: [numFiles=1, numRows=0, totalSize=63602280, rawDataSize=0]
OK
Time taken: 1.3 seconds
hive (hwzhdb)> select * from parsejson limit 10;
OK
parsejson.jsondata
{"movie":"1193","rate":"5","time":"978300760","userid":"1"}
{"movie":"661","rate":"3","time":"978302109","userid":"1"}
{"movie":"914","rate":"3","time":"978301968","userid":"1"}
{"movie":"3408","rate":"4","time":"978300275","userid":"1"}
{"movie":"2355","rate":"5","time":"978824291","userid":"1"}
{"movie":"1197","rate":"3","time":"978302268","userid":"1"}
{"movie":"1287","rate":"5","time":"978302039","userid":"1"}
{"movie":"2804","rate":"5","time":"978300719","userid":"1"}
{"movie":"594","rate":"4","time":"978302268","userid":"1"}
{"movie":"919","rate":"4","time":"978301368","userid":"1"}
Time taken: 0.075 seconds, Fetched: 10 row(s)
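## Use json_tuple to parse the JSON data. After the column name, pass the keys of the raw JSON as arguments; to name the output columns, the alias list must be written in the form as (col1, col2, ...). The values come back as real columns, and all of them are of type string.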
hive (hwzhdb)> select json_tuple(jsondata,'movie','rate','time','userid') as (movieid,rate,time,userid)
> from parsejson limit 10;
OK
movieid rate time userid
1193 5 978300760 1
661 3 978302109 1
914 3 978301968 1
3408 4 978300275 1
2355 5 978824291 1
1197 3 978302268 1
1287 5 978302039 1
2804 5 978300719 1
594 4 978302268 1
919 4 978301368 1
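Because json_tuple returns every value as a string, a query that needs numeric or timestamp columns has to cast them explicitly. The following is a minimal sketch, assuming the parsejson table above; the aliases p, j, ts and rated_at are illustrative names that do not appear in the original demo. It uses LATERAL VIEW so that the extracted fields can be referenced and cast in the select list.

```sql
-- Sketch: typed columns on top of json_tuple (alias names are illustrative).
select
  cast(j.movieid as int)              as movieid,
  cast(j.rate as int)                 as rate,
  cast(j.ts as bigint)                as ts,        -- epoch seconds
  from_unixtime(cast(j.ts as bigint)) as rated_at,  -- formatted in the session time zone
  cast(j.userid as int)               as userid
from parsejson p
lateral view json_tuple(p.jsondata, 'movie', 'rate', 'time', 'userid') j
  as movieid, rate, ts, userid
limit 10;
```

If a key is missing from a row, json_tuple yields NULL for that column, and the cast of NULL stays NULL.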
2. The parse_url_tuple function: a small demo that parses URLs
[hadoop@hadoop001 jsonData]$ cat ipdata.txt   # two rows of data
http://www.baidu.com/dir1/xxx.html?cookieid=9999&a=1&b=d
http://www.google.com/dir2/xxx.html?cookieid=8888&a=1&b=m
hive (hwzhdb)> create table parseurl(
> url string
> );
OK
Time taken: 0.129 seconds
hive (hwzhdb)> load data local inpath '/home/hadoop/data/jsonData/ipdata.txt' into table parseurl;
Loading data to table hwzhdb.parseurl
Table hwzhdb.parseurl stats: [numFiles=1, totalSize=115]
OK
Time taken: 0.564 seconds
hive (hwzhdb)> select * from parseurl;
OK
parseurl.url
http://www.baidu.com/dir1/xxx.html?cookieid=9999&a=1&b=d
http://www.google.com/dir2/xxx.html?cookieid=8888&a=1&b=m
Time taken: 0.115 seconds, Fetched: 2 row(s)
hive (hwzhdb)> select parse_url_tuple(url,'HOST','PATH','QUERY','QUERY:cookieid','QUERY:a','QUERY:b') from parseurl;
OK
c0 c1 c2 c3 c4 c5
www.baidu.com /dir1/xxx.html cookieid=9999&a=1&b=d 9999 1 d
www.google.com /dir2/xxx.html cookieid=8888&a=1&b=m 8888 1 m
Time taken: 0.094 seconds, Fetched: 2 row(s)
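Without an alias list the output columns get the default names c0 through c5. As a sketch, assuming the same parseurl table, the same as (...) form used with json_tuple can name them. When only one or two parts are needed, the scalar parse_url function is an alternative. The column aliases below are illustrative.

```sql
-- Sketch 1: name the parse_url_tuple output columns instead of c0..c5.
select parse_url_tuple(url, 'HOST', 'PATH', 'QUERY', 'QUERY:cookieid', 'QUERY:a', 'QUERY:b')
       as (host, path, query, cookieid, a, b)
from parseurl;

-- Sketch 2: extract single parts with the scalar parse_url function.
select parse_url(url, 'HOST')              as host,
       parse_url(url, 'QUERY', 'cookieid') as cookieid
from parseurl;
```

For the two sample rows, the second query should return the same values as the c0 and c3 columns above.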
## Create a new table, then insert the parsed data from the original table into it
hive (hwzhdb)> create table ddd(
> host string,
> path string,
> query string,
> cookieid string,
> a string,
> b string
> );
OK
Time taken: 0.1 seconds
hive (hwzhdb)> insert into table ddd
> select parse_url_tuple(url,'HOST','PATH','QUERY','QUERY:cookieid','QUERY:a','QUERY:b') from parseurl;
hive (hwzhdb)> select * from ddd;
OK
ddd.host ddd.path ddd.query ddd.cookieid ddd.a ddd.b
www.baidu.com /dir1/xxx.html cookieid=9999&a=1&b=d 9999 1 d
www.google.com /dir2/xxx.html cookieid=8888&a=1&b=m 8888 1 m
Time taken: 0.098 seconds, Fetched: 2 row(s)
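The create-then-insert steps can also be collapsed into a single create table ... as select statement. This is a sketch of that alternative, not the approach used above; the table name ddd_ctas and the aliases p and t are hypothetical. With this form the column names come from the LATERAL VIEW alias list, so no separate column definition is needed.

```sql
-- Sketch: build the parsed table in one step (ddd_ctas is a hypothetical name).
create table ddd_ctas as
select t.host, t.path, t.query, t.cookieid, t.a, t.b
from parseurl p
lateral view parse_url_tuple(p.url, 'HOST', 'PATH', 'QUERY', 'QUERY:cookieid', 'QUERY:a', 'QUERY:b') t
  as host, path, query, cookieid, a, b;
```

All six columns are created as string, matching the explicit definition of ddd.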