Parsing JSON data with Hive

This article walks through two small Hive demos: parsing JSON with json_tuple, and parsing URLs with parse_url_tuple.

1. The json_tuple function: a small demo of parsing JSON
[hadoop@hadoop001 jsonData]$ more rating.json
{"movie":"1193","rate":"5","time":"978300760","userid":"1"}
{"movie":"661","rate":"3","time":"978302109","userid":"1"}
{"movie":"914","rate":"3","time":"978301968","userid":"1"}
....... (many more rows) .....
hive (hwzhdb)> create table parsejson(
         > jsondata string
         > ); 
OK
Time taken: 0.146 seconds
hive (hwzhdb)> load data local inpath '/home/hadoop/data/jsonData/rating.json' overwrite into table parsejson;
Loading data to table hwzhdb.parsejson
Table hwzhdb.parsejson stats: [numFiles=1, numRows=0, totalSize=63602280, rawDataSize=0]
OK
Time taken: 1.3 seconds
hive (hwzhdb)> select * from parsejson limit 10;
OK
parsejson.jsondata
{"movie":"1193","rate":"5","time":"978300760","userid":"1"}
{"movie":"661","rate":"3","time":"978302109","userid":"1"}
{"movie":"914","rate":"3","time":"978301968","userid":"1"}
{"movie":"3408","rate":"4","time":"978300275","userid":"1"}
{"movie":"2355","rate":"5","time":"978824291","userid":"1"}
{"movie":"1197","rate":"3","time":"978302268","userid":"1"}
{"movie":"1287","rate":"5","time":"978302039","userid":"1"}
{"movie":"2804","rate":"5","time":"978300719","userid":"1"}
{"movie":"594","rate":"4","time":"978302268","userid":"1"}
{"movie":"919","rate":"4","time":"978301368","userid":"1"}
Time taken: 0.075 seconds, Fetched: 10 row(s)
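
A single key can also be read from each raw row with Hive's built-in get_json_object, which takes a JSONPath-style expression. The following is a minimal sketch against the parsejson table above; the column aliases are only illustrative.

```sql
-- Pull one field per call; each call parses the JSON string again.
SELECT get_json_object(jsondata, '$.movie') AS movieid,
       get_json_object(jsondata, '$.rate')  AS rate
FROM parsejson
LIMIT 10;
```

Because every call re-parses the string, this form is best suited to one or two keys; when several keys are needed from the same row, extracting them in a single pass is usually preferable.
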
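## Use json_tuple to parse the JSON data. Its arguments are the keys of the raw JSON, and the `as (...)` alias list is a fixed syntax for naming the output columns. Each extracted value becomes a real column, and all of them are of type string.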
hive (hwzhdb)> select json_tuple(jsondata,'movie','rate','time','userid') as (movieid,rate,time,userid) 
             > from parsejson limit 10;
OK
movieid rate    time    userid
1193    5       978300760       1
661     3       978302109       1
914     3       978301968       1
3408    4       978300275       1
2355    5       978824291       1
1197    3       978302268       1
1287    5       978302039       1
2804    5       978300719       1
594     4       978302268       1
919     4       978301368       1
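
Since json_tuple returns every value as a string, numeric use usually needs an explicit cast. The following is a hedged sketch, assuming all values in rating.json are integer-like as in the sample rows. The aliases t, ts, and rated_at are illustrative, and the LATERAL VIEW form is used so that the casts can sit in the select list.

```sql
-- json_tuple yields strings, so cast each column to the type you need.
SELECT CAST(t.movie  AS INT)                AS movieid,
       CAST(t.rate   AS INT)                AS rate,
       from_unixtime(CAST(t.ts AS BIGINT))  AS rated_at,  -- epoch seconds to 'yyyy-MM-dd HH:mm:ss'
       CAST(t.userid AS INT)                AS userid
FROM parsejson p
LATERAL VIEW json_tuple(p.jsondata, 'movie', 'rate', 'time', 'userid') t
  AS movie, rate, ts, userid
LIMIT 10;
```

A value that cannot be cast (or a key missing from the JSON) comes back as NULL rather than raising an error, so it may be worth checking for NULLs before relying on the typed columns.
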


2. The parse_url_tuple function: a small demo of parsing URLs
    [hadoop@hadoop001 jsonData]$ cat ipdata.txt      -- two rows of data
    http://www.baidu.com/dir1/xxx.html?cookieid=9999&a=1&b=d
    http://www.google.com/dir2/xxx.html?cookieid=8888&a=1&b=m
    hive (hwzhdb)> create table parseurl(
                 > url string
                 > );
    OK
    Time taken: 0.129 seconds
    hive (hwzhdb)> load data local inpath '/home/hadoop/data/jsonData/ipdata.txt' into table parseurl;
    Loading data to table hwzhdb.parseurl
    Table hwzhdb.parseurl stats: [numFiles=1, totalSize=115]
    OK
    Time taken: 0.564 seconds
    hive (hwzhdb)> select * from parseurl;
    OK
    parseurl.url
    http://www.baidu.com/dir1/xxx.html?cookieid=9999&a=1&b=d
    http://www.google.com/dir2/xxx.html?cookieid=8888&a=1&b=m
    Time taken: 0.115 seconds, Fetched: 2 row(s)
    hive (hwzhdb)> select parse_url_tuple(url,'HOST','PATH','QUERY','QUERY:cookieid','QUERY:a','QUERY:b') from parseurl;
    OK
    c0      c1      c2      c3      c4      c5
    www.baidu.com   /dir1/xxx.html  cookieid=9999&a=1&b=d   9999    1       d
    www.google.com  /dir2/xxx.html  cookieid=8888&a=1&b=m   8888    1       m
    Time taken: 0.094 seconds, Fetched: 2 row(s)
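
The result columns above are named c0 to c5 because the query gives no aliases. One way to name them, sketched here under the assumption that the same parseurl table is used, is to call parse_url_tuple through LATERAL VIEW; the table aliases p and t are illustrative.

```sql
-- Name the parsed URL parts instead of relying on the default c0..c5.
SELECT t.host, t.path, t.query, t.cookieid, t.a, t.b
FROM parseurl p
LATERAL VIEW parse_url_tuple(p.url, 'HOST', 'PATH', 'QUERY',
                             'QUERY:cookieid', 'QUERY:a', 'QUERY:b') t
  AS host, path, query, cookieid, a, b;
```

The part names ('HOST', 'PATH', 'QUERY') are matched in upper case, and 'QUERY:<key>' returns the value of a single query-string parameter.
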
    ## Create a new table, then insert the parsed data from the original table into it
    hive (hwzhdb)> create table ddd(
                 > host string,
                 > path string,
                 > query string,
                 > cookieid string,
                 > a string,
                 > b string
                 > );
    OK
    Time taken: 0.1 seconds
    hive (hwzhdb)> insert into table ddd
                 > select parse_url_tuple(url,'HOST','PATH','QUERY','QUERY:cookieid','QUERY:a','QUERY:b') from parseurl;
    hive (hwzhdb)> select * from ddd;
    OK
    ddd.host        ddd.path        ddd.query       ddd.cookieid    ddd.a   ddd.b
    www.baidu.com   /dir1/xxx.html  cookieid=9999&a=1&b=d   9999    1       d
    www.google.com  /dir2/xxx.html  cookieid=8888&a=1&b=m   8888    1       m
    Time taken: 0.098 seconds, Fetched: 2 row(s)
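
The create-then-insert steps above can probably be collapsed into one statement with CREATE TABLE ... AS SELECT. This is a sketch rather than what the transcript ran: the table name ddd_ctas is hypothetical, and the column names and types are taken from the select list (all string here).

```sql
-- Create and populate the parsed table in a single statement (CTAS).
CREATE TABLE ddd_ctas AS
SELECT t.host, t.path, t.query, t.cookieid, t.a, t.b
FROM parseurl p
LATERAL VIEW parse_url_tuple(p.url, 'HOST', 'PATH', 'QUERY',
                             'QUERY:cookieid', 'QUERY:a', 'QUERY:b') t
  AS host, path, query, cookieid, a, b;
```

The explicit create table plus insert used in the demo remains the better choice when the target table needs specific column types, partitions, or a storage format declared up front.
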
