Purpose of the function (json_tuple): parse multiple fields out of a JSON string in a single call.
hive (default)> create table rating_json(json string);
> load data local inpath '/home/hadoop/data/rating.json' overwrite into table rating_json;  // load the data
hive (default)> select * from rating_json limit 10;
OK
rating_json.json
{"movie":"1193","rate":"5","time":"978300760","userid":"1"}
{"movie":"661","rate":"3","time":"978302109","userid":"1"}
{"movie":"914","rate":"3","time":"978301968","userid":"1"}
{"movie":"3408","rate":"4","time":"978300275","userid":"1"}
{"movie":"2355","rate":"5","time":"978824291","userid":"1"}
{"movie":"1197","rate":"3","time":"978302268","userid":"1"}
{"movie":"1287","rate":"5","time":"978302039","userid":"1"}
{"movie":"2804","rate":"5","time":"978300719","userid":"1"}
{"movie":"594","rate":"4","time":"978302268","userid":"1"}
{"movie":"919","rate":"4","time":"978301368","userid":"1"}
Time taken: 0.274 seconds, Fetched: 10 row(s)
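For pulling out a single field, Hive also provides the built-in get_json_object, which takes a JSONPath expression; the query below is a minimal sketch of that alternative on the same table (output omitted here):

-- get_json_object extracts one field per call via a JSONPath expression
select get_json_object(json, '$.movie') as movie,
       get_json_object(json, '$.rate')  as rate
from rating_json
limit 10;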
hive (default)> select json_tuple(json,"movie","rate","time","userid") as (moveid,rate,time,userid) from rating_json limit 10;  // use json_tuple to parse the JSON fields and rename the resulting columns
OK
moveid rate time userid
1193 5 978300760 1
661 3 978302109 1
914 3 978301968 1
3408 4 978300275 1
2355 5 978824291 1
1197 3 978302268 1
1287 5 978302039 1
2804 5 978300719 1
594 4 978302268 1
919 4 978301368 1
Time taken: 0.046 seconds, Fetched: 10 row(s)
hive (default)>
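json_tuple is a UDTF, so it cannot be mixed directly with other columns in the SELECT list; the usual workaround is LATERAL VIEW. The statement below is a sketch of that pattern on the same rating_json table (column aliases mirror the ones above; output omitted):

-- expose the parsed JSON fields as regular columns via LATERAL VIEW
select t.moveid, t.rate, t.time, t.userid
from rating_json
lateral view json_tuple(json, 'movie', 'rate', 'time', 'userid') t as moveid, rate, time, userid
limit 10;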
Next comes data cleansing:
raw (source data) ==> wide table: prepare, up front, every field the downstream analysis will need.
For example: userid, movie, rate, time, year, month, day, hour, minute, ts (yyyy-MM-dd HH:mm:ss).
The first four fields are already there; all the remaining fields still need to be derived.
cast(time as bigint)  // type-conversion function; here it converts a string to a bigint
unix_timestamp('2019-07-21 12:21:21.645')  // converts a date-time string to a numeric Unix timestamp (seconds)
from_unixtime(1563682881)  // converts a Unix timestamp back to a string: 2019-07-21 12:21:21
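As a quick sanity check, the three conversions can be tried on literal values; this is only a sketch (selecting against rating_json just to get one row), and the exact strings returned depend on the session time zone:

select cast('978300760' as bigint)                as time_as_bigint,   -- string -> bigint
       from_unixtime(cast('978300760' as bigint)) as readable_ts,      -- seconds -> 'yyyy-MM-dd HH:mm:ss'
       unix_timestamp('2019-07-21 12:21:21')      as back_to_seconds   -- string -> seconds
from rating_json
limit 1;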
hive (default)> select moveid,rate,time,userid,
> year(from_unixtime(cast(time as bigint))) as year,
> month(from_unixtime(cast(time as bigint))) as month,
> day(from_unixtime(cast(time as bigint))) as day,
> hour(from_unixtime(cast(time as bigint))) as hour,
> minute(from_unixtime(cast(time as bigint))) as minute,
> from_unixtime(cast(time as bigint)) as ts
> from
> (select json_tuple(json,"movie","rate","time","userid") as (moveid,rate,time,userid)
> from rating_json
> ) tmp
> limit 10;
OK
moveid rate time userid year month day hour minute ts
1193 5 978300760 1 2001 1 1 6 12 2001-01-01 06:12:40
661 3 978302109 1 2001 1 1 6 35 2001-01-01 06:35:09
914 3 978301968 1 2001 1 1 6 32 2001-01-01 06:32:48
3408 4 978300275 1 2001 1 1 6 4 2001-01-01 06:04:35
2355 5 978824291 1 2001 1 7 7 38 2001-01-07 07:38:11
1197 3 978302268 1 2001 1 1 6 37 2001-01-01 06:37:48
1287 5 978302039 1 2001 1 1 6 33 2001-01-01 06:33:59
2804 5 978300719 1 2001 1 1 6 11 2001-01-01 06:11:59
594 4 978302268 1 2001 1 1 6 37 2001-01-01 06:37:48
919 4 978301368 1 2001 1 1 6 22 2001-01-01 06:22:48
Time taken: 0.726 seconds, Fetched: 10 row(s)
hive (default)>
Finally, create a wide table named rating_width; all subsequent statistics and analysis are run against this rating_width table.
hive (default)> create table rating_width as
> select moveid,rate,time,userid,
> year(from_unixtime(cast(time as bigint))) as year,
> month(from_unixtime(cast(time as bigint))) as month,
> day(from_unixtime(cast(time as bigint))) as day,
> hour(from_unixtime(cast(time as bigint))) as hour,
> minute(from_unixtime(cast(time as bigint))) as minute,
> from_unixtime(cast(time as bigint)) as ts
> from
> (select json_tuple(json,"movie","rate","time","userid") as (moveid,rate,time,userid)
> from rating_json
> ) tmp
> ;
Query ID = hadoop_20190721123434_a60f5f15-7479-444d-94a9-dac0a7f61a99
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1562553101223_0024, Tracking URL = http://hadoop001:8078/proxy/application_1562553101223_0024/
Kill Command = /home/hadoop/app/hadoop/bin/hadoop job -kill job_1562553101223_0024
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2019-07-21 12:39:31,312 Stage-1 map = 0%, reduce = 0%
2019-07-21 12:39:46,743 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 15.98 sec
MapReduce Total cumulative CPU time: 15 seconds 980 msec
Ended Job = job_1562553101223_0024
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://hadoop001:9000/user/hive/warehouse/.hive-staging_hive_2019-07-21_12-39-26_522_1513098922642793536-1/-ext-10001
Moving data to: hdfs://hadoop001:9000/user/hive/warehouse/rating_width
Table default.rating_width stats: [numFiles=1, numRows=1000209, totalSize=57005699, rawDataSize=56005490]
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Cumulative CPU: 15.98 sec HDFS Read: 63606728 HDFS Write: 57005786 SUCCESS
Total MapReduce CPU Time Spent: 15 seconds 980 msec
OK
moveid rate time userid year month day hour minute ts
Time taken: 21.583 seconds
hive (default)> select * from rating_width limit 5;
OK
rating_width.moveid rating_width.rate rating_width.time rating_width.userid rating_width.year rating_width.month rating_width.day rating_width.hour rating_width.minute rating_width.ts
1193 5 978300760 1 2001 1 1 6 12 2001-01-01 06:12:40
661 3 978302109 1 2001 1 1 6 35 2001-01-01 06:35:09
914 3 978301968 1 2001 1 1 6 32 2001-01-01 06:32:48
3408 4 978300275 1 2001 1 1 6 4 2001-01-01 06:04:35
2355 5 978824291 1 2001 1 7 7 38 2001-01-07 07:38:11
Time taken: 0.038 seconds, Fetched: 5 row(s)
hive (default)>
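With the wide table in place, the downstream statistics become plain GROUP BY queries. As an illustration (not from the original post), a sketch of "rating count and average rating per movie" over rating_width might look like this:

select moveid,
       count(*)                         as cnt,
       round(avg(cast(rate as int)), 2) as avg_rate   -- rate is stored as a string, so cast before averaging
from rating_width
group by moveid
order by cnt desc
limit 10;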
parse_url_tuple: a Hive built-in function for parsing a URL string.
Usage: parse_url_tuple(url, partname1, partname2, ..., partnameN)
For example: http://www.ruozedata.com/d7/xxx.html?cookieid=1234567&a=b&c=d
can be parsed into the HOST, PATH, QUERY, and cookieid fields.
hive (default)> select parse_url_tuple("http://www.ruozedata.com/d7/xxx.html?cookieid=1234567&a=b&c=d","HOST","PATH","QUERY","QUERY:cookieid") as (host,path,query ,cookieid) from dual;
OK
host path query cookieid
www.ruozedata.com /d7/xxx.html cookieid=1234567&a=b&c=d 1234567
Time taken: 0.04 seconds, Fetched: 1 row(s)
hive (default)>
If, instead of cookieid, you want the value of query parameter a, you can do the following:
hive (default)> select parse_url_tuple("http://www.ruozedata.com/d7/xxx.html?cookieid=1234567&a=b&c=d","HOST","PATH","QUERY","QUERY:a") as (host,path,query,a) from dual;
OK
host path query a
www.ruozedata.com /d7/xxx.html cookieid=1234567&a=b&c=d b
Time taken: 0.046 seconds, Fetched: 1 row(s)
hive (default)>
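Like json_tuple, parse_url_tuple is a UDTF, so against a real log table it is usually applied through LATERAL VIEW. The sketch below assumes a hypothetical access_log table with a url string column (both names are illustrative, not from this post):

select t.host, t.path, t.cookieid
from access_log   -- hypothetical log table with a url column
lateral view parse_url_tuple(url, 'HOST', 'PATH', 'QUERY:cookieid') t as host, path, cookieid
limit 10;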