Create a table with a single string column to hold the JSON data, and load data of the following form into it:
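A minimal sketch of the table creation and load; the local file path is an assumed placeholder, so point it at wherever the ratings JSON file actually lives:

create table rating_json(json string);
-- the path below is an assumption; adjust to the actual location of the ratings file
load data local inpath '/home/hadoop/data/rating.json' overwrite into table rating_json;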
hive (d1_hive)> select * from rating_json limit 10;
OK
rating_json.json
{"movie":"1193","rate":"5","time":"978300760","userid":"1"}
{"movie":"661","rate":"3","time":"978302109","userid":"1"}
{"movie":"914","rate":"3","time":"978301968","userid":"1"}
{"movie":"3408","rate":"4","time":"978300275","userid":"1"}
{"movie":"2355","rate":"5","time":"978824291","userid":"1"}
{"movie":"1197","rate":"3","time":"978302268","userid":"1"}
{"movie":"1287","rate":"5","time":"978302039","userid":"1"}
{"movie":"2804","rate":"5","time":"978300719","userid":"1"}
{"movie":"594","rate":"4","time":"978302268","userid":"1"}
{"movie":"919","rate":"4","time":"978301368","userid":"1"}
Time taken: 0.311 seconds, Fetched: 10 row(s)
hive (d1_hive)>
How do we split the JSON into separate fields? The json_tuple function handles this easily.
hive (d1_hive)> select json_tuple(json,'movie','rate','time','userid') as (movie_id,rate,time,user_id) from rating_json limit 10;
OK
movie_id rate time user_id
1193 5 978300760 1
661 3 978302109 1
914 3 978301968 1
3408 4 978300275 1
2355 5 978824291 1
1197 3 978302268 1
1287 5 978302039 1
2804 5 978300719 1
594 4 978302268 1
919 4 978301368 1
Time taken: 0.121 seconds, Fetched: 10 row(s)
hive (d1_hive)>
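If only one or two fields are needed, the built-in get_json_object function is an alternative; a minimal sketch that pulls out just the movie id:

select get_json_object(json, '$.movie') as movie_id from rating_json limit 10;

json_tuple is generally preferred when several fields are extracted, since it parses each JSON string only once per row.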
Building on the example above, in production the time field is usually processed further into a readable timestamp plus year/month/day columns, producing a wide table for later use.
hive (d1_hive)> select movie_id,rate,user_id,
> from_unixtime(cast(time as BIGINT),'yyyy-MM-dd HH:mm:ss') as time,
> year(from_unixtime(cast(time as BIGINT),'yyyy-MM-dd HH:mm:ss')) as year,
> month(from_unixtime(cast(time as BIGINT),'yyyy-MM-dd HH:mm:ss')) as month,
> day(from_unixtime(cast(time as BIGINT),'yyyy-MM-dd HH:mm:ss')) as day
> from (
> select json_tuple(json,'movie','rate','time','userid') as (movie_id,rate,time,user_id) from rating_json
> ) t limit 10;
OK
movie_id rate user_id time year month day
1193 5 1 2001-01-01 06:12:40 2001 1 1
661 3 1 2001-01-01 06:35:09 2001 1 1
914 3 1 2001-01-01 06:32:48 2001 1 1
3408 4 1 2001-01-01 06:04:35 2001 1 1
2355 5 1 2001-01-07 07:38:11 2001 1 7
1197 3 1 2001-01-01 06:37:48 2001 1 1
1287 5 1 2001-01-01 06:33:59 2001 1 1
2804 5 1 2001-01-01 06:11:59 2001 1 1
594 4 1 2001-01-01 06:37:48 2001 1 1
919 4 1 2001-01-01 06:22:48 2001 1 1
Time taken: 0.176 seconds, Fetched: 10 row(s)
hive (d1_hive)>
# The wide table stores the parsed results; all subsequent statistical analysis is based on it
hive (d1_hive)> create table rate_movie as
> select movie_id,rate,user_id,
> from_unixtime(cast(time as BIGINT),'yyyy-MM-dd HH:mm:ss') as time,
> year(from_unixtime(cast(time as BIGINT),'yyyy-MM-dd HH:mm:ss')) as year,
> month(from_unixtime(cast(time as BIGINT),'yyyy-MM-dd HH:mm:ss')) as month,
> day(from_unixtime(cast(time as BIGINT),'yyyy-MM-dd HH:mm:ss')) as day
> from (
> select json_tuple(json,'movie','rate','time','userid') as (movie_id,rate,time,user_id) from rating_json
> ) t;
Query ID = hadoop_20190728233636_919dc0ee-61ea-4bdf-a725-2c87f43def12
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1564328140990_0002, Tracking URL = http://localhost:4044/proxy/application_1564328140990_0002/
Kill Command = /home/hadoop/app/hadoop-2.6.0-cdh5.7.0/bin/hadoop job -kill job_1564328140990_0002
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2019-07-28 23:41:05,630 Stage-1 map = 0%, reduce = 0%
2019-07-28 23:41:25,706 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 16.24 sec
MapReduce Total cumulative CPU time: 16 seconds 240 msec
Ended Job = job_1564328140990_0002
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://hadoop001:9000/user/hive/warehouse/d1_hive.db/.hive-staging_hive_2019-07-28_23-40-57_676_3047408219089077111-1/-ext-10001
Moving data to: hdfs://hadoop001:9000/user/hive/warehouse/d1_hive.db/rate_movie
Table d1_hive.rate_movie stats: [numFiles=1, numRows=1000209, totalSize=41695855, rawDataSize=40695646]
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Cumulative CPU: 16.24 sec HDFS Read: 63606529 HDFS Write: 41695940 SUCCESS
Total MapReduce CPU Time Spent: 16 seconds 240 msec
OK
movie_id rate user_id time year month day
Time taken: 30.415 seconds
hive (d1_hive)> select * from rate_movie limit 10;
OK
rate_movie.movie_id rate_movie.rate rate_movie.user_id rate_movie.time rate_movie.year rate_movie.month rate_movie.day
1193 5 1 2001-01-01 06:12:40 2001 1 1
661 3 1 2001-01-01 06:35:09 2001 1 1
914 3 1 2001-01-01 06:32:48 2001 1 1
3408 4 1 2001-01-01 06:04:35 2001 1 1
2355 5 1 2001-01-07 07:38:11 2001 1 7
1197 3 1 2001-01-01 06:37:48 2001 1 1
1287 5 1 2001-01-01 06:33:59 2001 1 1
2804 5 1 2001-01-01 06:11:59 2001 1 1
594 4 1 2001-01-01 06:37:48 2001 1 1
919 4 1 2001-01-01 06:22:48 2001 1 1
Time taken: 0.089 seconds, Fetched: 10 row(s)
hive (d1_hive)>
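As a sketch of the kind of downstream analysis the wide table makes trivial (column names taken from the rate_movie table created above), the average rating and rating count per month can be computed without any further JSON parsing:

select year, month, avg(cast(rate as int)) as avg_rate, count(*) as cnt
from rate_movie
group by year, month
order by year, month;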
Create a table with a single string column to hold the URL data, and load data of the following form into it:
hive (d1_hive)> create table url(url string);
OK
Time taken: 0.082 seconds
hive (d1_hive)> load data local inpath '/home/hadoop/data/url.txt' overwrite into table url;
Loading data to table d1_hive.url
Table d1_hive.url stats: [numFiles=1, numRows=0, totalSize=61, rawDataSize=0]
OK
Time taken: 0.333 seconds
hive (d1_hive)> select * from url;
OK
url.url
http://www.ruozedata.com/d7/xxx.html?cookieid=1234567&a=b&c=d
Time taken: 0.084 seconds, Fetched: 1 row(s)
hive (d1_hive)>
How do we split the URL into its components? The parse_url_tuple function handles this easily.
hive (d1_hive)> select parse_url_tuple(url, 'HOST', 'PATH', 'QUERY', 'QUERY:cookieid','QUERY:a') as (host,path,query,cookie_id,a) from url;
OK
host path query cookie_id a
www.ruozedata.com /d7/xxx.html cookieid=1234567&a=b&c=d 1234567 b
Time taken: 0.089 seconds, Fetched: 1 row(s)
hive (d1_hive)>
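When only a single component is needed, the scalar parse_url function is sufficient; a minimal sketch that extracts just the cookieid query parameter:

select parse_url(url, 'QUERY', 'cookieid') as cookie_id from url;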
Using the wide table parsed from the JSON above, find each user's three highest-rated movies. The key is the analytic function ROW_NUMBER, which ranks the rows within each partition.
hive (d1_hive)> select user_id,movie_id,rate,time from
> (
> select user_id,movie_id,rate,time,row_number() over(partition by user_id order by rate desc) as r from rate_movie
> ) t where t.r<=3 limit 10;
Query ID = hadoop_20190728233636_919dc0ee-61ea-4bdf-a725-2c87f43def12
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1564328140990_0008, Tracking URL = http://localhost:4044/proxy/application_1564328140990_0008/
Kill Command = /home/hadoop/app/hadoop-2.6.0-cdh5.7.0/bin/hadoop job -kill job_1564328140990_0008
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2019-07-29 00:13:11,504 Stage-1 map = 0%, reduce = 0%
2019-07-29 00:13:23,046 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 7.28 sec
2019-07-29 00:13:33,574 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 12.01 sec
MapReduce Total cumulative CPU time: 12 seconds 10 msec
Ended Job = job_1564328140990_0008
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 12.01 sec HDFS Read: 41704880 HDFS Write: 295 SUCCESS
Total MapReduce CPU Time Spent: 12 seconds 10 msec
OK
user_id movie_id rate time
1 1035 5 2001-01-01 06:29:13
1 1 5 2001-01-07 07:37:48
1 2028 5 2001-01-01 06:26:59
10 3591 5 2000-12-31 10:08:45
10 3809 5 2000-12-31 10:07:31
10 954 5 2000-12-31 09:25:22
100 800 5 2000-12-24 01:51:55
100 527 5 2000-12-24 02:07:19
100 919 5 2000-12-24 02:09:07
1000 2571 5 2000-11-24 12:46:50
Time taken: 30.207 seconds, Fetched: 10 row(s)
hive (d1_hive)>
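Note that row_number breaks ties between equal ratings arbitrarily, so which three 5-star movies show up for a given user may vary between runs. A sketch that adds a tiebreaker (most recent rating first) to make the top 3 deterministic:

select user_id, movie_id, rate, time from
(
  select user_id, movie_id, rate, time,
         row_number() over(partition by user_id order by rate desc, time desc) as r
  from rate_movie
) t where t.r <= 3;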
Before connecting with beeline, HiveServer2 must be running. HiveServer2 (HS2) is a service that lets clients execute queries against Hive; only once it is started can JDBC/beeline-style clients connect.
# Start HS2
[hadoop@10-9-15-140 bin]$ nohup ./hiveserver2 >> /tmp/hs2.log 2>&1 &
[1] 10556
[hadoop@10-9-15-140 bin]$ tail -F /tmp/hs2.log
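Before connecting, it can help to confirm that HS2 is actually listening; a minimal check, assuming the default port 10000 has not been changed:

# verify HS2 is listening on its default port
netstat -nltp | grep 10000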
# Connect with beeline; the default port is 10000
[hadoop@10-9-15-140 bin]$ ./beeline -u jdbc:hive2://hadoop001:10000/d1_hive -n hadoop
which: no hbase in (/home/hadoop/app/hive-1.1.0-cdh5.7.0/bin:/home/hadoop/app/protobuf-2.5.0/bin:/home/hadoop/app/apache-maven-3.3.9/bin:/home/hadoop/app/hadoop-2.6.0-cdh5.7.0/bin:/usr/java/jdk1.7.0_80/bin:/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/home/hadoop/bin)
scan complete in 4ms
Connecting to jdbc:hive2://hadoop001:10000/d1_hive
Connected to: Apache Hive (version 1.1.0-cdh5.7.0)
Driver: Hive JDBC (version 1.1.0-cdh5.7.0)
Transaction isolation: TRANSACTION_REPEATABLE_READ
Beeline version 1.1.0-cdh5.7.0 by Apache Hive
0: jdbc:hive2://hadoop001:10000/d1_hive> show databases;
+----------------+--+
| database_name |
+----------------+--+
| d1_hive |
| default |
+----------------+--+
2 rows selected (1.241 seconds)
INFO : Compiling command(queryId=hadoop_20190729002121_95a9d614-d5df-49f1-85ee-ea0ec2b66ba4): show databases
INFO : Semantic Analysis Completed
INFO : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:database_name, type:string, comment:from deserializer)], properties:null)
INFO : Completed compiling command(queryId=hadoop_20190729002121_95a9d614-d5df-49f1-85ee-ea0ec2b66ba4); Time taken: 0.818 seconds
INFO : Concurrency mode is disabled, not creating a lock manager
INFO : Executing command(queryId=hadoop_20190729002121_95a9d614-d5df-49f1-85ee-ea0ec2b66ba4): show databases
INFO : Starting task [Stage-0:DDL] in serial mode
INFO : Completed executing command(queryId=hadoop_20190729002121_95a9d614-d5df-49f1-85ee-ea0ec2b66ba4); Time taken: 0.064 seconds
INFO : OK
0: jdbc:hive2://hadoop001:10000/d1_hive>
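To end the session, beeline's !quit command exits the client, and the HS2 process started with nohup can be stopped by killing its PID (10556 in this run):

# inside beeline: exit the client
!quit
# back in the shell: stop the background HS2 process (PID from the nohup output above)
kill 10556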