Hive (4): Functions (json_tuple and parse_url_tuple) / A Generic topN Solution / Beeline Connection

json_tuple

Create a table with a single string-typed column to hold the JSON data, and load the following data into it:
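
The DDL and load might look like the sketch below; the column name json matches the rating_json.json header in the output that follows, but the local file path is only an assumption:

-- the data path is an assumption; point it at wherever the ratings file lives
create table rating_json(json string);
load data local inpath '/home/hadoop/data/rating.json' overwrite into table rating_json;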

hive (d1_hive)> select * from rating_json limit 10;
OK
rating_json.json
{"movie":"1193","rate":"5","time":"978300760","userid":"1"}
{"movie":"661","rate":"3","time":"978302109","userid":"1"}
{"movie":"914","rate":"3","time":"978301968","userid":"1"}
{"movie":"3408","rate":"4","time":"978300275","userid":"1"}
{"movie":"2355","rate":"5","time":"978824291","userid":"1"}
{"movie":"1197","rate":"3","time":"978302268","userid":"1"}
{"movie":"1287","rate":"5","time":"978302039","userid":"1"}
{"movie":"2804","rate":"5","time":"978300719","userid":"1"}
{"movie":"594","rate":"4","time":"978302268","userid":"1"}
{"movie":"919","rate":"4","time":"978301368","userid":"1"}
Time taken: 0.311 seconds, Fetched: 10 row(s)
hive (d1_hive)>

How do we split the JSON fields out? The json_tuple function makes this easy.

hive (d1_hive)> select json_tuple(json,'movie','rate','time','userid') as (movie_id,rate,time,user_id)  from rating_json limit 10;
OK
movie_id	rate	time	user_id
1193	5	978300760	1
661	3	978302109	1
914	3	978301968	1
3408	4	978300275	1
2355	5	978824291	1
1197	3	978302268	1
1287	5	978302039	1
2804	5	978300719	1
594	4	978302268	1
919	4	978301368	1
Time taken: 0.121 seconds, Fetched: 10 row(s)
hive (d1_hive)>
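
Note that json_tuple is a UDTF, so it cannot be mixed with ordinary columns in the same SELECT list. When the parsed fields are needed alongside other columns, go through LATERAL VIEW; a minimal sketch, equivalent to the query above:

select t.movie_id, t.rate, t.user_id
from rating_json
lateral view json_tuple(json, 'movie', 'rate', 'userid') t as movie_id, rate, user_id;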

Building on the example above: in production the time field is usually processed further into a timestamp / year / month / day, producing a wide table that subsequent work is based on.

hive (d1_hive)> select movie_id,rate,user_id,
              > from_unixtime(cast(time as BIGINT),'yyyy-MM-dd HH:mm:ss') as time,
              > year(from_unixtime(cast(time as BIGINT),'yyyy-MM-dd HH:mm:ss')) as year,
              > month(from_unixtime(cast(time as BIGINT),'yyyy-MM-dd HH:mm:ss')) as month,
              > day(from_unixtime(cast(time as BIGINT),'yyyy-MM-dd HH:mm:ss')) as day
              > from (
              > select json_tuple(json,'movie','rate','time','userid') as (movie_id,rate,time,user_id)  from rating_json
              > ) t limit 10;
OK
movie_id	rate	user_id	time	year	month	day
1193	5	1	2001-01-01 06:12:40	2001	1	1
661	3	1	2001-01-01 06:35:09	2001	1	1
914	3	1	2001-01-01 06:32:48	2001	1	1
3408	4	1	2001-01-01 06:04:35	2001	1	1
2355	5	1	2001-01-07 07:38:11	2001	1	7
1197	3	1	2001-01-01 06:37:48	2001	1	1
1287	5	1	2001-01-01 06:33:59	2001	1	1
2804	5	1	2001-01-01 06:11:59	2001	1	1
594	4	1	2001-01-01 06:37:48	2001	1	1
919	4	1	2001-01-01 06:22:48	2001	1	1
Time taken: 0.176 seconds, Fetched: 10 row(s)
hive (d1_hive)>

# The wide table stores the parsed result; subsequent statistics and analysis are all done against it
hive (d1_hive)> create table rate_movie as
              > select movie_id,rate,user_id,
              > from_unixtime(cast(time as BIGINT),'yyyy-MM-dd HH:mm:ss') as time,
              > year(from_unixtime(cast(time as BIGINT),'yyyy-MM-dd HH:mm:ss')) as year,
              > month(from_unixtime(cast(time as BIGINT),'yyyy-MM-dd HH:mm:ss')) as month,
              > day(from_unixtime(cast(time as BIGINT),'yyyy-MM-dd HH:mm:ss')) as day
              > from (
              > select json_tuple(json,'movie','rate','time','userid') as (movie_id,rate,time,user_id)  from rating_json
              > ) t;
Query ID = hadoop_20190728233636_919dc0ee-61ea-4bdf-a725-2c87f43def12
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1564328140990_0002, Tracking URL = http://localhost:4044/proxy/application_1564328140990_0002/
Kill Command = /home/hadoop/app/hadoop-2.6.0-cdh5.7.0/bin/hadoop job  -kill job_1564328140990_0002
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2019-07-28 23:41:05,630 Stage-1 map = 0%,  reduce = 0%
2019-07-28 23:41:25,706 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 16.24 sec
MapReduce Total cumulative CPU time: 16 seconds 240 msec
Ended Job = job_1564328140990_0002
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://hadoop001:9000/user/hive/warehouse/d1_hive.db/.hive-staging_hive_2019-07-28_23-40-57_676_3047408219089077111-1/-ext-10001
Moving data to: hdfs://hadoop001:9000/user/hive/warehouse/d1_hive.db/rate_movie
Table d1_hive.rate_movie stats: [numFiles=1, numRows=1000209, totalSize=41695855, rawDataSize=40695646]
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1   Cumulative CPU: 16.24 sec   HDFS Read: 63606529 HDFS Write: 41695940 SUCCESS
Total MapReduce CPU Time Spent: 16 seconds 240 msec
OK
movie_id	rate	user_id	time	year	month	day
Time taken: 30.415 seconds
hive (d1_hive)> select * from rate_movie limit 10;
OK
rate_movie.movie_id	rate_movie.rate	rate_movie.user_id	rate_movie.time	rate_movie.year	rate_movie.month	rate_movie.day
1193	5	1	2001-01-01 06:12:40	2001	1	1
661	3	1	2001-01-01 06:35:09	2001	1	1
914	3	1	2001-01-01 06:32:48	2001	1	1
3408	4	1	2001-01-01 06:04:35	2001	1	1
2355	5	1	2001-01-07 07:38:11	2001	1	7
1197	3	1	2001-01-01 06:37:48	2001	1	1
1287	5	1	2001-01-01 06:33:59	2001	1	1
2804	5	1	2001-01-01 06:11:59	2001	1	1
594	4	1	2001-01-01 06:37:48	2001	1	1
919	4	1	2001-01-01 06:22:48	2001	1	1
Time taken: 0.089 seconds, Fetched: 10 row(s)
hive (d1_hive)>

parse_url_tuple

Create a table with a single string-typed column to hold the URL data, and load the following data into it:

hive (d1_hive)> create table url(url string);
OK
Time taken: 0.082 seconds
hive (d1_hive)> load data local inpath '/home/hadoop/data/url.txt' overwrite into table url;
Loading data to table d1_hive.url
Table d1_hive.url stats: [numFiles=1, numRows=0, totalSize=61, rawDataSize=0]
OK
Time taken: 0.333 seconds
hive (d1_hive)> select * from url;
OK
url.url
http://www.ruozedata.com/d7/xxx.html?cookieid=1234567&a=b&c=d
Time taken: 0.084 seconds, Fetched: 1 row(s)
hive (d1_hive)>

How do we split the URL apart? The parse_url_tuple function makes this easy.

hive (d1_hive)> select parse_url_tuple(url, 'HOST', 'PATH', 'QUERY', 'QUERY:cookieid','QUERY:a') as (host,path,query,cookie_id,a) from url;
OK
host	path	query	cookie_id	a
www.ruozedata.com	/d7/xxx.html	cookieid=1234567&a=b&c=d	1234567	b
Time taken: 0.089 seconds, Fetched: 1 row(s)
hive (d1_hive)>
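
Like json_tuple, parse_url_tuple is a UDTF, so combining its output with other columns again requires LATERAL VIEW. A minimal sketch (the alias names are arbitrary):

select u.url, t.host, t.cookie_id
from url u
lateral view parse_url_tuple(url, 'HOST', 'QUERY:cookieid') t as host, cookie_id;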

A Generic topN Solution

Take the wide table parsed from the JSON above and find each user's three highest-rated movies. The key is the analytic function ROW_NUMBER, which ranks rows within each partition.

hive (d1_hive)> select user_id,movie_id,rate,time from
              > (
              > select user_id,movie_id,rate,time,row_number() over(partition by user_id order by rate desc) as r from rate_movie
              > ) t where t.r<=3 limit 10;
Query ID = hadoop_20190728233636_919dc0ee-61ea-4bdf-a725-2c87f43def12
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1564328140990_0008, Tracking URL = http://localhost:4044/proxy/application_1564328140990_0008/
Kill Command = /home/hadoop/app/hadoop-2.6.0-cdh5.7.0/bin/hadoop job  -kill job_1564328140990_0008
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2019-07-29 00:13:11,504 Stage-1 map = 0%,  reduce = 0%
2019-07-29 00:13:23,046 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 7.28 sec
2019-07-29 00:13:33,574 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 12.01 sec
MapReduce Total cumulative CPU time: 12 seconds 10 msec
Ended Job = job_1564328140990_0008
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 12.01 sec   HDFS Read: 41704880 HDFS Write: 295 SUCCESS
Total MapReduce CPU Time Spent: 12 seconds 10 msec
OK
user_id	movie_id	rate	time
1	1035	5	2001-01-01 06:29:13
1	1	5	2001-01-07 07:37:48
1	2028	5	2001-01-01 06:26:59
10	3591	5	2000-12-31 10:08:45
10	3809	5	2000-12-31 10:07:31
10	954	5	2000-12-31 09:25:22
100	800	5	2000-12-24 01:51:55
100	527	5	2000-12-24 02:07:19
100	919	5	2000-12-24 02:09:07
1000	2571	5	2000-11-24 12:46:50
Time taken: 30.207 seconds, Fetched: 10 row(s)
hive (d1_hive)>
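
Note that row_number() assigns a unique sequence even to tied ratings, so ties are broken arbitrarily (user 1 has many movies rated 5, but only three survive). If tied movies should all be kept, rank() or dense_rank() can be swapped in; a sketch:

select user_id, movie_id, rate
from (
  select user_id, movie_id, rate,
         dense_rank() over (partition by user_id order by rate desc) as r
  from rate_movie
) t
where t.r <= 3;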

Beeline Connection

Before connecting with Beeline, HiveServer2 must be started first. HiveServer2 (HS2) is a service that allows clients to execute queries against Hive; only once this service is running can JDBC/Beeline-style clients connect.

# Start HS2
[hadoop@10-9-15-140 bin]$ nohup ./hiveserver2 >> /tmp/hs2.log 2>&1 &
[1] 10556
[hadoop@10-9-15-140 bin]$ tail -F /tmp/hs2.log

# Connect with Beeline; the default port is 10000
[hadoop@10-9-15-140 bin]$ ./beeline -u jdbc:hive2://hadoop001:10000/d1_hive -n hadoop
which: no hbase in (/home/hadoop/app/hive-1.1.0-cdh5.7.0/bin:/home/hadoop/app/protobuf-2.5.0/bin:/home/hadoop/app/apache-maven-3.3.9/bin:/home/hadoop/app/hadoop-2.6.0-cdh5.7.0/bin:/usr/java/jdk1.7.0_80/bin:/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/home/hadoop/bin)
scan complete in 4ms
Connecting to jdbc:hive2://hadoop001:10000/d1_hive
Connected to: Apache Hive (version 1.1.0-cdh5.7.0)
Driver: Hive JDBC (version 1.1.0-cdh5.7.0)
Transaction isolation: TRANSACTION_REPEATABLE_READ
Beeline version 1.1.0-cdh5.7.0 by Apache Hive
0: jdbc:hive2://hadoop001:10000/d1_hive> show databases;
+----------------+--+
| database_name  |
+----------------+--+
| d1_hive        |
| default        |
+----------------+--+
2 rows selected (1.241 seconds)
INFO  : Compiling command(queryId=hadoop_20190729002121_95a9d614-d5df-49f1-85ee-ea0ec2b66ba4): show databases
INFO  : Semantic Analysis Completed
INFO  : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:database_name, type:string, comment:from deserializer)], properties:null)
INFO  : Completed compiling command(queryId=hadoop_20190729002121_95a9d614-d5df-49f1-85ee-ea0ec2b66ba4); Time taken: 0.818 seconds
INFO  : Concurrency mode is disabled, not creating a lock manager
INFO  : Executing command(queryId=hadoop_20190729002121_95a9d614-d5df-49f1-85ee-ea0ec2b66ba4): show databases
INFO  : Starting task [Stage-0:DDL] in serial mode
INFO  : Completed executing command(queryId=hadoop_20190729002121_95a9d614-d5df-49f1-85ee-ea0ec2b66ba4); Time taken: 0.064 seconds
INFO  : OK
0: jdbc:hive2://hadoop001:10000/d1_hive>
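
Beeline can also run queries non-interactively, which is handy for scripting: -e takes a query string and -f takes a script file. A sketch:

# run a single statement and exit
./beeline -u jdbc:hive2://hadoop001:10000/d1_hive -n hadoop -e "select count(*) from rate_movie"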
