Before reading this article, you should read the previous one first:
Use jps to check the currently running services (Master and Worker are Spark services and are not relevant to this article):
[root@centos2020 dataset]# jps
11408 Master
12707 RunJar
7876 NameNode
8183 ResourceManager
7930 DataNode
8477 NodeManager
11550 Worker
12990 Jps
Also, MySQL and the Hive service must be started before following this article (for the startup steps, see: Taobao Double 11 Big Data Analysis (Data Preparation)).
hive> select count(*) from user_log;
Execution process and result (to save space, later queries show only the result, not the full process):
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = root_20200226001346_be1efc88-10bf-4030-9509-62de994fb0d2
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=
In order to set a constant number of reducers:
set mapreduce.job.reduces=
Starting Job = job_1582623436135_0001, Tracking URL = http://centos2020:8088/proxy/application_1582623436135_0001/
Kill Command = /usr/hadoop/hadoop-2.7.7/bin/hadoop job -kill job_1582623436135_0001
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2020-02-26 00:14:53,555 Stage-1 map = 0%, reduce = 0%
2020-02-26 00:15:16,085 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 3.0 sec
2020-02-26 00:15:39,527 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 6.35 sec
MapReduce Total cumulative CPU time: 6 seconds 350 msec
Ended Job = job_1582623436135_0001
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 6.35 sec HDFS Read: 482200 HDFS Write: 105 SUCCESS
Total MapReduce CPU Time Spent: 6 seconds 350 msec
OK
10000
Time taken: 115.187 seconds, Fetched: 1 row(s)
The result is 10000, which is correct: in the previous article, exactly 10000 rows were loaded into this table.
hive> select count(distinct user_id) from user_log;
Result:
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 5.41 sec HDFS Read: 482582 HDFS Write: 103 SUCCESS
Total MapReduce CPU Time Spent: 5 seconds 410 msec
OK
358
Time taken: 45.244 seconds, Fetched: 1 row(s)
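What `count(distinct user_id)` computes can be sketched locally in plain Python; the sample rows below are made up for illustration and are not taken from the real user_log table:

```python
# Made-up sample rows standing in for user_log; each dict is one log row.
rows = [
    {"user_id": 328862, "action": "0"},
    {"user_id": 328862, "action": "2"},
    {"user_id": 234512, "action": "2"},
    {"user_id": 328862, "action": "1"},
]

# count(*): total number of rows.
total_rows = len(rows)

# count(distinct user_id): number of unique user_id values.
distinct_users = len({r["user_id"] for r in rows})

print(total_rows, distinct_users)  # 4 2
```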
hive> select count(*) from (select user_id,item_id,cat_id,merchant_id,brand_id,month,day,action from user_log group by user_id,item_id,cat_id,merchant_id,brand_id,month,day,action having count(*)=1)a;
Result:
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 6.93 sec HDFS Read: 485070 HDFS Write: 116 SUCCESS
Stage-Stage-2: Map: 1 Reduce: 1 Cumulative CPU: 4.33 sec HDFS Read: 5058 HDFS Write: 104 SUCCESS
Total MapReduce CPU Time Spent: 11 seconds 260 msec
OK
4754
Time taken: 89.299 seconds, Fetched: 1 row(s)
Note that the subquery in the statement above must be given an alias (a here).
Otherwise Hive reports a parse error:
hive> select count(*) from (select user_id,item_id,cat_id,merchant_id,brand_id,month,day,action from user_log group by user_id,item_id,cat_id,merchant_id,brand_id,month,day,action having count(*)=1);
NoViableAltException(256@[])
at org.apache.hadoop.hive.ql.parse.HiveParser_FromClauseParser.atomjoinSource(HiveParser_FromClauseParser.java:2265)
at org.apache.hadoop.hive.ql.parse.HiveParser_FromClauseParser.joinSource(HiveParser_FromClauseParser.java:2475)
at org.apache.hadoop.hive.ql.parse.HiveParser_FromClauseParser.fromSource(HiveParser_FromClauseParser.java:1690)
at org.apache.hadoop.hive.ql.parse.HiveParser_FromClauseParser.fromClause(HiveParser_FromClauseParser.java:1312)
at org.apache.hadoop.hive.ql.parse.HiveParser.fromClause(HiveParser.java:42074)
at org.apache.hadoop.hive.ql.parse.HiveParser.atomSelectStatement(HiveParser.java:36735)
at org.apache.hadoop.hive.ql.parse.HiveParser.selectStatement(HiveParser.java:36987)
at org.apache.hadoop.hive.ql.parse.HiveParser.regularBody(HiveParser.java:36633)
at org.apache.hadoop.hive.ql.parse.HiveParser.queryStatementExpressionBody(HiveParser.java:35822)
at org.apache.hadoop.hive.ql.parse.HiveParser.queryStatementExpression(HiveParser.java:35710)
at org.apache.hadoop.hive.ql.parse.HiveParser.execStatement(HiveParser.java:2284)
at org.apache.hadoop.hive.ql.parse.HiveParser.statement(HiveParser.java:1333)
at org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:208)
at org.apache.hadoop.hive.ql.parse.ParseUtils.parse(ParseUtils.java:77)
at org.apache.hadoop.hive.ql.parse.ParseUtils.parse(ParseUtils.java:70)
at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:468)
at org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:1317)
at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1457)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1237)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1227)
at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:233)
at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:184)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:403)
at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:821)
at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:759)
at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:686)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.util.RunJar.run(RunJar.java:226)
at org.apache.hadoop.util.RunJar.main(RunJar.java:141)
FAILED: ParseException line 1:22 cannot recognize input near '(' 'select' 'user_id' in joinSource
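Setting the alias issue aside, the de-duplication pattern itself (group on all columns, keep groups of size 1, then count them) can be sketched in Python; the rows below are invented and the column list is shortened for readability:

```python
from collections import Counter

# Made-up sample rows; the real query groups on all eight columns,
# shortened here to (user_id, item_id, action).
rows = [
    (328862, 323294, "2"),
    (328862, 323294, "2"),  # exact duplicate of the previous row
    (234512, 844400, "2"),
    (111111, 575153, "0"),
]

counts = Counter(rows)
# having count(*) = 1  ->  keep rows whose group size is exactly 1
unique_rows = [row for row, n in counts.items() if n == 1]

# The outer count(*) then counts those fully unique rows.
print(len(unique_rows))  # 2
```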
hive> select count(distinct user_id) from user_log where action='2';
Result:
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 5.76 sec HDFS Read: 483409 HDFS Write: 103 SUCCESS
Total MapReduce CPU Time Spent: 5 seconds 760 msec
OK
358
Time taken: 49.434 seconds, Fetched: 1 row(s)
hive> select count(*) from user_log where action='2' and brand_id=2661;
Result:
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 6.46 sec HDFS Read: 483258 HDFS Write: 101 SUCCESS
Total MapReduce CPU Time Spent: 6 seconds 460 msec
OK
3
Time taken: 52.679 seconds, Fetched: 1 row(s)
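The two-condition count above amounts to a simple filter; a minimal Python sketch with invented rows:

```python
# Made-up rows standing in for user_log.
rows = [
    {"action": "2", "brand_id": 2661},
    {"action": "2", "brand_id": 2661},
    {"action": "0", "brand_id": 2661},  # a view of the brand, not a purchase
    {"action": "2", "brand_id": 1234},  # purchase of a different brand
]

# count(*) where action='2' and brand_id=2661
purchases_of_brand = sum(
    1 for r in rows if r["action"] == "2" and r["brand_id"] == 2661
)
print(purchases_of_brand)  # 2
```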
First, find how many users bought goods on that day, i.e. the same statement used in the keyword query above:
select count(distinct user_id) from user_log where action='2';
Its result is 358.
Then find how many users viewed the goods, i.e. the number of distinct user_id values:
select count(distinct user_id) from user_log;
Its result is also 358.
So the ratio is 358/358 = 1, meaning every user in this sample made a purchase.
First, count the purchases made by male buyers:
select count(*) from user_log where gender=1;
Result: 3299
Then count the purchases made by female buyers:
select count(*) from user_log where gender=0;
Result: 3361
So the male-to-female ratio is 3299/3361, roughly 0.98.
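As a quick check of the ratio arithmetic from the two counts above:

```python
male_purchases = 3299    # count(*) from user_log where gender=1
female_purchases = 3361  # count(*) from user_log where gender=0

ratio = male_purchases / female_purchases
print(round(ratio, 2))  # 0.98
```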
Query the ids of users who bought goods on the site more than 5 times on that day:
select user_id from user_log where action='2' group by user_id having count(action='2')>5;
Result (partial, to save space):
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 6.77 sec HDFS Read: 483294 HDFS Write: 1304 SUCCESS
Total MapReduce CPU Time Spent: 6 seconds 770 msec
OK
1321
6058
16464
...
422917
Time taken: 53.104 seconds, Fetched: 65 row(s)
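The group-by/having filter in that query can be sketched in Python; the user ids and purchase counts below are made up and are not the real result set:

```python
from collections import Counter

# Made-up purchase log: each entry is the user_id of one purchase
# (i.e. one row with action='2').
purchase_user_ids = ["u1"] * 7 + ["u2"] * 3 + ["u3"] * 6

counts = Counter(purchase_user_ids)
# having count(...) > 5  ->  keep users with more than 5 purchases
heavy_buyers = sorted(u for u, n in counts.items() if n > 5)

print(heavy_buyers)  # ['u1', 'u3']
```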
Query the number of purchases per brand
This time, create a new table to store the result:
create table scan(brand_id INT,scan INT) COMMENT 'This is the search of bigdatataobao' ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE;
Load the data:
insert overwrite table scan select brand_id,count(action) from user_log where action='2' group by brand_id;
View the result:
select * from scan;
Result (partial, to save space):
OK
NULL 8
60 3
69 1
82 11
99 3
104 1
125 1
127 1
133 1
...
8396 3
8410 1
8461 1
Time taken: 0.852 seconds, Fetched: 544 row(s)
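The create-table-then-aggregate pattern above can be mimicked locally with sqlite3 as a rough sketch; the sample rows are made up and sqlite's SQL only approximates Hive's `insert overwrite`:

```python
import sqlite3

# Local sqlite sketch of the Hive pattern: aggregate purchase counts
# per brand into a separate table. Sample rows are made up.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE user_log (brand_id INTEGER, action TEXT)")
cur.executemany(
    "INSERT INTO user_log VALUES (?, ?)",
    [(60, "2"), (60, "2"), (69, "2"), (60, "0")],
)

# Rough equivalent of:
#   insert overwrite table scan
#   select brand_id, count(action) from user_log
#   where action='2' group by brand_id;
cur.execute(
    "CREATE TABLE scan AS "
    "SELECT brand_id, COUNT(action) AS scan FROM user_log "
    "WHERE action = '2' GROUP BY brand_id"
)

for row in cur.execute("SELECT * FROM scan ORDER BY brand_id"):
    print(row)  # (60, 2) then (69, 1)
```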