Taobao Double 11 Big Data Analysis (Hive Analysis, Part 1)

Table of Contents

  • Preface
  • Current environment
  • Use the aggregate function count() to count how many rows are in the table
  • Count the number of distinct user_id values
  • Count the number of fully distinct rows (to filter out fake orders)
  • Keyword-based query analysis
    • How many users made a purchase on Double 11
    • How many items of this brand were purchased that day
  • Analysis based on user behavior
    • The purchase or browse ratio of an item on a given day
    • The ratio of male to female buyers purchasing on Double 11
    • Given a purchase-count range, find the user ids who bought that many items on a given day
  • Real-time user query analysis

Preface

Before reading this article, please read the earlier installments:

  1. Taobao Double 11 Big Data Analysis (Environment Setup)
  2. Taobao Double 11 Big Data Analysis (Data Preparation)

Current environment

Use jps to list the currently running services (Master and Worker are Spark services and are not relevant to this article):

[root@centos2020 dataset]# jps
11408 Master
12707 RunJar
7876 NameNode
8183 ResourceManager
7930 DataNode
8477 NodeManager
11550 Worker
12990 Jps

In addition, MySQL and the Hive service must be running before you follow along (for the startup steps, see Taobao Double 11 Big Data Analysis (Data Preparation)).

Use the aggregate function count() to count how many rows are in the table

hive> select count(*) from user_log;

Execution process and result (for brevity, later runs show only the result, not the full process):

WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = root_20200226001346_be1efc88-10bf-4030-9509-62de994fb0d2
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=
In order to set a constant number of reducers:
  set mapreduce.job.reduces=
Starting Job = job_1582623436135_0001, Tracking URL = http://centos2020:8088/proxy/application_1582623436135_0001/
Kill Command = /usr/hadoop/hadoop-2.7.7/bin/hadoop job  -kill job_1582623436135_0001
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2020-02-26 00:14:53,555 Stage-1 map = 0%,  reduce = 0%
2020-02-26 00:15:16,085 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 3.0 sec
2020-02-26 00:15:39,527 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 6.35 sec
MapReduce Total cumulative CPU time: 6 seconds 350 msec
Ended Job = job_1582623436135_0001
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 6.35 sec   HDFS Read: 482200 HDFS Write: 105 SUCCESS
Total MapReduce CPU Time Spent: 6 seconds 350 msec
OK
10000
Time taken: 115.187 seconds, Fetched: 1 row(s)

The result is 10000, which is correct: the previous article loaded exactly 10,000 rows into this table.

Count the number of distinct user_id values

hive> select count(distinct user_id) from user_log;

Result:

MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 5.41 sec   HDFS Read: 482582 HDFS Write: 103 SUCCESS
Total MapReduce CPU Time Spent: 5 seconds 410 msec
OK
358
Time taken: 45.244 seconds, Fetched: 1 row(s)
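
As a side note, the total row count and the distinct user count can be fetched in a single pass instead of two separate queries; a minimal sketch:

select count(*), count(distinct user_id) from user_log;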

Count the number of fully distinct rows (to filter out fake orders)

hive> select count(*) from (select user_id, item_id, cat_id, merchant_id, brand_id, month, day, action from user_log group by user_id, item_id, cat_id, merchant_id, brand_id, month, day, action having count(*)=1) a;

Result:

MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 6.93 sec   HDFS Read: 485070 HDFS Write: 116 SUCCESS
Stage-Stage-2: Map: 1  Reduce: 1   Cumulative CPU: 4.33 sec   HDFS Read: 5058 HDFS Write: 104 SUCCESS
Total MapReduce CPU Time Spent: 11 seconds 260 msec
OK
4754
Time taken: 89.299 seconds, Fetched: 1 row(s)

Note that the subquery in the statement above must be given the alias a; otherwise Hive reports a parse error:

hive> select count(*) from (select user_id,item_id,cat_id,merchant_id,brand_id,month,day,action from user_log group by user_id,item_id,cat_id,merchant_id,brand_id,month,day,action having count(*)=1);
NoViableAltException(256@[])
	at org.apache.hadoop.hive.ql.parse.HiveParser_FromClauseParser.atomjoinSource(HiveParser_FromClauseParser.java:2265)
	at org.apache.hadoop.hive.ql.parse.HiveParser_FromClauseParser.joinSource(HiveParser_FromClauseParser.java:2475)
	at org.apache.hadoop.hive.ql.parse.HiveParser_FromClauseParser.fromSource(HiveParser_FromClauseParser.java:1690)
	at org.apache.hadoop.hive.ql.parse.HiveParser_FromClauseParser.fromClause(HiveParser_FromClauseParser.java:1312)
	at org.apache.hadoop.hive.ql.parse.HiveParser.fromClause(HiveParser.java:42074)
	at org.apache.hadoop.hive.ql.parse.HiveParser.atomSelectStatement(HiveParser.java:36735)
	at org.apache.hadoop.hive.ql.parse.HiveParser.selectStatement(HiveParser.java:36987)
	at org.apache.hadoop.hive.ql.parse.HiveParser.regularBody(HiveParser.java:36633)
	at org.apache.hadoop.hive.ql.parse.HiveParser.queryStatementExpressionBody(HiveParser.java:35822)
	at org.apache.hadoop.hive.ql.parse.HiveParser.queryStatementExpression(HiveParser.java:35710)
	at org.apache.hadoop.hive.ql.parse.HiveParser.execStatement(HiveParser.java:2284)
	at org.apache.hadoop.hive.ql.parse.HiveParser.statement(HiveParser.java:1333)
	at org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:208)
	at org.apache.hadoop.hive.ql.parse.ParseUtils.parse(ParseUtils.java:77)
	at org.apache.hadoop.hive.ql.parse.ParseUtils.parse(ParseUtils.java:70)
	at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:468)
	at org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:1317)
	at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1457)
	at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1237)
	at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1227)
	at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:233)
	at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:184)
	at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:403)
	at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:821)
	at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:759)
	at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:686)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.hadoop.util.RunJar.run(RunJar.java:226)
	at org.apache.hadoop.util.RunJar.main(RunJar.java:141)
FAILED: ParseException line 1:22 cannot recognize input near '(' 'select' 'user_id' in joinSource
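
Alternatively, the alias can be avoided altogether with a common table expression (available since Hive 0.13, so this Hive 2 environment qualifies). A minimal sketch; the CTE name dedup is arbitrary:

with dedup as (
  select user_id, item_id, cat_id, merchant_id, brand_id, month, day, action
  from user_log
  group by user_id, item_id, cat_id, merchant_id, brand_id, month, day, action
  having count(*) = 1
)
select count(*) from dedup;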

Keyword-based query analysis

How many users made a purchase on Double 11

hive> select count(distinct user_id) from user_log where action='2';

Result:

MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 5.76 sec   HDFS Read: 483409 HDFS Write: 103 SUCCESS
Total MapReduce CPU Time Spent: 5 seconds 760 msec
OK
358
Time taken: 49.434 seconds, Fetched: 1 row(s)
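
No date filter is needed here because the user_log table only holds Double 11 records. If it covered a longer period, a variant like the following (assuming month and day are stored as numbers, as the group-by columns above suggest) would restrict the count to November 11:

select count(distinct user_id) from user_log where action='2' and month=11 and day=11;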

How many items of this brand were purchased that day

hive> select count(*) from user_log where action='2' and brand_id=2661;

Result:

MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 6.46 sec   HDFS Read: 483258 HDFS Write: 101 SUCCESS
Total MapReduce CPU Time Spent: 6 seconds 460 msec
OK
3
Time taken: 52.679 seconds, Fetched: 1 row(s)

Analysis based on user behavior

The purchase or browse ratio of an item on a given day

First, count how many users made a purchase that day, i.e. the query from the keyword-based section above:

select count(distinct user_id) from user_log where action='2';

Its result is 358.

Then count how many users viewed items, which here is simply the number of distinct user_id values from earlier:

select count(distinct user_id) from user_log;

Its result is also 358.

So the ratio is 358/358 = 1: every user who appears in the log made at least one purchase.
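
The two steps can also be combined into a single statement. A sketch that computes the ratio directly; it relies on the fact that a CASE with no ELSE yields NULL and count(distinct ...) ignores NULLs:

select count(distinct case when action='2' then user_id end) / count(distinct user_id) from user_log;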

The ratio of male to female buyers purchasing on Double 11

First, count the items bought by male buyers:

select count(*) from user_log where gender=1;

Result: 3299

Then count the items bought by female buyers:

select count(*) from user_log where gender=0;

Result: 3361

So the male-to-female ratio is 3299 / 3361 ≈ 0.98.
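
Note that the two queries above count all log records per gender, not only purchase events. To restrict the comparison to purchases, and to get both counts in one pass, a sketch reusing action='2' as the purchase marker:

select gender, count(*) from user_log where action='2' group by gender;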

Given a purchase-count range, find the user ids who bought that many items on a given day

For example, find the ids of users who made more than 5 purchases on the site that day:

select user_id from user_log where action='2' group by user_id having count(action='2')>5;

Result (partial output, for brevity):

MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 6.77 sec   HDFS Read: 483294 HDFS Write: 1304 SUCCESS
Total MapReduce CPU Time Spent: 6 seconds 770 msec
OK
1321
6058
16464
...
422917
Time taken: 53.104 seconds, Fetched: 65 row(s)
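
Since the WHERE clause already restricts rows to purchases, having count(*) > 5 would be equivalent to having count(action='2') > 5 and reads more clearly. And because this section asks for a range rather than a single threshold, a BETWEEN clause also works; a sketch for users with 5 to 10 purchases:

select user_id from user_log where action='2' group by user_id having count(*) between 5 and 10;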

Real-time user query analysis

Count the number of purchases for each brand

This time, create a new table to store the result:

create table scan(brand_id INT,scan INT) COMMENT 'This is the search of bigdatataobao' ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE;

Load the data into it:

insert overwrite table scan select brand_id,count(action) from user_log where action='2' group by brand_id;

View the result:

select * from scan;

Result (partial output, for brevity):

OK
NULL	8
60	3
69	1
82	11
99	3
104	1
125	1
127	1
133	1
...
8396	3
8410	1
8461	1
Time taken: 0.852 seconds, Fetched: 544 row(s)
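
The NULL row collects records whose brand_id field was empty. To list the best-selling brands first, the new table can simply be sorted; a sketch that skips the NULL brand and shows the top 10:

select brand_id, scan from scan where brand_id is not null order by scan desc limit 10;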
