Hive 常见笔试题
一、累计次数
1.1、 编写sql实现每个用户截止到每月为止的最大单月访问次数和累计到该月的总访问次数
数据集
A,2015-01,5
A,2015-01,15
B,2015-01,5
A,2015-01,8
B,2015-01,25
A,2015-01,5
A,2015-02,4
A,2015-02,6
B,2015-02,10
B,2015-02,5
A,2015-03,16
A,2015-03,22
B,2015-03,23
B,2015-03,10
B,2015-03,11
最后结果展示:
用户 月份 最大访问次数 总访问次数 当月访问次数
A 2015-01 33 33 33
A 2015-02 33 43 10
A 2015-03 38 81 38
B 2015-01 30 30 30
B 2015-02 30 45 15
B 2015-03 44 89 44
step1:首先统计出每个用户每月的访问次数
step2:然后调用开窗函数,按照userid进行分许,按照month进行升序排序,结算截至当月的累计访问次数和最大访问次数
select id
,month
,visits
,max(visits) over(partition by id order by month ) max_visits
,sum(visits) over(partition by id order by month ) sum_visits
from
(select id
,month
,sum(visits) as visits
from visits
group by id,month
) t1;
注:over( order by )开窗范围 rows between unbounded preceding and current row
1.2 使用SQL统计出每个用户的累积访问次数
我们有如下的用户访问数据
userId visitDate visitCount
u01 2017/1/21 5
u02 2017/1/23 6
u03 2017/1/22 8
u04 2017/1/20 3
u01 2017/1/23 6
u01 2017/2/21 8
U02 2017/1/23 6
U01 2017/2/22 4
要求使用SQL统计出每个用户的累积访问次数,如下表所示:
用户id 月份 小计 累积
u01 2017-01 11 11
u01 2017-02 12 23
u02 2017-01 12 12
u03 2017-01 8 8
u04 2017-01 3 3
step1:对时间进行处理,方便按照月份进行聚合
step2:统计每个用户,每月的访问次数
step3:调用over()开窗函数,按照userid 分区,按照月份进行升序排序,统计累计访问次数
数据准备:
CREATE TABLE test_sql.test1 (
userId string,
visitDate string,
visitCount INT )
ROW format delimited FIELDS TERMINATED BY "\t";
INSERT INTO TABLE test_sql.test1
VALUES
( 'u01', '2017/1/21', 5 ),
( 'u02', '2017/1/23', 6 ),
( 'u03', '2017/1/22', 8 ),
( 'u04', '2017/1/20', 3 ),
( 'u01', '2017/1/23', 6 ),
( 'u01', '2017/2/21', 8 ),
( 'u02', '2017/1/23', 6 ),
( 'u01', '2017/2/22', 4 );
查询SQL:
SELECT t2.userid,
t2.visitmonth,
subtotal_visit_cnt,
sum(subtotal_visit_cnt) over (partition BY userid ORDER BY visitmonth) AS total_visit_cnt
FROM
(SELECT userid,
visitmonth,
sum(visitcount) AS subtotal_visit_cnt
FROM
(SELECT userid,
date_format(regexp_replace(visitdate,'/','-'),'yyyy-MM') AS visitmonth,
visitcount
FROM test_sql.test1
) t1
GROUP BY userid,visitmonth
)t2
ORDER BY t2.userid,
t2.visitmonth
1.3求出每个栏目的被观看次数及累计观看时长
数据集
Uid channle min
1 1 23
2 1 12
3 1 12
4 1 32
5 1 342
6 2 13
7 2 34
8 2 13
9 2 134
结果
channl count min
1 5 421
2 4 194
select channl
,count(channl)
,sum(min)
from vedio
group by channl;
二、连续问题
2.1、编写连续7天登录的用户
数据集
Uid dt login_status(1登录成功,0异常)
1 2019-07-11 1
1 2019-07-12 1
1 2019-07-13 1
1 2019-07-14 1
1 2019-07-15 1
1 2019-07-16 1
1 2019-07-17 1
1 2019-07-18 1
2 2019-07-11 1
2 2019-07-12 1
2 2019-07-13 0
2 2019-07-14 1
2 2019-07-15 1
2 2019-07-16 0
2 2019-07-17 1
2 2019-07-18 0
3 2019-07-11 1
3 2019-07-12 1
3 2019-07-13 1
3 2019-07-14 1
3 2019-07-15 1
3 2019-07-16 1
3 2019-07-17 1
3 2019-07-18 1
step1:首先调用over开窗函数,按照userid分区按照时间升序排序,计算出排名
step2:将date-排名的到差值日期
step3:统计用户id,dt进行聚合,统计出count(*)>=7用户
select uid
from
(select uid
,date_sub(dt,rn) as dt
from
(select uid
,dt
,row_number() over(partition by uid order by dt asc) rn
from login
where login_status=1 ) t1
) t2
group by uid,dt
having count(1)>=7;
2.2 给定一张用户签到表user_attendence,表中包含三个字段,分别是user_id,date,is_sign_in,0否1是。
<问题1> 计算截至当前,每个用户已经连续签到的天数:要求输出用户ID【user_id】和连续签到天数【recent_continuous_days】
针对问题1,这里有一种非常巧妙的解法:只需要利用Max和datediff函数,也就是说只要找到用户最近一次没有签到的日期,计算和当前日期的差值即可。
select user_id
,datediff('2021-09-23',max_date) as recent_continuous_days
from
(
select user_id
, max(date) as max_date
from user_attendence
where is_sign_in=0
group by user_id
)
<问题2> 计算有史以来用户最大连续签到天数:要求输出用户ID【user_id】和最大连续签到天数【max_continuous_days】
针对问题2,用每行的签到日期减去序号,如果签到日期是连续的话,求得的值则是相同值,一旦日期不连续,将会求得一个新的相同值,这样的话,可以统计每个值的数量,进而判定最长签到天数。
select user_id
,max(continues_day) as max_continuous_days
from
(select user_id
,count(*) as continues_day
from
(select user_id
,date_sub(date,rn) as dt
from
(select user_id
,date
,row_number() over(partition by user_id order by date ) as rn
from user_attendence
where is_sign_in =1) a
) b
group by user_id,dt) c
group by user_id
2.3 求连续点击三次的用户数,而且中间不能有别人的点击
create table myschema.click(
userid VARCHAR (20) NOT NULL,
clickorder int not null
);
select * from myschema.click;
INSERT INTO myschema.click (userid,clickorder) values
('g',10),('a',11),('c',12),('d',13),('d',14),
('a',15),('a',16),('a',17),('d',18),('e',19),('e',20),('e',21),('e',22);
step1:按时间排序 rank1 :1 2 3 4 5 6
step2:按用户分组,按时间排序 rank2:1 2 1 1 2 3
step3:计算 rank1 - rank2,记为diff:0 0 2 3 3 3
step4:对userid,diff分组计数=3
select user_id
from
(SELECT *,a.rank_1 - a.rank_2 as diff
FROM
(select *
,row_number() over(order by clickorder) as rank_1
,row_number() over(partition by userid order by clickorder) as rank_2
from click
) a
) group by user_id ,diff having count(1)>=3;
2.4 求得所有用户和活跃用户的总数及平均年龄
日期 用户 年龄
11,test_1,23
11,test_2,19
11,test_3,39
11,test_1,23
11,test_3,39
11,test_1,23
12,test_2,19
13,test_1,23
1、
select
count(*) sum_user,
avg(age) avg_age
from
(
select
user_id,
avg(age) age
from test_five_active
group by user_id
) t1;
select
count(*),
avg(d.age)
from
(
select
c.user_id,
c.age
from
(
select
b.user_id,
b.age,
b.diff,
count(*) flag
from
(
select
a.active_time,
a.user_id,
a.age,
a.rank_time,
a.active_time-a.rank_time diff
from
(
select
active_time,
user_id,
age,
rank() over(partition by user_id order by active_time) rank_time
from test_five_active
group by active_time,user_id,age
) a
) b
group by b.user_id,b.age,b.diff
having count(*) >=2
) c
group by c.user_id,c.age
) d;
三、TOPN
3.1、每个店铺访问次数top3的访客信息。输出店铺名称、访客id、访问次数
有50W个京东店铺,每个顾客访客访问任何一个店铺的任何一个商品时都会产生一条访问日志,
访问日志存储的表名为Visit,访客的用户id为user_id,被访问的店铺名称为shop,数据如下:
u1 a
u2 b
u1 b
u1 a
u3 c
u4 b
u1 a
u2 c
u5 b
u4 b
u6 c
u2 c
u1 b
u2 a
u2 a
u3 a
u5 a
u5 a
u5 a
step1:统计每个用户对于某一个店铺的访问总次数
step2:利用窗口函数按照店铺分区,访问次数降序排序
step3:取出排名小于等于3
SELECT t2.shop,
t2.user_id,
t2.cnt
FROM
(SELECT t1.*,
rank() over(partition BY t1.shop
ORDER BY t1.cnt DESC) rank
FROM
(SELECT user_id,
shop,
count(*) AS cnt
FROM test_sql.test2
GROUP BY user_id,
shop) t1)t2
WHERE rank <= 3 ;
3.2 编写sql语句实现每班前三名,分数一样并列,同时求出前三名按名次排序的一次的分差
数据集
1 1901 90
2 1901 90
3 1901 83
4 1901 60
5 1902 66
6 1902 23
7 1902 99
8 1902 67
9 1902 87
create table if not exists stu(
stu_no string,
class string,
score int
)
row format delimited fields terminated by ' '
lines terminated by '\n'
stored as textfile
;
step1:调用开窗函数按照班级进行分区,按照分数进行降序排序。
step2:利用lag()函数计算前一名的分数,计算差值
select
class,
stu_no,
score,
num,
score-nvl(lag(score) over(distribute by class sort by num),0)
from
(select
stu_no,
class,
score,
rank() over(distribute by class sort by score desc) num
from stu) t1
where t1.num<4
;
注:NVL(exp1,exp2) 如果expr1为NULL,返回值为 expr2,否则返回expr1。
LAG(col,n,DEFAULT) 用于统计窗口内往上第n行值.第一个参数为列名,第二个参数为往上第n行(可选,默认为1),第三个参数为默认值(当往上第n行为NULL时候,取默认值,如不指定,则为NULL)
四、用户新增
4.1、统计每日新用户数量
网站用户活动表(User_Activities),表中包含字段UserId(用户编号)、Activity(用户活动,可取值login、orders、logout)、Date(活动日期),现需要编写一个查询语句,统计网站每日的新用户数量,即统计每日登录用户中首次登录网站的用户数。
step1:按照user_id 进行聚合,计算最早的登陆日期
step2:将step1作为临时表,按照登录日期进行聚合,分组统计每个login_date的用户数量
select login_date
,count(UserId)as user_count
from(
select UserId,min(Date)as login_date
from User_Activities
where Activity='login'
group by UserId
)ua
group by login_date
order by login_date;
4.2 统计某月的新增用户数
step1:按照userid进行聚合
step2:使用having 挑选出最早登录日期为 2017-11 的
step3: 使用count 统计用户数
SELECT count(user_id)
FROM test_sql.test3
GROUP BY user_id
HAVING date_format(min(dt),'yyyy-MM')='2017-11';
4.3 分析用户行为习惯,找到每一个用户在表中的第一次行为
数据集
1,time1,read
3,time2,comment
1,time3,share
2,time4,like
1,time5,write
2,time6,like
3,time7,write
2,time8,read
create table if not exists user_action_log(
uid string,
time string,
action string
)
row format delimited fields terminated by ',';
select t1.uid
,t1.time
,t1.action
from user_action_log t1
join
(select uid as uid
,min(time) as time
from user_action_log
group by uid) t2
on t1.uid=t2.uid
and t1.time=t2.time
;
或者
select
*
from
(select
first_value(uid) over(partition by uid order by time asc) uid
,first_value(time) over(partition by uid order by time asc) time
,first_value(action) over(partition by uid order by time asc) action
from user_action_log) t1
group by uid,time,action;
五、行列互换
5.1、订单及订单类型行列互换
数据源
order_id order_type order_time
111 N 10:00
111 A 10:05
111 B 10:10
是用hql获取结果如下:
order_id order_type_1 order_type_2 order_time_1 order_time_2
111 N A 10:00 10:05
111 A B 10:05 10:10
select *
from
(select order_id
,order_type as type1
,lead(order_type) over(partition by order_id order by order_time) as type2
,order_time as time1
,lead(order_time) over(partition by order_id order by order_time) as time2
from order_type
) as t1
where t1.type2 is not null
六、其他
6.1 计算除去部门最高工资,和最低工资的平均工资
step1:使用rank(),按照部分分区,分别按照工资进行升序,降序排序
step2:将step1作为临时表,分别取出排名第一个高的最高工资,和排名第一低的最低工资
step3:按照部门聚合,求取平均工资
select a.deptno,avg(a.salary)
from
(
select *
,rank() over( partition by deptno order by salary ) as rank_1
,rank() over( partition by deptno order by salary desc) as rank_2
from emp
) a
where a.rank_1 >1 and a.rank_2 >1
group by a.deptno
6.2 、某天每个直播间最大在线人数或者什么时候直播间人数达到峰值
一张表有如下字段:
1)直播间: live_id
2)用户:userid
3)时间戳:date_stamp
4)登陆类型:entry_type (登入和登出)
求直播间在某一天同一时间点的最大在线用户数?
step1:根据用户的进入和出去使用case when 进行类型判断,进入为1,出去为-1
step2:使用sum开窗,按照直播间进行分区,按照时间升序排序,某一时刻直播间的人数。
step3:使用row_number(),按照直播间进行分区,在线人数进行降序拍序,取第一个
select live_id
,date_stamp
,sumflag
from
(select live_id
,userid
,date_stamp
,sumflag
,row_number()over(partiotion by live_id order by sumflag desc) rn
from
(select live_id
,userid
,date_stamp
,sum(flag) over (partiotion by live_id order by date_stamp) sumflag
from
(select live_id
, userid
, date_stamp as date_stamp
, case when entry_type = 'enter' then 1
when entry_type = 'out' then -1
else 0 end as flag
from live_tab lt1
)a
) b
) c
where rn=1;
七、 你知道的排名函数有哪些?说一说它们之间的区别?
row_number() over():没有并列,相同名次依顺序排
rank() over():有并列,相同名次空位
dense_rank() over():有并列,相同名次不空位
NVL(exp1,exp2) 如果expr1为NULL,返回值为 expr2,否则返回expr1。
LAG(col,n,DEFAULT) 用于统计窗口内往上第n行值.第一个参数为列名,第二个参数为往上第n行(可选,默认为1),第三个参数为默认值(当往上第n行为NULL时候,取默认值,如不指定,则为NULL)
Lead(col,n,DEFAULT) 用于统计窗口内往下第n行值.第一个参数为列名,第二个参数为往下第n行(可选,默认为1),第三个参数为默认值(当往上第n行为NULL时候,取默认值,如不指定,则为NULL)