Preface
Some of the SQL in this article may not cover every edge case; corrections are welcome.
Article structure:
1. Advanced SQL: window functions
2. Pinduoduo interview questions
Case study: product order data
Case study: campaign operations data
Case study: user behavior path analysis
Case study: user retention analysis
Case study: statistical features (median, mode, quartiles)
Case study: week-over-week GMV
Case study: continuous ranges
3. Yuanfudao interview questions
Case study: student score analysis
Case study: student exercise analysis
4. High-frequency Hive interview topic: rows to columns
I. Advanced SQL: window functions
1. Window functions
How window functions differ from ordinary aggregate functions:
① An aggregate function collapses multiple rows into one; a window function executes once per row, so however many rows go in, the same number come out.
② Aggregate functions can also be used as window functions.
The reason lies in the (logical) execution order: window functions run after FROM, JOIN, WHERE, GROUP BY and HAVING, but before ORDER BY, LIMIT and SELECT DISTINCT. By the time a window function runs, GROUP BY aggregation has already finished, so no further aggregation takes place.
Note: because window functions are evaluated after WHERE, a WHERE clause cannot use a window function directly as a condition; you need one more query level and must filter outside the subquery, for example:
select user_id,avg(diff)
from
(
select user_id,log_time,
       datediff(lead(log_time,1)over(partition by user_id order by log_time),log_time) as diff
from user_log
)t
where datediff(now(),t.log_time)<=30
group by user_id
2. Basic usage of window functions:
function_name OVER (window_spec)
The OVER keyword specifies the window the function operates on. If the parentheses are empty, the window contains all rows satisfying the WHERE clause and the function is computed over all of them; otherwise, the following four kinds of clauses define the window.
① window_name: gives the window an alias. When a query involves several windows, aliases keep it clear and readable.
② PARTITION BY clause: the columns the window is partitioned by; the window function executes separately within each partition.
③ ORDER BY clause: the columns to sort by; the window function processes rows following this sort order.
④ frame clause: a frame is a subset of the current partition; this clause defines the rules for that subset and is typically used to build sliding windows.
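As a sketch of a frame clause in use (the table daily_sales(sale_date, amount) is hypothetical, made up for illustration), a 3-row moving average can be written as:

```sql
-- Hypothetical table daily_sales(sale_date, amount):
-- average over a sliding frame of the current row plus the two preceding rows.
SELECT sale_date,
       amount,
       AVG(amount) OVER (
           ORDER BY sale_date
           ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
       ) AS moving_avg_3d
FROM daily_sales;
```

ROWS counts physical rows; RANGE instead bounds the frame by value, which matters when the ordering column has ties.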
3. (Interview favorite) Ranking functions: the differences between row_number(), rank() and dense_rank()
ROW_NUMBER(): sequential numbering, e.g. 1, 2, 3
RANK(): ties share a rank and the following ranks are skipped, e.g. 1, 1, 3
DENSE_RANK(): ties share a rank and no ranks are skipped, e.g. 1, 1, 2
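Running the three side by side on the t_score table used below makes the difference visible (assuming two rows tie on score, e.g. 98, 98, 88):

```sql
-- On scores 98, 98, 88 ordered descending:
-- rn = 1,2,3; rk = 1,1,3; drk = 1,1,2
SELECT stu_id, score,
       ROW_NUMBER() OVER w AS rn,
       RANK()       OVER w AS rk,
       DENSE_RANK() OVER w AS drk
FROM t_score
WINDOW w AS (ORDER BY score DESC);
```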
4. Distribution functions: percent_rank(), cume_dist()
percent_rank():
Each row is computed as (rank - 1) / (rows - 1), where rank is the number produced by RANK() and rows is the total row count of the current window.
--give the window an alias: WINDOW w AS (PARTITION BY stu_id ORDER BY score); here rows = 5
mysql> SELECT
-> RANK() OVER w AS rk,
-> PERCENT_RANK() OVER w AS prk,
-> stu_id, lesson_id, score
-> FROM t_score
-> WHERE stu_id = 1
-> WINDOW w AS (PARTITION BY stu_id ORDER BY score)
-> ;
+----+------+--------+-----------+-------+
| rk | prk | stu_id | lesson_id | score |
+----+------+--------+-----------+-------+
| 1 | 0 | 1 | L003 | 79 |
| 2 | 0.25 | 1 | L002 | 86 |
| 3 | 0.5 | 1 | L004 | 88 |
| 4 | 0.75 | 1 | L005 | 98 |
| 4 | 0.75 | 1 | L001 | 98 |
+----+------+--------+-----------+-------+
cume_dist():
(rows in the partition with a rank value less than or equal to the current row's) / (total rows in the partition). E.g.: the proportion of scores less than or equal to the current score.
--cd1: no partition, so all rows form a single group of 8
--cd2: partitioned by lesson_id into two groups of 4 rows each
mysql> SELECT stu_id, lesson_id, score,
-> CUME_DIST() OVER (ORDER BY score) AS cd1,
-> CUME_DIST() OVER (PARTITION BY lesson_id ORDER BY score) AS cd2
-> FROM t_score
-> WHERE lesson_id IN ('L001','L002')
-> ;
+--------+-----------+-------+-------+------+
| stu_id | lesson_id | score | cd1 | cd2 |
+--------+-----------+-------+-------+------+
| 2 | L001 | 84 | 0.125 | 0.25 |
| 1 | L001 | 98 | 0.75 | 0.5 |
| 4 | L001 | 99 | 0.875 | 0.75 |
| 3 | L001 | 100 | 1 | 1 |
| 1 | L002 | 86 | 0.25 | 0.25 |
| 4 | L002 | 88 | 0.375 | 0.5 |
| 2 | L002 | 90 | 0.5 | 0.75 |
| 3 | L002 | 91 | 0.625 | 1 |
+--------+-----------+-------+-------+------+
5. Offset functions: lag(expr,n), lead(expr,n). Purpose: return the value of expr from the row n rows before (LAG(expr,n)) or after (LEAD(expr,n)) the current row.
Use case: compute the difference between the previous student's score and the current student's score.
mysql> SELECT stu_id, lesson_id, score, pre_score,
-> score-pre_score AS diff
-> FROM(
-> SELECT stu_id, lesson_id, score,
-> LAG(score,1) OVER w AS pre_score
-> FROM t_score
-> WHERE lesson_id IN ('L001','L002')
-> WINDOW w AS (PARTITION BY lesson_id ORDER BY score)) t
-> ;
+--------+-----------+-------+-----------+------+
| stu_id | lesson_id | score | pre_score | diff |
+--------+-----------+-------+-----------+------+
| 2 | L001 | 84 | NULL | NULL |
| 1 | L001 | 98 | 84 | 14 |
| 4 | L001 | 99 | 98 | 1 |
| 3 | L001 | 100 | 99 | 1 |
| 1 | L002 | 86 | NULL | NULL |
| 4 | L002 | 88 | 86 | 2 |
| 2 | L002 | 90 | 88 | 2 |
| 3 | L002 | 91 | 90 | 1 |
+--------+-----------+-------+-----------+------+
6. First/last-value functions: FIRST_VALUE(expr), LAST_VALUE(expr). Purpose: return the first (FIRST_VALUE(expr)) or last (LAST_VALUE(expr)) value of expr in the window.
Use case: ordered by date, return the first and the most recent score seen up to the current row.
mysql> SELECT stu_id, lesson_id, score, create_time,
-> FIRST_VALUE(score) OVER w AS first_score,
-> LAST_VALUE(score) OVER w AS last_score
-> FROM t_score
-> WHERE lesson_id IN ('L001','L002')
-> WINDOW w AS (PARTITION BY lesson_id ORDER BY create_time)
-> ;
+--------+-----------+-------+-------------+-------------+------------+
| stu_id | lesson_id | score | create_time | first_score | last_score |
+--------+-----------+-------+-------------+-------------+------------+
| 3 | L001 | 100 | 2018-08-07 | 100 | 100 |
| 1 | L001 | 98 | 2018-08-08 | 100 | 98 |
| 2 | L001 | 84 | 2018-08-09 | 100 | 99 |
| 4 | L001 | 99 | 2018-08-09 | 100 | 99 |
| 3 | L002 | 91 | 2018-08-07 | 91 | 91 |
| 1 | L002 | 86 | 2018-08-08 | 91 | 86 |
| 2 | L002 | 90 | 2018-08-09 | 91 | 90 |
| 4 | L002 | 88 | 2018-08-10 | 91 | 88 |
+--------+-----------+-------+-------------+-------------+------------+
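Note that when a window has an ORDER BY but no explicit frame, the default frame is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, so LAST_VALUE returns the value from the current row (or its peers), which is why last_score above tracks each row rather than the partition's final value. To get the true last value of the whole partition, spell the frame out:

```sql
-- Explicit frame spanning the whole partition, so LAST_VALUE
-- returns the partition's final score on every row.
SELECT stu_id, lesson_id, score, create_time,
       LAST_VALUE(score) OVER (
           PARTITION BY lesson_id ORDER BY create_time
           ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
       ) AS last_score
FROM t_score
WHERE lesson_id IN ('L001','L002');
```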
Pinduoduo interview questions
Note: some of these come from written tests; the interview questions are reconstructed from other people's secondhand accounts, so the formatting is not entirely uniform.
II. Case study: product order data
Tables:
Order table orders, with roughly these fields: order_id (order number), user_id (user ID), order_pay (order amount), order_time (order time), commodity_level_1 (level-1 product category), commodity_level_2 (level-2 product category)
1. For the last 7 days, find the top 3 level-2 categories by total sales under each level-1 category:
select commodity_level_1,
       commodity_level_2,
       total_pay
from
(
select commodity_level_1,commodity_level_2,total_pay,
row_number()over(partition by commodity_level_1 order by total_pay desc) as rn
from
(
select commodity_level_1,
       commodity_level_2,
       sum(order_pay) as total_pay
from orders
where datediff(now(),order_time) <= 7
group by commodity_level_1,commodity_level_2
) a
) b
where rn <= 3
2. For each day from 8.1 to 8.10, extract the user_ids ranked 101st to 195th by daily spend.
select order_date,
       user_id,
       total_pay
from
(
select order_date,user_id,total_pay,
row_number()over(partition by order_date order by total_pay desc) as rn
from
(
select convert(order_time,date) as order_date,
       user_id,
       sum(order_pay) as total_pay
from orders
where convert(order_time,date) between '20190801' and '20190810'
group by convert(order_time,date),user_id
) a
) b
where rn between 101 and 195
III. Case study: campaign operations data
Tables
Table 1: order table orders, with roughly these fields: user_id (user ID), order_pay (order amount), order_time (order time)
Table 2: campaign signup table act_apply, with roughly these fields: act_id (campaign ID), user_id (signed-up user), act_time (signup time)
1. For each campaign, compute the total order amount and total order count generated by all signed-up users after their signup.
select a.act_id, count(*) as order_cnt, sum(order_pay) as total_pay
from act_apply a left join orders o on o.user_id = a.user_id
where o.order_time >= a.act_time
group by a.act_id
2. For each campaign, compute the average number of orders per day from the campaign's start to today (the test date); the start time is defined as the earliest signup time. (All time fields are of type datetime.)
Approach:
(1) Find each campaign's earliest signup time with the window function min() over().
(2) Compute the day gap between that start time and now: datediff(now(),act_start)
select b.act_id,count(*)/datediff(now(),act_start) as avg_daily_orders
from orders a
left join
(
select act_id,user_id,act_time,min(act_time)over(partition by act_id) as act_start
from act_apply
)b
on a.user_id = b.user_id
where a.order_time >= b.act_time
group by b.act_id,act_start
IV. Case study: user behavior path analysis
Table 1: behavior log tracking_log, with roughly these fields: user_id (user ID), opr_id (operation ID), log_time (operation time)
1. For each day, count the users who performed operation B immediately after operation A (A and B must be adjacent).
Approach: use a window function to add a column to every row: the ID of the next operation.
Remember to convert the datetime to a date to represent the day.
select log_date,count(*) as ab_path_users
from
(
select distinct user_id,opr_id as cur_opr,convert(log_time,date) as log_date,
lead(opr_id,1)over(partition by user_id,convert(log_time,date) order by log_time) as next_opr
from tracking_log
) t
where t.cur_opr = 'A' and t.next_opr = 'B'
group by log_date
2. Count the users whose behavior sequence is A-B-D,
where any other operations (e.g. C, E) may appear between A and B, and any operations except C (e.g. A, E) may appear between B and D.
select count(user_id) as result
from
(
select user_id,group_concat(opr_id order by log_time) as user_behavior_path
from tracking_log
group by user_id
having (user_behavior_path like '%A%B%D%')
and (user_behavior_path not like '%A%B%C%D%')
) t
V. Case study: user retention analysis
Table 1: login table user_log, with roughly these fields: user_id (user ID), log_date (login time)
1. Compute the number of new users per day, together with their day-2 and day-30 retention rates.
Solution 1: no window functions, just CASE WHEN
select first_date,
       count(distinct l.user_id) as new_users,
round(
count(distinct case when datediff(log_date,first_date) = 1 then l.user_id else null end) /
count(distinct l.user_id)
,
2
) as day2_retention,
round(
count(distinct case when datediff(log_date,first_date) = 29 then l.user_id else null end) /
count(distinct l.user_id)
,
2
) as day30_retention
from user_log l left join
(
select user_id,min(log_date) as first_date
from user_log
group by user_id
)t on l.user_id = t.user_id
group by first_date
Solution 2:
select a.`date` as log_date,a.`new` as new_users,
       b.back_2 / a.`new` as day2_return_rate,
       c.back_30 / a.`new` as day30_return_rate
from
(
select convert(aa.first_date,date) as `date`,count(distinct aa.user_id) as `new`
from
(
--find each user's first login; grouping by the first-login date and counting user_id gives daily new users
select user_id,min(log_date)over(partition by user_id) as first_date,log_date
from user_log
)aa
group by convert(aa.first_date,date)
)a
left join
(
--users returning on day 2: the gap between a login and the first login is 1 day
select convert(bb.first_date,date) as `date`,count(distinct bb.user_id) as back_2
from
(
select user_id,log_date,min(log_date)over(partition by user_id) as first_date
from user_log
)bb
where datediff(bb.log_date,bb.first_date) = 1
group by convert(bb.first_date,date)
)b on a.`date` = b.`date`
left join
(
--users returning on day 30: the gap between a login and the first login is 29 days
select convert(cc.first_date,date) as `date`,count(distinct cc.user_id) as back_30
from
(
select user_id,log_date,min(log_date)over(partition by user_id) as first_date
from user_log
)cc
where datediff(cc.log_date,cc.first_date) = 29
group by convert(cc.first_date,date)
)c on a.`date` = c.`date`
2. Find the number of users who logged in during the last 90, 30 and 7 days.
select count(distinct case when datediff(now(),log_time)<=90 then user_id else null end) as log_users_90d,
       count(distinct case when datediff(now(),log_time)<=30 then user_id else null end) as log_users_30d,
       count(distinct case when datediff(now(),log_time)<=7 then user_id else null end) as log_users_7d
from user_log
3. Compute each user's average login interval (in days) over the last month.
select user_id,avg(diff) as avg_interval
from
(
select user_id,log_time,
       datediff(lead(log_time,1)over(partition by user_id order by log_time),log_time) as diff
from user_log
)t
where datediff(now(),t.log_time)<=30
group by user_id
VI. Case study: statistical features (median, mode, quartiles)
Fields: shop id (shop_id), sales volume (sale), product id (commodity_id)
1. Find the median product sales volume for each shop.
Solution 1: the straightforward approach
Let cnt be the length of the sorted sales sequence of a shop.
When cnt is even, the median occupies positions cnt/2 and cnt/2 + 1.
When cnt is odd, the median occupies position ceil(cnt/2).
select shop_id,avg(sale) as median
from
(
select shop_id,sale,
count(1)over(partition by shop_id) as total, --row count per shop
--for the even-length case
cast(count(1)over(partition by shop_id) as decimal) / 2 as even_mid,
--for the odd-length case
ceiling(cast(count(1)over(partition by shop_id) as decimal)/2) as odd_mid,
row_number()over(partition by shop_id order by sale) as sale_rank
from orders
) t
where ( total%2=0 and sale_rank in (even_mid,even_mid+1)) or ( total%2=1 and sale_rank = odd_mid)
group by shop_id
Solution 2:
abs(rn - (cnt+1)/2) < 1
Explanation of the formula:
rn is the rank of a row in the ordered sequence of length cnt.
E.g. for 1,2,3,4,5 the median sits at position 3, and 3 - (5+1)/2 = 0.
For 1,2,3,4 the median sits at positions 2 and 3:
2 - (4+1)/2 = -0.5
3 - (4+1)/2 = 0.5
So (cnt+1)/2 is the middle position of the sequence; for an odd length it is exactly where the median sits,
and for an even length, both middle positions satisfy abs(rn - (cnt+1)/2) < 1.
select shop_id,avg(sale) as median
from
(
select shop_id,sale,
row_number() over (partition by shop_id order by sale) as rn, -- rank of each sales value within its shop
count(1) over (partition by shop_id) as cnt -- record count per shop
from orders
) t
where abs(rn - (cnt+1)/2) < 1 -- rows ranked at the middle of the shop's records, i.e. the median
group by shop_id
Solution 3: no window functions, no sorting; use the definition of the median.
For an odd-length array, the median is the value for which the count of larger values equals the count of smaller values.
For an even-length array of distinct values, the median is the average of the two middle values after sorting; for each of those two values, the absolute difference between the count of larger values and the count of smaller values is 1, which is exactly that value's frequency.
In general, whether the length is odd or even and whether values repeat or not, a median value's frequency is always greater than or equal to the absolute difference between the count of values above it and the count of values below it.
1. Use a self-join.
2. Each value's frequency: sum(case when t1.sale = t2.sale then 1 else 0 end)
3. The absolute difference between the counts of larger and smaller values: abs(sum(sign(t1.sale - t2.sale)))
select shop_id,avg(sale) as median
from
(
select t1.shop_id,t1.sale
from orders t1 left join orders t2
on t1.shop_id = t2.shop_id
group by t1.shop_id,t1.sale
having sum(case
           when t1.sale = t2.sale then 1
           else 0
       end) >= abs(sum(sign(t1.sale - t2.sale)))
)t
group by t.shop_id
2. Find the mode: the most-ordered product in each shop.
select shop_id,commodity_id,order_num
from
(
select shop_id,commodity_id,order_num,max(order_num)over(partition by shop_id) as max_order_num
from
(
select shop_id,commodity_id,count(*) as order_num
from orders
group by shop_id,commodity_id
) t1
) t2
where order_num = max_order_num
3. Quartiles
Three tables:
T1: good_id, cate_id (category)
T2: mall_id (shop), good_id
T3: mall_id, credit_score (credit score)
Question: within each category, find the shops whose credit score is in the top 25%.
select cate_id,mall_id,credit_score
from
(
select cate_id,mall_id,
       credit_score,
       row_number()over(partition by cate_id order by credit_score desc) as score_rn,
       count(*)over(partition by cate_id) as mall_total
from
(--category id, shop id, shop credit score
select T1.cate_id,T2.mall_id,credit_score
from T1 left join (T2 left join T3 on T2.mall_id = T3.mall_id) on T1.good_id = T2.good_id
group by T1.cate_id,T2.mall_id
)t
)t2
where score_rn <= mall_total*0.25
VII. Case study: week-over-week GMV
Fields: date (sale_date), shop category (cate_id), shop count (mall_num), gmv
1. For each shop category, compute the average daily GMV lift (absolute and relative) in the week after Pinduoduo's 618 event versus the week before.
Note: the following only works when the data is continuous by day; if days are missing (e.g. the previous week lacks Saturday and Sunday), a fixed offset like lead(gmv,7) will pair up the wrong rows.
select cate_id,avg(gmv_diff) as avg_gmv_diff,avg(gmv_rate) as avg_gmv_rate
from
(
select sale_date,cate_id,mall_num,gmv,
(lead(gmv,7)over(partition by cate_id order by sale_date)-gmv) as gmv_diff,
(lead(gmv,7)over(partition by cate_id order by sale_date)-gmv) / gmv as gmv_rate
from T
where sale_date between '20190611' and '20190624'
)t
where sale_date between '20190611' and '20190617'
group by cate_id
2. Bucket shops by their GMV in the week before 618 (top 20%, 20%-40%, and so on, five buckets), and compute each bucket's average daily GMV lift (absolute and relative) in the week after 618.
select t1.cate_id,avg(gmv_diff) as avg_gmv_diff,avg(gmv_rate) as avg_gmv_rate
from
(
select sale_date,cate_id,mall_num,gmv,
(lead(gmv,7)over(partition by cate_id order by sale_date)-gmv) as gmv_diff,
(lead(gmv,7)over(partition by cate_id order by sale_date)-gmv) / gmv as gmv_rate
from T
where sale_date between '20190611' and '20190624'
)t1 left join
(
select cate_id,
case when gmv_rank <= total*0.1 then '10%'
     when gmv_rank <= total*0.2 and gmv_rank > total*0.1 then '10%-20%'
     when gmv_rank <= total*0.3 and gmv_rank > total*0.2 then '20%-30%'
     when gmv_rank <= total*0.4 and gmv_rank > total*0.3 then '30%-40%'
     when gmv_rank <= total*0.5 and gmv_rank > total*0.4 then '40%-50%'
     when gmv_rank > total*0.5 then 'above 50%'
     else NULL
end as gmv_quantile
from
(
select cate_id,
       row_number()over(order by total_gmv desc) as gmv_rank,
       count(*)over() as total
from
(
select cate_id,
       sum(gmv) as total_gmv
from T
where sale_date between '20190611' and '20190617'
group by cate_id
)t3
)t4
)t2 on t1.cate_id = t2.cate_id
where t1.sale_date between '20190611' and '20190617'
and t2.gmv_quantile in ('10%','10%-20%','20%-30%','30%-40%','40%-50%')
group by t1.cate_id
VIII. Case study: continuous ranges
(A Pinduoduo second-round interview question: find the start and end positions of a session. I can't recall the exact wording, so here is a similar problem that shows the approach for this class of question.)
Table: Logs
+---------------+---------+
| Column Name   | Type    |
+---------------+---------+
| log_id        | int     |
+---------------+---------+
log_id is the primary key of this table. Each row contains one ID from the log. Some IDs were later deleted from Logs. Write a SQL query that returns the start and end numbers of each continuous range in Logs, ordered by start_id. The result format looks like the following example.
Logs table:
+------------+
| log_id     |
+------------+
| 1          |
| 2          |
| 3          |
| 7          |
| 8          |
| 10         |
+------------+
Result table:
+------------+------------+
| start_id   | end_id     |
+------------+------------+
| 1          | 3          |
| 7          | 8          |
| 10         | 10         |
+------------+------------+
The result contains all ranges present in Logs: 1 to 3 is in the table, 4 to 6 is not, 7 to 8 is, 9 is not, 10 is.
Approach: (for harder SQL interview questions, it's worth reviewing user-defined variables)
(1) Use two variables: @id records the previous log_id, so comparing the current log_id against it tells whether the sequence is continuous; @num stores the current run number.
select log_id,
case when @id = log_id - 1 then @num := @num
else @num := @num + 1
end num,
@id := log_id
from Logs,(select @num := 0,@id := null)t
The output of the query above:
Input: {"headers":{"Logs":["log_id"]},"rows":{"Logs":[[1],[2],[3],[7],[8],[10]]}}
Output: {"headers": ["log_id", "num", "@id := log_id"], "values": [[1, "1", 1], [2, "1", 2], [3, "1", 3], [7, "2", 7], [8, "2", 8], [10, "3", 10]]}
(2) Given that result, group by num: the smallest log_id in each group is start_id and the largest is end_id.
Final answer:
select min(log_id) start_id,
max(log_id) end_id
from
(
select log_id,
case when @id = log_id - 1 then @num := @num
else @num := @num + 1
end num,
@id := log_id
from Logs,(select @num := 0,@id := null)t
)t2
group by num
Yuanfudao interview questions
IX. Case study: student score analysis
表:Enrollments
+---------------+---------+
| Column Name | Type |
+---------------+---------+
| student_id | int |
| course_id | int |
| grade | int |
+---------------+---------+
(student_id, course_id) is the primary key of this table.
1. For each student, find the highest score and its course; if several courses tie on score, take the one with the smallest course_id. Sort the result by student_id ascending.
Solution 1: window functions
select a.student_id,a.course_id,a.grade
from
(
select student_id,course_id,grade,
row_number()over(partition by student_id order by course_id) as course_rank
from
(
select student_id,course_id,grade,
dense_rank()over(partition by student_id order by grade desc) as grade_rank
from Enrollments
) t
where t.grade_rank = 1
) a
where a.course_rank = 1
order by a.student_id
Solution 2: IN (simpler and faster)
select student_id,min(course_id) as course_id,grade
from Enrollments
where (student_id, grade) in (select student_id,max(grade) from Enrollments group by student_id)
group by student_id,grade
order by student_id
2. For each course, find the students with the highest and lowest scores; output course_id, student_id, score.
Approach:
(1) Query the highest and lowest scorers separately.
(2) Merge the two with UNION.
Background: UNION and UNION ALL
UNION takes the set union of two result sets; the combined columns must line up.
The difference: UNION scans for duplicates, excludes duplicate rows and applies a default sort, so it is slower;
UNION ALL keeps duplicate rows and does not sort.
Note that every SELECT inside a UNION must have the same number of columns, with compatible data types, in the same order.
If the two column lists differ in type, cast explicitly with convert(value,type).
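A minimal illustration of the difference:

```sql
-- UNION de-duplicates: returns a single row with value 1.
SELECT 1 AS x UNION SELECT 1;

-- UNION ALL keeps duplicates: returns two rows, both 1.
SELECT 1 AS x UNION ALL SELECT 1;
```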
-- query the highest scorers in each course
select e.course_id,e.student_id,e.grade as score
from Enrollments e left join
(
select course_id,max(grade) as max_grade
from Enrollments
group by course_id
) t on e.course_id = t.course_id
where e.grade = t.max_grade
Merging both halves with UNION ALL, the full answer is:
select e.course_id,e.student_id,e.grade as score
from Enrollments e left join
(
select course_id,max(grade) as max_grade
from Enrollments
group by course_id
) t on e.course_id = t.course_id
where e.grade = t.max_grade
union all
select e.course_id,e.student_id,e.grade as score
from Enrollments e left join
(
select course_id,min(grade) as min_grade
from Enrollments
group by course_id
) t on e.course_id = t.course_id
where e.grade = t.min_grade
X. Case study: student exercise analysis
Table t: exercise date (time), student id (stu_id), exercise id (exer_id)
Count the students who attempted a new (previously unseen) problem on every day from 10.1 to 10.10; the emphasis is on every day.
select count(1) as result
from
(
select a.stu_id,count(distinct convert(a.first_time,date)) as cnt
from
(
select stu_id,exer_id,time,min(time)over(partition by stu_id,exer_id) as first_time
from t
where convert(time,date) between '20191001' and '20191010'
) a
where a.time = a.first_time
group by a.stu_id
having cnt = 10
) b
XI. High-frequency Hive interview topic: rows to columns
1. UDFs in Hive
Hive has three kinds of user-defined functions:
1. user-defined functions (UDF);
2. user-defined aggregate functions (UDAF);
3. user-defined table-generating functions (UDTF).
A UDF operates on a single row and produces a single row as output. Most functions, such as math and string functions, fall into this category, e.g. round(), abs().
A UDAF accepts multiple input rows and produces one output row; COUNT and MAX are aggregate functions of this kind.
A UDTF operates on a single row and produces multiple rows (a table) as output,
e.g. explode().
In short:
UDF: returns one value, one-to-one | UDAF: returns an aggregate value, many-to-one | UDTF: returns split values, one-to-many
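One built-in example of each kind (the table nums(x) is hypothetical; round(), count() and explode() are standard Hive built-ins, and explode() is covered in detail below):

```sql
SELECT round(3.14159, 2);       -- UDF: one row in, one row out
SELECT count(*) FROM nums;      -- UDAF: many rows in, one row out
SELECT explode(array(1, 2, 3)); -- UDTF: one row in, many rows out
```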
2. lateral view explode()
This is a high-frequency interview topic and comes up often in day-to-day work.
Below is an example using user behavior data from a search product; for simplicity, only some of the log fields are shown.
search_id: uniquely identifies each search
query: the search term
click_detail: the click-detail field
Sample row:
1,'知乎',[{"pos":"4","time":"20190411 5:30:00","type":"A"},{"pos":"3","time":"20190411 5:30:00","type":"B"},{"pos":"2","time":"20190411 5:30:00","type":"C"}]
The data here is made up, and real logs are more complex; it is only meant to show the format.
click_detail is an Array describing the clicks under one page of search results: pos is the click position, time the click time, and type the type of the search result.
Each element is a Map.
Task: for a given day, count the clicks on results of type A for the query 知乎.
Step 1: explode()
explode() takes an Array or Map as input and outputs each element of the Array (Map) as a new row, turning a complex structure from one row into many.
For example, on the sample data above, running
select explode(click_detail) as click from t;
produces:
{"pos":"4","time":"20190411 5:30:00","type":"A"}
{"pos":"3","time":"20190411 5:30:00","type":"B"}
{"pos":"2","time":"20190411 5:30:00","type":"C"}
But explode() has its limitations; one is that it cannot be selected alongside other columns: a UDTF used this way must be the only expression in the select list. For example:
--the following statement raises an error
select search_id,query,explode(click_detail) as click from t;
Step 2: lateral view explode()
lateral view is used together with UDTFs. It puts the rows generated by the UDTF into a virtual table (the lateral view), which is then joined back to the main table, so you can select the UDTF-generated column alongside the original columns.
select count(search_id) as target_clicks
from t lateral view explode(click_detail) tmp as click
where partition_stat_date = 20190522 --date partition
and click['type'] = 'A'
and query = '知乎'