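Set 2 [Window functions: top N per group]
Set 3 [DAU and retention: row-to-column pivot + the datediff function]
Set 6 [Window functions: sum() over()]
Set 7 [Creating temporary tables]
Set 8 [Row/column conversion: splitting one column into multiple rows (a better solution), string handling]
Set 9 [Hands-on DAU problems] (important)
Set 10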
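The questions come from "n套SQL面试题--行转列、留存、日活等". The answers in that article contain errors, so the queries here are written strictly against what each question asks for.
(A more clearly worded version of the questions is in "数据分析SQL面试题目9套汇总"; its answers contain errors as well.)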
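Approach:
(1) Handle repeated scenes first by building subquery a, which keeps the earliest time of each scene per id
(2) Add a row_n column that ranks the rows within each id
(3) Keep the first two rows of each group, then group by id and concatenate the scenes into one string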
select concat(temp.id, '-', group_concat(temp.scene order by temp.row_n separator '-'))
from
(select id, scene, time, row_number() over(partition by id order by time) as row_n
from
(select id, scene, min(time) as time
from tb
group by id, scene
order by id, scene) a
) temp
where row_n<=2
group by id
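The three steps above can be tried end to end with a small sketch. It runs the same pattern through Python's built-in sqlite3 module (SQLite 3.25+ is assumed for window functions) as a stand-in for MySQL. The sample rows are made up, the final string is assembled in Python rather than with group_concat, and ranking by each scene's earliest time is an assumption about what the question wants.

```python
import sqlite3

# SQLite is only a stand-in for MySQL here; the rows below are invented sample data.
conn = sqlite3.connect(":memory:")
conn.execute("create table tb (id integer, scene text, time text)")
conn.executemany(
    "insert into tb values (?, ?, ?)",
    [
        (1, "1001", "2019-01-01 10:00:00"),
        (1, "1002", "2019-01-01 09:00:00"),
        (1, "1001", "2019-01-01 08:00:00"),  # repeated scene, earlier visit
        (1, "1003", "2019-01-01 11:00:00"),
        (2, "1004", "2019-01-02 09:00:00"),  # only one scene for this id
    ],
)

rows = conn.execute(
    """
    select id, scene
    from (select id, scene,
                 row_number() over (partition by id order by first_time) as row_n
          from (select id, scene, min(time) as first_time  -- step 1: dedupe scenes
                from tb
                group by id, scene) a
         ) temp                                            -- step 2: rank per id
    where row_n <= 2                                       -- step 3: keep top 2
    order by id, row_n
    """
).fetchall()

# Concatenate "id-scene1-scene2" in Python, in rank order.
result = {}
for user_id, scene in rows:
    result.setdefault(user_id, [str(user_id)]).append(scene)
result = {user_id: "-".join(parts) for user_id, parts in result.items()}
print(result)
```

An id with fewer than two scenes simply yields a shorter string, because the row_n filter keeps whatever rows exist.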
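The retention definition used here is unusual, but it keeps the calculation simple. Normally, retention is the share of users who registered on a given day and are still logging in N days later.
Approach:
(1) Reuse table a by joining it to itself on the same uid
(2) Use datediff() to pick the rows where b.dayno is 1, 3, or 7 days after a.dayno, then pivot rows to columns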
update userinfo set dayno=str_to_date(dayno,'%Y-%m-%d');
select
a.dayno 日期,count(distinct a.uid) 活跃,
count(distinct case when datediff(b.dayno,a.dayno)=1 then a.uid end) 次留,
count(distinct case when datediff(b.dayno,a.dayno)=3 then a.uid end) 三留,
count(distinct case when datediff(b.dayno,a.dayno)=7 then a.uid end) 七留,
concat(count(distinct case when datediff(b.dayno,a.dayno)=1 then a.uid end)/count(distinct a.uid)*100,'%') 次日留存率,
concat(count(distinct case when datediff(b.dayno,a.dayno)=3 then a.uid end)/count(distinct a.uid)*100,'%') 三日留存率,
concat(count(distinct case when datediff(b.dayno,a.dayno)=7 then a.uid end)/count(distinct a.uid)*100,'%') 七日留存率
from userinfo a
left join userinfo b
on a.uid=b.uid
where a.app_name='相机'
AND
b.app_name='相机'
group by a.dayno;
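A minimal sketch of the self-join plus conditional counting, run through Python's sqlite3 module as a stand-in for MySQL. SQLite has no datediff(), so a julianday() subtraction is used in its place. The rows are invented sample data, the English column aliases are illustrative, and the app filter on b is placed in the join condition.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table userinfo (uid integer, app_name text, dayno text)")
conn.executemany(
    "insert into userinfo values (?, ?, ?)",
    [  # invented sample data
        (1, "相机", "2020-01-01"), (1, "相机", "2020-01-02"),
        (2, "相机", "2020-01-01"), (2, "相机", "2020-01-04"),
        (3, "相机", "2020-01-01"), (3, "相机", "2020-01-08"),
        (4, "other", "2020-01-01"),  # different app, must be ignored
    ],
)

rows = conn.execute(
    """
    select a.dayno,
           count(distinct a.uid) as active,
           count(distinct case when julianday(b.dayno) - julianday(a.dayno) = 1
                               then a.uid end) as day1,
           count(distinct case when julianday(b.dayno) - julianday(a.dayno) = 3
                               then a.uid end) as day3,
           count(distinct case when julianday(b.dayno) - julianday(a.dayno) = 7
                               then a.uid end) as day7
    from userinfo a
    left join userinfo b
      on a.uid = b.uid and b.app_name = '相机'
    where a.app_name = '相机'
    group by a.dayno
    order by a.dayno
    """
).fetchall()

for dayno, active, day1, day3, day7 in rows:
    # Retention rate = retained users / active users of that day.
    print(dayno, active, f"{day1 / active:.0%}", f"{day3 / active:.0%}", f"{day7 / active:.0%}")
```

Each case expression returns the uid only when the date gap matches, so count(distinct ...) ignores the nulls produced by every other row of the self-join.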
select fyear, fmonth, `value`,
sum(`value`) over(partition by fyear order by fyear, fmonth) as ysum,
sum(`value`) over(order by fyear, fmonth) `sum`
from
(select year(fdate) as fyear,
month(fdate) as fmonth,
sum(`value`) as `value`
from a2
group by fyear, fmonth
order by fyear, fmonth) b
select ID, Name, EmailAddress, max(LastLogon) latestlogon, count(distinct(date(LastLogon))) countlogon
from tb
group by ID
create temporary table temptb as
select name,Lastlogon,
row_number() over(partition by ID order by Lastlogon) num_logontime,
dense_rank() over(partition by ID order by date(Lastlogon)) num_logonday
from tb
-- Convert table a into table b
select qq, group_concat(game separator '_')
from a
group by qq
-- First create a temporary sequence table seq
create temporary table seq (
id int auto_increment not null,
primary key(id));
-- Insert as many rows as the largest number of pieces any one string is split into
insert into seq values(),(),(),(),();
select b.qq,
substring_index(substring_index(b.game,'_',seq.id),'_',-1) game
from seq cross join
(select b.*,
((length(game)-length(replace(game,'_','')))/length('_'))+1 as size
from b) b
on seq.id<=b.size
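There is another way to turn table b back into table a that needs no new sequence table. It comes from the original Zhihu post and relies on the built-in mysql.help_topic table. This approach is recommended, because no sequence table has to be created even when there is a lot of data. One string can be split into at most as many pieces as help_topic has rows.
Note that the help_topic_id column of help_topic starts at 0: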
select qq,
substring_index(substring_index(game,"_",help_topic_id+1),"_",-1) as game
from b
left join mysql.help_topic as h
on h.help_topic_id < (length(game)-length(replace(game,"_",""))+1);
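To see why the nested substring_index() call picks out one piece per help_topic_id, the logic can be mirrored in plain Python. The `substring_index` function below is a rough, illustrative emulation of the MySQL function, and `split_games` is a hypothetical helper name; neither comes from the original post.

```python
def substring_index(s, delim, count):
    """Rough Python emulation of MySQL's substring_index() (illustrative helper)."""
    parts = s.split(delim)
    if count > 0:
        return delim.join(parts[:count])   # everything left of the count-th delimiter
    if count < 0:
        return delim.join(parts[count:])   # everything right of the count-th from the end
    return ""


def split_games(game, delim="_"):
    # Number of pieces = number of delimiters + 1, as in the join condition.
    size = game.count(delim) + 1
    pieces = []
    # help_topic_id starts at 0, so ids 0 .. size-1 satisfy "help_topic_id < size".
    for help_topic_id in range(size):
        head = substring_index(game, delim, help_topic_id + 1)  # first id+1 pieces
        pieces.append(substring_index(head, delim, -1))         # last piece of those
    return pieces


print(split_games("lol_dnf_cf"))
```

The inner call keeps the first help_topic_id+1 pieces and the outer call keeps the last of them, so names of any length are split correctly.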
select imp_date, count(distinct qimei) DAU
from tmp_liujg_dau_based d
where imp_date>=20190601
group by imp_date
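The pitfall here: some users claimed a red packet without logging in. After the left join their is_new is null, so they are silently ignored in the counts unless they are handled explicitly.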
select a.imp_date,
count(distinct case when a.is_new=1 then a.qimei else null end) as '新用户数',
count(distinct case when a.is_new=0 then a.qimei else null end) as '老用户数',
count(distinct case when a.is_new=2 then a.qimei else null end) as '未登录用户',
FORMAT(sum(a.add_money)/count(distinct a.qimei),2) as '人均领取金额',
format(count(a.qimei)/count(distinct a.qimei),0) as '人均领取次数'
from
(
select p.imp_date,p.qimei,p.add_money,
(case
when d.is_new=1 then 1
when d.is_new=0 then 0
else 2
end) as is_new
from tmp_liujg_packed_based p
left join tmp_liujg_dau_based d
on p.imp_date=d.imp_date and p.qimei=d.qimei
where p.imp_date>=20190601
) a
group by a.imp_date
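The null pitfall is easy to reproduce on a tiny data set. The sketch below uses Python's sqlite3 module as a stand-in for MySQL, with invented rows in which user u3 claimed a red packet but never appears in the DAU table. The English column aliases are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table tmp_liujg_dau_based (imp_date integer, qimei text, is_new integer)")
conn.execute("create table tmp_liujg_packed_based (imp_date integer, qimei text, add_money real)")
conn.executemany(
    "insert into tmp_liujg_dau_based values (?, ?, ?)",
    [(20190601, "u1", 1), (20190601, "u2", 0)],  # invented sample data
)
conn.executemany(
    "insert into tmp_liujg_packed_based values (?, ?, ?)",
    [(20190601, "u1", 1.0), (20190601, "u2", 2.0), (20190601, "u3", 3.0)],
)

row = conn.execute(
    """
    select count(distinct case when d.is_new = 1 then p.qimei end)     as new_users,
           count(distinct case when d.is_new = 0 then p.qimei end)     as old_users,
           count(distinct case when d.is_new is null then p.qimei end) as not_logged_in,
           count(distinct p.qimei)                                     as all_users
    from tmp_liujg_packed_based p
    left join tmp_liujg_dau_based d
      on p.imp_date = d.imp_date and p.qimei = d.qimei
    where p.imp_date >= 20190601
    """
).fetchone()
print(row)
```

Without the third bucket, new_users plus old_users would fall one short of all_users, which is exactly the gap that the `else 2` branch closes in the query above.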
select substring(imp_date,1,6),count(distinct imp_date) as '领取天数',
count(distinct qimei) as '领取人数',
format(sum(add_money)/count(distinct qimei),2) as '人均领取金额',
format(count(qimei)/count(distinct qimei),0) as '人均领取次数'
from tmp_liujg_packed_based
where imp_date>=20190301
group by substring(imp_date,1,6)
select left(imp_date,6) '日期',
is_packed,
count(distinct qimei) '用户数量',
round(count(*)/count(distinct qimei)) '平均月活跃天数'
from(
select d.imp_date, d.qimei,
(case
when p.qimei is null then '未领取红包'
else '领取红包'
end) is_packed
from tmp_liujg_dau_based d
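left join (# one row per user per day, so several red packets in one day do not inflate count(*)
select distinct imp_date, qimei from tmp_liujg_packed_based) p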
on d.imp_date=p.imp_date and d.qimei=p.qimei) a
group by left(imp_date,6), is_packed
select left(a.imp_date,6) '日期', a.qimei '活跃用户', d.imp_date '注册日期'
from tmp_liujg_dau_based d
right join
(select *
from tmp_liujg_dau_based
where imp_date>=20190301) a
on a.qimei=d.qimei and d.is_new=1
order by `日期`
select imp_date,
count(distinct case when datediff(liu_date,imp_date)=1 and is_new=1 then qimei else null end)/count(distinct case when is_new=1 then qimei else null end) as '次留',
count(distinct case when datediff(liu_date,imp_date)=1 and is_new=1 and is_packed is not null then qimei else null end)/count(distinct case when is_packed is not null and is_new=1 then qimei else null end) as '领取红包用户次留',
count(distinct case when datediff(liu_date,imp_date)=1 and is_new=1 and is_packed is null then qimei else null end)/count(distinct case when is_packed is null and is_new=1 then qimei else null end) as '未领取红包用户次留'
from
(select a.*, p.qimei as is_packed
from
(select d1.*, d2.imp_date as liu_date, d2.qimei as liu_qimei
from tmp_liujg_dau_based d1
left join tmp_liujg_dau_based d2
on d1.qimei=d2.qimei
where d1.imp_date>=20190301) a
left join tmp_liujg_packed_based p
on p.imp_date = a.imp_date and p.qimei=a.qimei
) tmp
group by imp_date
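Approach:
(1) Whenever next-day retention is needed, left join the table to itself to get every pair of login dates for the same id, then pick the pairs that are exactly 1 day apart
(2) Left join the red packet table to learn whether an id claimed a red packet that day: if it did, is_packed holds the id, otherwise it is null
(3) Combine count(distinct case when ...) expressions, and do not forget the is_new=1 constraint. Normally, next-day retention rate = (new users of the day who log in on the following day) / (new users of the day), and next-day retention of red packet users = (new users who claimed a red packet that day and log in on the following day) / (new users who claimed a red packet that day). [The question seems intended to analyze how claiming a red packet affects next-day retention of new users.]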
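The two left joins and the conditional counts can be checked on a small example. This sketch uses Python's sqlite3 module as a stand-in for MySQL, with julianday() replacing datediff(). The short table names `dau` and `packed`, the ISO date strings, and the rows themselves are all assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table dau (imp_date text, qimei text, is_new integer)")
conn.execute("create table packed (imp_date text, qimei text)")
conn.executemany(
    "insert into dau values (?, ?, ?)",
    [  # invented sample data: u1, u2, u3 register on 2019-03-01
        ("2019-03-01", "u1", 1), ("2019-03-02", "u1", 0),
        ("2019-03-01", "u2", 1),
        ("2019-03-01", "u3", 1), ("2019-03-02", "u3", 0),
        ("2019-03-01", "u4", 0),
    ],
)
conn.executemany(
    "insert into packed values (?, ?)",
    [("2019-03-01", "u1"), ("2019-03-01", "u1"), ("2019-03-01", "u2")],
)

rows = conn.execute(
    """
    select d1.imp_date,
           count(distinct d1.qimei) as new_users,
           count(distinct case when julianday(d2.imp_date) - julianday(d1.imp_date) = 1
                               then d1.qimei end) as retained,
           count(distinct case when p.qimei is not null then d1.qimei end) as packed_new,
           count(distinct case when p.qimei is not null
                                and julianday(d2.imp_date) - julianday(d1.imp_date) = 1
                               then d1.qimei end) as packed_retained
    from dau d1
    left join dau d2 on d1.qimei = d2.qimei                      -- step 1: login date pairs
    left join packed p
      on p.imp_date = d1.imp_date and p.qimei = d1.qimei         -- step 2: claimed that day?
    where d1.is_new = 1                                          -- step 3: new users only
    group by d1.imp_date
    """
).fetchall()

for imp_date, new_users, retained, packed_new, packed_retained in rows:
    unpacked_new = new_users - packed_new
    unpacked_retained = retained - packed_retained
    print(imp_date, retained / new_users,
          packed_retained / packed_new, unpacked_retained / unpacked_new)
```

Because every count uses distinct, the duplicate red packet row for u1 and the fan-out of the self-join do not distort the numbers.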
select imp_date, qimei,add_money
from
(select b.*,
row_number() over(partition by imp_date,qimei order by report_time) as seq
from(
select a.imp_date,a.qimei, p.report_time, p.add_money
from
(select *
from tmp_liujg_dau_based
where is_new=1 and imp_date>=20190601) a
inner join tmp_liujg_packed_based p
on a.qimei=p.qimei) b
) tmp
where seq=1
select imp_date,qimei,first_date,second_date,TIMESTAMPDIFF(minute,first_date,second_date) as '时间差'
from
(# pivot rows to columns so the difference is easy to compute
select imp_date,qimei,
max(case when seq=1 then report_time else null end) as first_date,
max(case when seq=2 then report_time else null end) as second_date
from
(# add a per-group row number so the top 2 of each group can be selected
select a.*,row_number() over(partition by imp_date,qimei order by report_time) as seq
from
(# fetch every red packet record for these ids
select d.imp_date, d.qimei, p.report_time
from tmp_liujg_dau_based d
inner join tmp_liujg_packed_based p
on d.qimei=p.qimei
where d.qimei in
(# keep the ids of users who claimed a red packet on their registration day
select distinct d.qimei
from tmp_liujg_dau_based d
left join tmp_liujg_packed_based p
on d.imp_date = p.imp_date and d.qimei=p.qimei
where d.is_new=1 and p.report_time is not null)
and d.is_new=1) a
) tmp
group by imp_date,qimei) b
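Approach:
(1) First, select the ids of users who claimed a red packet on their registration day
(2) Join the red packet table to get every red packet record for those ids
(3) Group by date and id, and rank the records within each group by claim time
(4) Pivot the first- and second-ranked records into columns so the difference can be computed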
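Steps (3) and (4) can be sketched in isolation: rank each user's red packet records, pivot the first two into columns, then take the difference. The example uses Python's sqlite3 module (SQLite 3.25+ assumed) as a stand-in for MySQL. The table name `packed` and its rows are invented, and the minute difference is computed in Python because SQLite has no TIMESTAMPDIFF().

```python
import sqlite3
from datetime import datetime

conn = sqlite3.connect(":memory:")
conn.execute("create table packed (qimei text, report_time text)")
conn.executemany(
    "insert into packed values (?, ?)",
    [  # invented sample data
        ("u1", "2019-06-01 10:00:00"),
        ("u1", "2019-06-01 10:30:00"),
        ("u1", "2019-06-01 12:00:00"),
        ("u2", "2019-06-01 09:00:00"),  # only one red packet
    ],
)

rows = conn.execute(
    """
    select qimei,
           max(case when seq = 1 then report_time end) as first_date,
           max(case when seq = 2 then report_time end) as second_date
    from (select qimei, report_time,
                 row_number() over (partition by qimei order by report_time) as seq
          from packed) t
    group by qimei
    order by qimei
    """
).fetchall()


def minutes_between(first, second):
    """Whole minutes from first to second, or None if either timestamp is missing."""
    if first is None or second is None:
        return None
    fmt = "%Y-%m-%d %H:%M:%S"
    delta = datetime.strptime(second, fmt) - datetime.strptime(first, fmt)
    return int(delta.total_seconds() // 60)


result = {qimei: minutes_between(first, second) for qimei, first, second in rows}
print(result)
```

A user with a single red packet keeps a null second_date, so the difference stays empty instead of raising an error.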
select g.department as department, g.game_name as game_name, sum(i.income_money) as sum_income_money
from game g
left join income i
on g.game_id=i.game_id
where i.income_time BETWEEN '2020-01-01' and '2020-03-31'
group by g.department, g.game_id