开窗函数
需求:
既要显示聚合前的数据,又要显示聚合后的结果
id name sal
1 zs 3w
2 ls 2.5w
3 ww 2w
rank
id name sal rank
1 zs 3w 1
2 ls 2.5w 2
3 ww 2w 3
开窗函数:
窗口函数: 窗口 + 函数
窗口: 函数运行时计算的数据集的范围
函数:运行时的函数:
1.常用的聚合函数
2.窗口内置的函数
RANK:从1开始,按照顺序,生成组内记录的编号,排序相同会重复,在名次中留下空位
ROW_NUMBER:从1开始,按照顺序,生成组内记录的编号,序号没有重复的
DENSE_RANK:从1开始,按照顺序,生成组内记录的编号,排序相同会重复,在名次中不留下空位
NTILE(n):把数据按照姓名进行分组 时间排序 结果数据分成 3份数
LAG 向上取第几行
LEAD向下取第几行
FIRST_VALUE(col):取分组后 截止到当前行 第一个值
LAST_VALUE(col):取分组后 截止到当前行 最后一个值
语法结构:
函数 over([partition by xxx,...] [order by xxx,...])
over: 以谁进行开窗 =》 table
partition by:以谁进行分组(类似group by ) =》 字段
order by:以谁进行排序 =》字段
1.聚合函数 -》开窗
数据:
zuoshao,2022-03-10,1
zuoshao,2022-03-11,2
zuoshao,2022-03-12,7
zuoshao,2022-03-13,3
zuoshao,2022-03-14,2
zuoshao,2022-03-15,4
zuoshao,2022-03-16,4
create table zuoshao_window(
name varchar(255),
dt varchar(20),
cnt int
);
需求:
统计累计的问题,每个用户每天累计点外卖的次数
name dt cnt sum_cnt
zuoshao,2022-03-10,1,1
zuoshao,2022-03-11,2,3
zuoshao,2022-03-12,7,10
zuoshao,2022-03-13,3,13
zuoshao,2022-03-14,2,15
zuoshao,2022-03-15,4,19
zuoshao,2022-03-16,4,23
函数 over([partition by xxx,...] [order by xxx,...])
2.指定窗口大小
select
name,
dt,
cnt,
sum(cnt) over(partition by name order by dt) as sum_cnt,
//无上限到当前行
sum(cnt) over(partition by name order by dt rows between unbounded preceding and current row ) as sum_cnt1,
//不排序 所有
sum(cnt) over(partition by name ) as sum_cnt2,
//当前行的前3行到当前行
sum(cnt) over(partition by name order by dt rows between 3 preceding and current row) as sum_cnt3,
//当前行的前3行到后一行
sum(cnt) over(partition by name order by dt rows between 3 preceding and 1 following) as sum_cnt4,
当前行到底行
sum(cnt) over(partition by name order by dt rows between current row and unbounded following) as sum_cnt5
from zuoshao_window;
select
name,
dt,
cnt,
sum(cnt) over(partition by name order by dt) as sum_cnt,
sum(cnt) over(partition by name ) as sum_cnt2,
sum(cnt) over(order by dt ) as sum_cnt3,
sum(cnt) over( ) as sum_cnt4
from zuoshao_window
order by dt,name ;
+---------+------------+------+---------+----------+----------+----------+
| name | dt | cnt | sum_cnt | sum_cnt2 | sum_cnt3 | sum_cnt4 |
+---------+------------+------+---------+----------+----------+----------+
| zuoshao | 2022-03-10 | 1 | 1 | 23 | 1 | 23 |
| zuoshao | 2022-03-11 | 2 | 3 | 23 | 3 | 23 |
| zuoshao | 2022-03-12 | 7 | 10 | 23 | 10 | 23 |
| zuoshao | 2022-03-13 | 3 | 13 | 23 | 13 | 23 |
| zuoshao | 2022-03-14 | 2 | 15 | 23 | 15 | 23 |
| zuoshao | 2022-03-15 | 4 | 19 | 23 | 19 | 23 |
| zuoshao | 2022-03-16 | 4 | 23 | 23 | 23 | 23 |
+---------+------------+------+---------+----------+----------+----------+
3.开窗 -内置函数
RANK
ROW_NUMBER
DENSE_RANK
NTILE
1.NTILE
需求:
把数据按照姓名进行分组 时间排序 结果数据分成 3份数
select
name,
dt,
cnt,
NTILE(3) over(partition by name order by dt) as n1,
NTILE(2) over(partition by name order by dt) as n2
from zuoshao_window
order by name ;
+---------+------------+------+------+------+
| name | dt | cnt | n1 | n2 |
+---------+------------+------+------+------+
| zuoshao | 2022-03-10 | 1 | 1 | 1 |
| zuoshao | 2022-03-11 | 2 | 1 | 1 |
| zuoshao | 2022-03-12 | 7 | 1 | 1 |
| zuoshao | 2022-03-13 | 3 | 2 | 1 |
| zuoshao | 2022-03-14 | 2 | 2 | 2 |
| zuoshao | 2022-03-15 | 4 | 3 | 2 |
| zuoshao | 2022-03-16 | 4 | 3 | 2 |
+---------+------------+------+------+------+
NTILE(N):
把数据平均分配到N中,如果不能平均分配,优先分配到较小的编号中。
2.rank相关的
RANK
ROW_NUMBER
DENSE_RANK
select
name,
dt,
cnt,
sum(cnt) over(partition by name order by dt) as sum_cnt,
rank() over(partition by name order by cnt) as rk,
row_number() over(partition by name order by cnt) as rn,
dense_rank() over(partition by name order by cnt) as d_rk
from zuoshao_window
order by name ;
+---------+------------+------+---------+----+----+------+
| name | dt | cnt | sum_cnt | rk | rn | d_rk |
+---------+------------+------+---------+----+----+------+
| zuoshao | 2022-03-10 | 1 | 1 | 1 | 1 | 1 |
| zuoshao | 2022-03-11 | 2 | 3 | 2 | 2 | 2 |
| zuoshao | 2022-03-14 | 2 | 15 | 2 | 3 | 2 |
| zuoshao | 2022-03-13 | 3 | 13 | 4 | 4 | 3 |
| zuoshao | 2022-03-15 | 4 | 19 | 5 | 5 | 4 |
| zuoshao | 2022-03-16 | 4 | 23 | 5 | 6 | 4 |
| zuoshao | 2022-03-12 | 7 | 10 | 7 | 7 | 5 |
+---------+------------+------+---------+----+----+------+
RANK:
从1开始,按照顺序,生成组内记录的编号,排序相同会重复,在名次中留下空位
ROW_NUMBER:
从1开始,按照顺序,生成组内记录的编号,序号没有重复的
DENSE_RANK:
从1开始,按照顺序,生成组内记录的编号,排序相同会重复,在名次中不留下空位
3.蹿行问题
lag 向上取第几行
lead 向下取第几行
- lag(input[, offset[, default]])
input =》 col
offset =》 取第几行
default=》 取不到时 给一个默认值
select
name,
dt,
cnt,
sum(cnt) over(partition by name order by dt) as sum_cnt,
lag(dt,1,"9999-99-99") over(partition by name order by dt) as lag_v,
lead(dt,1,"9999-99-99") over(partition by name order by dt) as lead_v
from zuoshao_window
order by name ;
4.取值问题
FIRST_VALUE(col):取分组后 截止到当前行 第一个值
LAST_VALUE(col):取分组后 截止到当前行 最后一个值
select
name,
dt,
cnt,
sum(cnt) over(partition by name order by dt) as sum_cnt,
first_value(dt) over(partition by name order by dt) as first_v,
last_value(dt) over(partition by name order by dt) as last_v
from zuoshao_window
order by name ;
开窗函数案例:
1.我们有如下的用户访问数据
userId visitDate visitCount
u01 2017/1/21 5
u02 2017/1/23 6
u03 2017/1/22 8
u04 2017/1/20 3
u01 2017/1/23 6
u01 2017/2/21 8
U02 2017/1/23 6
U01 2017/2/22 4
要求使用SQL统计出每个用户的累积访问次数,如下表所示:
用户id 月份 小计 累积
u01 2017-01 11 11
u01 2017-02 12 23
u02 2017-01 12 12
u03 2017-01 8 8
u04 2017-01 3 3
每个用户的累积访问次数=》
每个用户每个月累计访问次数
维度: 用户、月
指标:次数、累计访问次数
1.etl:
2017/2/22 =》 2017-02 日期函数 ,string函数 sql里面
2017/2/22=>2017-2-22
2.
1.先求 每个月 次数
2. 1结果 =》 累计
//使用date_format实现
select
-- 每个用户每个月 累计次数
userid,
month,
cnt,
sum(cnt) over(partition by userid order by month) as sum_cnt
from
(
select -- 每个用户每个月 次数
userid,
date_format(replace(visitdate,'/','-'), '%Y-%m') as month ,
sum(visitcount) as cnt
from test1
group by
userid,month
) a;
也可以使用str_to_date
date_format(str_to_date(visitdate,'%Y/%m/%d') ,'%Y-%m')as month
//使用substr实现
select
-- 每个用户每个月 累计次数
userid,
month,
cnt,
sum(cnt) over(partition by userid order by month) as sum_cnt
from
(
select -- 每个用户每个月 次数
userid,
substr(replace(visitdate,'/','-'),1,6) as month,
sum(visitcount) as cnt
from test1
group by
userid,month
) a;
2.有50W个京东店铺,每个顾客访客访问任何一个店铺的任何一个商品时都会产生一条访问日志,
访问日志存储的表名为Visit,访客的用户id为user_id,被访问的店铺名称为shop,数据如下:
u1 a
u2 b
u1 b
u1 a
u3 c
u4 b
u1 a
u2 c
u5 b
u4 b
u6 c
u2 c
u1 b
u2 a
u2 a
u3 a
u5 a
u5 a
u5 a
请统计:
(1)每个店铺的UV(访客数) 、pv(访问量)
维度:店铺
指标:uv =》user_id
//使用distinct
select
shop,
count(distinct user_id) as uv ,
count(user_id) as pv
from test2
group by
shop;
//使用union all
//最后的字段shop、uv、pv
-- 合并表
select
shop,
sum(uv) uv ,
sum(pv) pv
from
(
select
shop,
0 as uv ,
count(1) as pv
from test2
group by
shop
union all
select
shop,
count(1) as uv ,
0 as pv
from (
select
shop,
user_id
from test2
group by
shop,
user_id
) a
group by
shop
) a
group by
shop;
//先算每个shop每个人的pv
select
shop,
count(1) as user_uv,
sum(pv) as user_pv
from
(
select
user_id,
shop,
count(user_id) as pv
from test2
group by
shop,user_id
) tmp
group by
shop;
(2)每个店铺访问次数top3的访客信息。输出店铺名称、访客id、访问次数
维度:店铺、 访客id
指标:访问次数 、访问次数的top3
select
shop,
user_id,
cnt,
rk
from
(
select
shop,
user_id,
cnt,
row_number() over (partition by shop order by cnt desc) as rk
from
(
select
shop,
user_id,
count(1) as cnt
from test2
group by
shop,
user_id
) a
) a
where
rk <=3;
uv pv :
pv => page 次数 不需要去重
uv => user 次数 需要去重
2.行转列 &&& 列转行
1.列转行
create table t1(
name varchar(255),
hobby varchar(255)
);
insert into t1 VALUES("zuoshao","王者荣耀");
insert into t1 VALUES("zuoshao","黑丝");
insert into t1 VALUES("zuoshao","看小视频");
insert into t1 VALUES("xuanxuan","姐姐");
insert into t1 VALUES("xuanxuan","天天");
insert into t1 VALUES("xuanxuan","杰伦");
||
v
zuoshao,<王者荣耀,黑丝,看小视频>
xuanxuan,<姐姐,天天,杰伦>
mysql没有collection_list(hobby)
可以使用group_concat(hobby)
select
name,
group_concat(hobby) as hobbyies
from t1
group by
name;
-- hive
select
name,
concat_ws(",",collection_list(hobby)) as hobbyies
from t1
group by
name;
concat_ws
concat
在字符串之间加个分隔符
SELECT CONCAT_WS("-", "江西省", "赣州市", "于都县");
SELECT CONCAT_WS("-", "江西省", NULL, "于都县");
SELECT CONCAT_WS("-", NULL, "赣州市", "于都县");
如果分隔符是NULL,则结果为NULL
SELECT CONCAT_WS(NULL, "江西省", "赣州市", "于都县");
直接拼接起来
SELECT CONCAT("-", "江西省", "赣州市", "于都县");
2.行转列
hive爆破函数实现
select
name,
hobby_t
from hobby
lateral view expolde( split(hobbyies,",") ) hobby_tmp as hobby_t;
mysql没有爆破函数(免费版本没有)
split substring_index(col,",",num)
select substring_index("1,2,3,4",",",1)
mysql> select substring_index("1,2,3,4",",",1);
+----------------------------------+
| substring_index("1,2,3,4",",",1) |
+----------------------------------+
| 1 |
+----------------------------------+