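Notes:
The over keyword specifies the window of rows over which the function is evaluated. If the parentheses after it are empty, the window contains every row that satisfies the WHERE condition and the window function is computed over all of them. If they are not empty, the following 4 syntax elements can be used to define the window.
① window_name: gives the window an alias. When a query involves many windows, aliases make it clearer and easier to read;
② PARTITION BY clause: the columns by which the rows are grouped; the window function is evaluated separately within each group;
③ ORDER BY clause: the columns by which the rows are sorted; the window function processes (for example, numbers) the rows in that sorted order;
④ FRAME clause: a frame is a subset of the current partition, and the clause defines the rule that selects this subset. It is typically used to build a sliding window.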
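The four elements above can be seen together in one small query. This is only a sketch under assumptions: it uses Python's built-in sqlite3 module as a stand-in for Hive (SQLite 3.25 or newer is needed for window functions), and the five sample rows are made up for illustration.

```python
import sqlite3

# SQLite is an assumed stand-in for Hive; the OVER/WINDOW syntax shown is standard SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test_window (logday TEXT, userid TEXT, score INTEGER)")
conn.executemany(
    "INSERT INTO test_window VALUES (?, ?, ?)",
    [  # hypothetical sample rows
        ("20191020", "u1", 85),
        ("20191021", "u1", 90),
        ("20191022", "u1", 70),
        ("20191020", "u2", 60),
        ("20191021", "u2", 95),
    ],
)

rows = conn.execute(
    """
    SELECT userid, logday, score,
           COUNT(*)   OVER ()  AS all_rows,     -- empty window: every row
           SUM(score) OVER w   AS sliding_sum   -- named window defined below
    FROM test_window
    WINDOW w AS (PARTITION BY userid            -- one group per user
                 ORDER BY logday                -- sorted by day inside the group
                 ROWS BETWEEN 1 PRECEDING AND CURRENT ROW)  -- frame: previous row + current row
    ORDER BY userid, logday
    """
).fetchall()

for row in rows:
    print(row)
```

The frame makes sliding_sum a two-row moving sum inside each user's partition, while all_rows is the same on every row.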
1. Ranking (sequence-number) functions
Case 1
create table test_window
(logday string,
userid string,
score int)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
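1. Use over() to compute statistics: list each user together with a row count
2. List the user details together with the total number of users per day
3. Count, from the first day up to now, all users whose score is greater than 80
4. For each user, count the days up to the current date on which the score is greater than 80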
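1.
-- Row count per user; sum(1) over() with empty parentheses would give the whole-table total instead
select
t1.userid,
sum(1) over(PARTITION BY t1.userid) as user_rows
from
test_window t1;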
2.
select
t2.logday,count(1)
from
(
select
-- deduplicate users within each day
t1.logday,count(distinct t1.userid) as conid
from
test_window t1
group by t1.logday,t1.userid) t2
group by t2.logday;
2. Without deduplication, window version (short form)
select *, count(1) over(partition by logday) as day_total from test_window;
2. With deduplication, window version
select t2.logday ,max(maxcon) as maxcon
from
(
select t1.logday , count(distinct t1.userid) over(partition by t1.logday order by t1.userid) as maxcon
from test_window t1)t2
group by t2.logday;
3.
-- Assuming no deduplication: window version, a running (cumulative) count over the whole table
select
t1.logday,
count(1) over(order by t1.logday rows between unbounded preceding and current row) as sumnumber
from
test_window t1
where t1.score > 80;
20191020 1
20191020 2
20191020 3
20191021 4
20191021 5
20191022 6
20191023 7
3.
-- Assuming no deduplication: plain aggregate version
select
count(t1.userid)
from
test_window t1
where t1.score > 80;
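4. Window version. Compare the roles of partition by and order by in a window function: this count runs per user partition, whereas query 3 runs over the whole table.
select t1.logday , t1.userid,
count(1) over(partition by t1.userid order by t1.logday rows between unbounded preceding and current row) as sumnumber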
from
test_window t1
where t1.score > 80
order by t1.logday,t1.userid;
4.
select
t1.userid,count(t1.score)
from
test_window t1
where t1.score > 80
group by t1.userid;
Case 2
Test data
jack,2017-01-01,10
tony,2017-01-02,15
jack,2017-02-03,23
tony,2017-01-04,29
jack,2017-01-05,46
jack,2017-04-06,42
tony,2017-01-07,50
jack,2017-01-08,55
mart,2017-04-08,62
mart,2017-04-09,68
neil,2017-05-10,12
mart,2017-04-11,75
neil,2017-06-12,80
mart,2017-04-13,94
create table business
(
name string,
orderdate string,
cost int
)ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
-- Load the data
load data local inpath "/root/business.txt" into table business;
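1. Find the customers who made a purchase in April 2017, and the total number of such customers
2. List each customer's purchase details together with the monthly purchase total
3. List each customer's purchase details together with that customer's running purchase total to date
4. Find each customer's previous purchase date, using the lag() over() offset function
5. Find the orders in the earliest 20% by time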
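1. Uses substr(orderdate, 1, 7) to extract the year and month. The collecting functions collect_list()/collect_set() do not seem to support windows.
-- First attempt: row count per month (not yet restricted to April 2017)
select
t2.orderdate,count(1)
from
(
select
t1.name,substr(t1.orderdate,1,7)as orderdate
from
business t1)t2
group by t2.orderdate;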
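1. Window version
Analysis of sum() over() and count() over():
1. With partition by X followed by order by Y, rows are grouped by X and the rows sharing the same X are sorted by Y; sum(a) over(...) then accumulates column a in that order. This is a per-group running sum.
2. Without order by Y, sum(a) over(partition by X) treats all rows of a group as one unit. This is a per-group total.
3. With neither, that is sum(a) over() with empty parentheses, the value is the sum of column a over all rows and is identical on every row. This is a grand total.
-- group by leaves one row per customer, so count(1) over() is the number of distinct April customers
select
t1.name , count(1) over() as con
from
business t1
where substr(t1.orderdate,1,7)='2017-04'
group by t1.name;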
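The three cases of sum() over() analysed above can be compared side by side on the test data. This is a rough sketch that assumes SQLite (through Python's sqlite3 module, version 3.25 or newer) behaves like Hive for these standard window clauses; the column aliases are my own.

```python
import sqlite3

# Test data copied from the business table above.
DATA = [
    ("jack", "2017-01-01", 10), ("tony", "2017-01-02", 15), ("jack", "2017-02-03", 23),
    ("tony", "2017-01-04", 29), ("jack", "2017-01-05", 46), ("jack", "2017-04-06", 42),
    ("tony", "2017-01-07", 50), ("jack", "2017-01-08", 55), ("mart", "2017-04-08", 62),
    ("mart", "2017-04-09", 68), ("neil", "2017-05-10", 12), ("mart", "2017-04-11", 75),
    ("neil", "2017-06-12", 80), ("mart", "2017-04-13", 94),
]

conn = sqlite3.connect(":memory:")  # SQLite is an assumed stand-in for Hive
conn.execute("CREATE TABLE business (name TEXT, orderdate TEXT, cost INTEGER)")
conn.executemany("INSERT INTO business VALUES (?, ?, ?)", DATA)

rows = conn.execute(
    """
    SELECT name, orderdate, cost,
           SUM(cost) OVER (PARTITION BY name ORDER BY orderdate) AS running_sum,  -- case 1
           SUM(cost) OVER (PARTITION BY name)                    AS group_total,  -- case 2
           SUM(cost) OVER ()                                     AS grand_total   -- case 3
    FROM business
    ORDER BY name, orderdate
    """
).fetchall()

# Show one partition only; filtering in SQL would change the grand total.
jack = [r for r in rows if r[0] == "jack"]
for r in jack:
    print(r)
```

For jack the running sum grows row by row (10, 56, 111, 134, 176), the group total is 176 on every row, and the grand total is 661 on every row.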
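2. List each customer's purchase details together with the monthly purchase total
-- Window version. Open question: how would a running view work here?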
select
t1.name ,t1.orderdate, sum(t1.cost) over(partition by t1.name , t1.orderdate ) as cost1
from
(
select
name,substr(orderdate,0,7) as orderdate ,cost
from
business
order by name ,orderdate ) t1
order by t1.name ,t1.orderdate ;
-- Simplified version
select
*,
sum(cost) over(partition by name,substr(orderdate,1,7) ) total_amount
from
business;
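3. List each customer's purchase details together with that customer's running purchase total to date
-- Reflection: a running view works here because the frame changes from row to row; in the previous query the window is the same for every row of a partition, so the value is constant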
select
name , orderdate,sum(cost)over(partition by name order by orderdate rows between unbounded preceding and current row) as cost
from
business;
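4. Find each customer's previous purchase date, using the lag() over() offset function
-- lag() over() and lead() over() return a value from an earlier or a later row. Unlike group by, a window function keeps every detail row.
select
name,orderdate,
lag(orderdate,1,'initial value') over(partition by name order by orderdate) as preorderdate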
from
business;
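To see lag() and lead() side by side, here is a minimal sketch on a subset of the test data (tony's and neil's rows). It assumes SQLite via Python's sqlite3 module (3.25 or newer) as a stand-in for Hive; the default value 'none' and the column aliases are illustrative choices.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite is an assumed stand-in for Hive
conn.execute("CREATE TABLE business (name TEXT, orderdate TEXT, cost INTEGER)")
conn.executemany(
    "INSERT INTO business VALUES (?, ?, ?)",
    [  # subset of the test data above
        ("tony", "2017-01-02", 15),
        ("tony", "2017-01-04", 29),
        ("tony", "2017-01-07", 50),
        ("neil", "2017-05-10", 12),
        ("neil", "2017-06-12", 80),
    ],
)

shifted = conn.execute(
    """
    SELECT name, orderdate,
           LAG(orderdate, 1, 'none')  OVER (PARTITION BY name ORDER BY orderdate) AS prev_date,
           LEAD(orderdate, 1, 'none') OVER (PARTITION BY name ORDER BY orderdate) AS next_date
    FROM business
    ORDER BY name, orderdate
    """
).fetchall()

for row in shifted:
    print(row)
```

The third argument is returned when no row exists at the requested offset, which happens on the first row of each partition for lag() and on the last row for lead().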
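5. Find the orders in the earliest 20% by time
-- NTILE(n) splits the ordered rows into n buckets of nearly equal size; with n = 5, bucket 1 is the earliest 20%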
select
*
from
(
select
name , orderdate , cost , NTILE(5) over(order by orderdate) as times
from
business )t1
where t1.times = 1;
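How NTILE(5) distributes the 14 test rows can be checked with a short sketch. It assumes SQLite via Python's sqlite3 module (3.25 or newer) as a stand-in for Hive; the variable names are my own.

```python
import sqlite3

# Test data copied from the business table above.
DATA = [
    ("jack", "2017-01-01", 10), ("tony", "2017-01-02", 15), ("jack", "2017-02-03", 23),
    ("tony", "2017-01-04", 29), ("jack", "2017-01-05", 46), ("jack", "2017-04-06", 42),
    ("tony", "2017-01-07", 50), ("jack", "2017-01-08", 55), ("mart", "2017-04-08", 62),
    ("mart", "2017-04-09", 68), ("neil", "2017-05-10", 12), ("mart", "2017-04-11", 75),
    ("neil", "2017-06-12", 80), ("mart", "2017-04-13", 94),
]

conn = sqlite3.connect(":memory:")  # SQLite is an assumed stand-in for Hive
conn.execute("CREATE TABLE business (name TEXT, orderdate TEXT, cost INTEGER)")
conn.executemany("INSERT INTO business VALUES (?, ?, ?)", DATA)

bucketed = conn.execute(
    """
    SELECT name, orderdate, cost,
           NTILE(5) OVER (ORDER BY orderdate) AS times
    FROM business
    ORDER BY orderdate
    """
).fetchall()

# Number of rows that landed in each bucket, in bucket order.
bucket_sizes = [sum(1 for r in bucketed if r[3] == b) for b in range(1, 6)]
# The earliest 20% is bucket 1.
first_bucket = [r[:3] for r in bucketed if r[3] == 1]

print(bucket_sizes)
print(first_bucket)
```

Because 14 is not divisible by 5, the buckets are not all the same size: the leading buckets receive the extra rows, so "the first 20%" here means 3 of the 14 orders.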
Case 3:
Top-N case
Given a table score
We want the top few scores within each subject (the top 3 in the query below)
select
*
from
(
select
*,
row_number() over(partition by subject order by score desc) rmp
from score
) t
where t.rmp<=3;
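row_number() is only one of the ranking functions; rank() and dense_rank() treat ties differently, which changes what a top-N filter returns. The sketch below assumes SQLite via Python's sqlite3 module (3.25 or newer) as a stand-in for Hive. The name column and the four rows are hypothetical, since the notes do not show the layout of the score table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite is an assumed stand-in for Hive
# Assumed layout: the notes only imply the subject and score columns.
conn.execute("CREATE TABLE score (name TEXT, subject TEXT, score INTEGER)")
conn.executemany(
    "INSERT INTO score VALUES (?, ?, ?)",
    [("a", "math", 90), ("b", "math", 90), ("c", "math", 80), ("d", "math", 70)],
)

ranked = conn.execute(
    """
    SELECT name, score,
           -- name is added as a tie-breaker so the row numbers are deterministic
           ROW_NUMBER() OVER (PARTITION BY subject ORDER BY score DESC, name) AS rn,
           RANK()       OVER (PARTITION BY subject ORDER BY score DESC)       AS rk,
           DENSE_RANK() OVER (PARTITION BY subject ORDER BY score DESC)       AS drk
    FROM score
    ORDER BY rn
    """
).fetchall()

top3_by_row_number = [r[0] for r in ranked if r[2] <= 3]
top3_by_dense_rank = [r[0] for r in ranked if r[4] <= 3]

for r in ranked:
    print(r)
```

With a tie for first place, row_number() still numbers 1, 2, 3, 4; rank() gives 1, 1, 3, 4; dense_rank() gives 1, 1, 2, 3. A filter of <= 3 therefore keeps three students with row_number() but all four with dense_rank().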
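Top-percent case
Given a table user_sales_table:
user_name: user name
pay_amount: amount paid by the user
The boss wants to know which users are in the top 20% by payment amount.
select
t1.user_name
from(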
select
user_name,
ntile(5) over(order by sum(pay_amount) desc ) as num
from
user_sales_table
group by user_name) t1
where t1.num =1;
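Consecutive logins (version without deduplication)
Given a table game:
user_name: user name
date: the user's login date
The boss wants to identify the important users who logged in to the platform on 7 consecutive days.
Required output:
user_name: user name (the users who logged in on 7 consecutive days)
Functions used: row_number(), date_sub(), cast()
Approach 1: deduplicate by grouping on user_name and date (without splitting the date), subtract each row's row number from its date, and group on the result. Consecutive dates share the same shifted date, so this examines every run of consecutive login dates.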
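select
t3.user_name,
count(1) as times
from
(
select
t2.user_name,
-- consecutive dates minus their row number all land on the same day
date_sub(cast(t2.`date` as DATE), t2.n) as sumdate
from
(
select
t1.user_name,
t1.`date`,
row_number() over(partition by t1.user_name order by t1.`date`) as n
from
(
select
user_name, `date`
from
game
group by user_name, `date`) t1) t2 ) t3
group by t3.user_name, t3.sumdate
having count(1) >= 7;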
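The row_number() and date-shift idea can be tried end to end on a small data set. This is a sketch under assumptions: SQLite (via Python's sqlite3 module, 3.25 or newer) stands in for Hive, so SQLite's date(x, '-N days') replaces date_sub(); the date column is renamed login_date here to avoid clashing with that function; and the login rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite is an assumed stand-in for Hive
conn.execute("CREATE TABLE game (user_name TEXT, login_date TEXT)")

# Hypothetical logins: user a has 7 consecutive days (plus one duplicate row),
# user b skips 2020-01-03, so b's longest run is only 5 days.
logins = [("a", f"2020-01-{d:02d}") for d in range(1, 8)]
logins.append(("a", "2020-01-03"))
logins += [("b", f"2020-01-{d:02d}") for d in (1, 2, 4, 5, 6, 7, 8)]
conn.executemany("INSERT INTO game VALUES (?, ?)", logins)

streaks = conn.execute(
    """
    SELECT user_name, COUNT(*) AS times
    FROM (
        -- shifting each date back by its row number maps a run of
        -- consecutive dates onto one common anchor date
        SELECT user_name, date(login_date, '-' || n || ' days') AS anchor
        FROM (
            SELECT user_name, login_date,
                   ROW_NUMBER() OVER (PARTITION BY user_name ORDER BY login_date) AS n
            FROM (SELECT DISTINCT user_name, login_date FROM game)  -- deduplicate first
        )
    )
    GROUP BY user_name, anchor
    HAVING COUNT(*) >= 7
    """
).fetchall()

print(streaks)
```

Deduplicating before numbering matters: a repeated login on the same day would otherwise consume a row number and break the run into two anchors.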
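Approach 2
-- Use lag() to look back 6 rows within each user's deduplicated, date-ordered logins. If that earlier date equals the current date minus 6 days (date_sub), the 7 rows cover 7 consecutive days.
-- Note: lag() also works on string columns; it simply returns the column value from an earlier row
select
distinct t2.user_name
from
(
select
t1.user_name,
t1.`date`,
lag(t1.`date`, 6) over(partition by t1.user_name order by t1.`date`) as date1
from
(
select
user_name, `date`
from
game
group by user_name, `date`) t1 ) t2
where t2.date1 is not null and date_sub(cast(t2.`date` as DATE), 6) = cast(t2.date1 as DATE);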
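The lag-based approach can be checked in the same way. The sketch uses an offset of 6, since seven consecutive days span six day-steps. As before, SQLite (via Python's sqlite3 module, 3.25 or newer) is an assumed stand-in for Hive, date(x, '-6 days') replaces date_sub(), the login_date column name is my own, and the login rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite is an assumed stand-in for Hive
conn.execute("CREATE TABLE game (user_name TEXT, login_date TEXT)")

# Hypothetical logins: a logs in on 7 consecutive days, b skips 2020-01-03.
logins = [("a", f"2020-01-{d:02d}") for d in range(1, 8)]
logins += [("b", f"2020-01-{d:02d}") for d in (1, 2, 4, 5, 6, 7, 8)]
conn.executemany("INSERT INTO game VALUES (?, ?)", logins)

loyal_users = conn.execute(
    """
    SELECT DISTINCT user_name
    FROM (
        SELECT user_name, login_date,
               -- the login date 6 rows earlier for the same user (NULL if there is none)
               LAG(login_date, 6) OVER (PARTITION BY user_name ORDER BY login_date) AS date1
        FROM (SELECT DISTINCT user_name, login_date FROM game)
    )
    -- 7 distinct dates spanning exactly 6 days must be consecutive
    WHERE date1 IS NOT NULL AND date(login_date, '-6 days') = date1
    ORDER BY user_name
    """
).fetchall()

print(loyal_users)
```

User b has seven distinct login dates as well, but they span seven days rather than six, so the equality test fails for b.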