Data Masters:
Jmx's Blog | Keep it Simple and Stupid!
猴子 – Zhihu column 猴子数据分析, author of the bestselling《数据分析思维》: https://www.zhihu.com/people/houziliaorenwu
一、Window functions:
1、An easy-to-follow introduction to SQL window functions (Zhihu): https://zhuanlan.zhihu.com/p/92654574
2、Understand SQL window functions (a.k.a. analytic functions) in 4 minutes (51CTO): https://database.51cto.com/art/202101/639239.htm
3、Common Hive functions, part 2: window functions, analytic functions, enhanced GROUP BY (CSDN): https://blog.csdn.net/scgaliguodong123_/article/details/60135385
4、Hive: window functions (CSDN): https://blog.csdn.net/weixin_38750084/article/details/82779910
5、MySQL in practice, part 2: window functions (CSDN): https://blog.csdn.net/weixin_39010770/article/details/87862407
5.1、A summary of common Hive functions (Zhihu): https://zhuanlan.zhihu.com/p/102502175
6、MySQL window functions (Zhihu): https://zhuanlan.zhihu.com/p/138282683
7、MySQL - SQL window functions (CSDN): https://blog.csdn.net/william_n/article/details/103236086
7.1、MySQL FIRST_VALUE function (beginner tutorial): https://www.begtut.com/mysql/mysql-first_value-function.html
7.2、MySQL 8.0 window functions: ranking and top-N problems (CSDN): https://blog.csdn.net/weixin_42510924/article/details/113301371
7.3、(interview questions, important) Classic Hive SQL interview questions (CSDN): https://blog.csdn.net/weixin_41836276/article/details/106289034
7.4、Six kinds of Hive SQL interview questions, summarized (CSDN): https://blog.csdn.net/lightupworld/article/details/108583548
7.5、Data warehouse construction process (CSDN): https://blog.csdn.net/lightupworld/article/details/108513990
7.6、Hive window/analytic functions, explained with detailed examples (CSDN): https://blog.csdn.net/lightupworld/article/details/108520149
7.7、Window functions in MySQL (cnblogs): https://www.cnblogs.com/kate7/p/13291744.html
7.8、(similar to a problem a colleague worked on) https://www.iteye.com/blog/53873039oycg-2020836
8、Summary
1. Window function syntax
<window function> over (partition by <columns to group by>
                        order by <columns to sort by>)
Two kinds of functions can fill the <window function> slot:
1) dedicated window functions such as rank, dense_rank, row_number
2) aggregate functions such as sum, avg, count, max, min
2. What window functions give you:
1) grouping (partition by) and ordering (order by) at the same time
2) they do not reduce the row count of the original table, which is why they are the usual tool for ranking within each group
3. Caveat
As a rule, window functions may appear only in the select clause.
4. Typical use cases
1) "rank within each group" business requirements, for example:
ranking: rank the employees of each department by performance
top-N: find the top N employees of each department for a bonus
5、
Before digging into the OVER clause, keep one thing in mind: in SQL processing, window functions are evaluated last, just before the final ORDER BY clause.
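The rules above can be exercised end to end. A minimal runnable sketch using Python's bundled sqlite3 (SQLite ≥ 3.25 supports the same window functions as MySQL 8); the `emp` table and its data are invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emp (dept TEXT, name TEXT, perf INTEGER)")
conn.executemany("INSERT INTO emp VALUES (?, ?, ?)",
                 [("A", "w1", 90), ("A", "w2", 90), ("A", "w3", 80),
                  ("B", "w4", 70), ("B", "w5", 60)])

# Window functions are evaluated after WHERE/GROUP BY/HAVING, so a ranked
# column cannot be filtered directly in WHERE: wrap the ranking in a subquery.
rows = conn.execute("""
    SELECT dept, name, perf, rnk FROM (
        SELECT dept, name, perf,
               RANK()       OVER (PARTITION BY dept ORDER BY perf DESC) AS rnk,
               DENSE_RANK() OVER (PARTITION BY dept ORDER BY perf DESC) AS drnk,
               ROW_NUMBER() OVER (PARTITION BY dept ORDER BY perf DESC) AS rn
        FROM emp) t
    WHERE rnk <= 2
    ORDER BY dept, rnk, name
""").fetchall()
print(rows)
```

Note how the tie at 90 in department A makes RANK skip to 3 for the next row, so only two rows of A survive the `rnk <= 2` filter while both rows of B do.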
9、A worked example of window functions:
Initialize the table:
SET NAMES utf8mb4;
SET FOREIGN_KEY_CHECKS = 0;
-- ----------------------------
-- Table structure for bj_table
-- ----------------------------
DROP TABLE IF EXISTS `bj_table`;
CREATE TABLE `bj_table` (
`xuehaoid` int(0) NOT NULL,
`banji` varchar(255) CHARACTER SET utf8 COLLATE utf8_general_ci NULL DEFAULT NULL,
`chengji` varchar(255) CHARACTER SET utf8 COLLATE utf8_general_ci NULL DEFAULT NULL,
PRIMARY KEY (`xuehaoid`) USING BTREE
) ENGINE = InnoDB CHARACTER SET = utf8 COLLATE = utf8_general_ci ROW_FORMAT = Dynamic;
-- ----------------------------
-- Records of bj_table
-- ----------------------------
INSERT INTO `bj_table` VALUES (1, '1', '86');
INSERT INTO `bj_table` VALUES (2, '1', '95');
INSERT INTO `bj_table` VALUES (3, '2', '89');
INSERT INTO `bj_table` VALUES (4, '1', '83');
INSERT INTO `bj_table` VALUES (5, '2', '86');
INSERT INTO `bj_table` VALUES (6, '3', '92');
INSERT INTO `bj_table` VALUES (7, '3', '86');
INSERT INTO `bj_table` VALUES (8, '1', '86');
SET FOREIGN_KEY_CHECKS = 1;
1、Partition by class, then sort by score descending (note: chengji is stored as varchar, so the ORDER BY compares strings; it only works here because every score has two digits):
select *,
RANK() over(PARTITION by banji
ORDER BY chengji desc) as rank_test,
dense_RANK() over(PARTITION by banji
ORDER BY chengji desc) as dense_RANK_test,
ROW_NUMBER() over(PARTITION by banji
ORDER BY chengji desc) as ROW_NUMBER_test
from bj_table
2、Partition by class and return the top two of each class, showing both student and score.
select * from(
select *,
RANK() over(PARTITION by banji
ORDER BY chengji desc) as rank_test,
dense_RANK() over(PARTITION by banji
ORDER BY chengji desc) as dense_RANK_test,
ROW_NUMBER() over(PARTITION by banji
ORDER BY chengji desc) as ROW_NUMBER_test
from bj_table) cs_table
# Method 1: filter with WHERE
where cs_table.dense_RANK_test = 1 or cs_table.dense_RANK_test = 2
# Method 2: WHERE ... IN
#where cs_table.dense_RANK_test in (1, 2)
9.1
CREATE TABLE overtime (
employee_name VARCHAR(50) NOT NULL,
department VARCHAR(50) NOT NULL,
hours INT NOT NULL,
PRIMARY KEY (employee_name , department)
);
INSERT INTO overtime(employee_name, department, hours)
VALUES('Diane Murphy','Accounting',37),
('Mary Patterson','Accounting',74),
('Jeff Firrelli','Accounting',40),
('William Patterson','Finance',58),
('Gerard Bondur','Finance',47),
('Anthony Bow','Finance',66),
('Leslie Jennings','IT',90),
('Leslie Thompson','IT',88),
('Julie Firrelli','Sales',81),
('Steve Patterson','Sales',29),
('Foon Yue Tseng','Sales',65),
('George Vanauf','Marketing',89),
('Loui Bondur','Marketing',49),
('Gerard Hernandez','Marketing',66),
('Pamela Castillo','SCM',96),
('Larry Bott','SCM',100),
('Barry Jones','SCM',65);
# An example of FIRST_VALUE. The two queries below produce identical output,
# but note that FIRST_VALUE() does not literally fetch just one row.
# rank(), DENSE_RANK() or ROW_NUMBER() can reproduce or replace what FIRST_VALUE() does.
select employee_name,hours from
(SELECT
employee_name,
hours,
FIRST_VALUE(employee_name) OVER (
ORDER BY hours
) least_over_time
FROM
overtime) cs_t1
-- ====================
select employee_name,hours from (
select employee_name,
hours,
DENSE_RANK() over (ORDER BY hours) as t_num
from overtime) cs_t2
# ====================================
-- The following query finds the employee with the least overtime in each department, sorted by hours ascending.
select * from (
SELECT
employee_name,
department,
hours,
FIRST_VALUE(employee_name) OVER (
PARTITION BY department
ORDER BY hours
) least_over_time
FROM
overtime) stsb
where stsb.employee_name = stsb.least_over_time
ORDER BY stsb.hours
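The least-overtime pattern above can be replayed with Python's sqlite3; a sketch over a cut-down copy of the data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE overtime (employee_name TEXT, department TEXT, hours INTEGER)")
conn.executemany("INSERT INTO overtime VALUES (?, ?, ?)",
                 [("Diane Murphy", "Accounting", 37),
                  ("Mary Patterson", "Accounting", 74),
                  ("Leslie Jennings", "IT", 90),
                  ("Leslie Thompson", "IT", 88)])

# FIRST_VALUE tags every row of a partition with that partition's first
# value under the given ordering; comparing it against employee_name
# keeps only the row that IS the first value.
rows = conn.execute("""
    SELECT employee_name, department, hours FROM (
        SELECT employee_name, department, hours,
               FIRST_VALUE(employee_name) OVER (
                   PARTITION BY department ORDER BY hours) AS least_over_time
        FROM overtime) t
    WHERE employee_name = least_over_time
    ORDER BY hours
""").fetchall()
print(rows)
```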
# -------------------------------------------------------------------------- ad-hoc test examples of my own below
SET NAMES utf8mb4;
SET FOREIGN_KEY_CHECKS = 0;
-- ----------------------------
-- Table structure for test_table
-- ----------------------------
DROP TABLE IF EXISTS `test_table`;
CREATE TABLE `test_table` (
`id` bigint(6) NULL DEFAULT NULL,
`province` varchar(255) CHARACTER SET utf8 COLLATE utf8_general_ci NULL DEFAULT NULL,
`city` varchar(255) CHARACTER SET utf8 COLLATE utf8_general_ci NULL DEFAULT NULL,
`uname` varchar(255) CHARACTER SET utf8 COLLATE utf8_general_ci NULL DEFAULT NULL,
`money` bigint(255) NULL DEFAULT NULL,
INDEX `sy_name`(`id`) USING BTREE
) ENGINE = InnoDB CHARACTER SET = utf8 COLLATE = utf8_general_ci ROW_FORMAT = Dynamic;
-- ----------------------------
-- Records of test_table
-- ----------------------------
INSERT INTO `test_table` VALUES (1, '南方地区', '深圳', '张三', 100);
INSERT INTO `test_table` VALUES (2, '南方地区', '广州', '董州', 39);
INSERT INTO `test_table` VALUES (3, '南方地区', '东莞', '黄丽', 56);
INSERT INTO `test_table` VALUES (4, '中原地区', '上海', '郭广昌', 109);
INSERT INTO `test_table` VALUES (5, '中原地区', '杭州', '马云', 980);
INSERT INTO `test_table` VALUES (6, '中原地区', '郑州', '许家印', 101);
INSERT INTO `test_table` VALUES (7, '北方地区', '北京', '王健林', 505);
INSERT INTO `test_table` VALUES (8, '北方地区', '哈尔滨', '付强', 21);
INSERT INTO `test_table` VALUES (9, '北方地区', '铁岭', '赵本山', 86);
SET FOREIGN_KEY_CHECKS = 1;
-- the richest person in each region
select ta.*
from
(select t.*,
DENSE_RANK() over (PARTITION by t.province ORDER BY t.money DESC) as d_r_n
from test_table t)ta
where ta.d_r_n <=1
select * from test_table;
-- -- Method 2
select tab.*
from(
select ta.*
from
test_table ta)tab
INNER JOIN
(select t.province as tprovince,MAX(t.money) as tmoney
from test_table t
GROUP BY t.province) tac
on tab.province = tac.tprovince and tab.money = tac.tmoney
-- --------------------- Examples:
-- 1、Partition by region, then rank money with DENSE_RANK().
-- 2、After partitioning by region, build a running total of money with SUM() OVER. One pitfall:
--    the ORDER BY inside OVER is what makes the sum cumulative; without it, every row just shows the partition total.
select tt.*,
DENSE_RANK() over (PARTITION by tt.province ORDER BY tt.money) as num,
SUM(tt.money) over (PARTITION by tt.province ORDER BY tt.money) as total_money
from test_table tt
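The pitfall called out above — SUM ... OVER only becomes a running total when ORDER BY is present — can be verified directly. A sqlite3 sketch with simplified single-region data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (province TEXT, money INTEGER)")
conn.executemany("INSERT INTO t VALUES (?, ?)",
                 [("south", 100), ("south", 39), ("south", 56)])

rows = conn.execute("""
    SELECT money,
           -- no ORDER BY: every row shows the partition total
           SUM(money) OVER (PARTITION BY province)                AS grp_total,
           -- with ORDER BY: a cumulative sum in money order
           SUM(money) OVER (PARTITION BY province ORDER BY money) AS running
    FROM t ORDER BY money
""").fetchall()
print(rows)
```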
--------------------------------------------------
-- LAST_VALUE vs FIRST_VALUE.
-- LAST_VALUE returns the last value of the window frame. It is used here without ORDER BY
-- so that the frame is the whole partition; with an ORDER BY, the default frame ends at the
-- current row, and LAST_VALUE would simply echo the current row's value.
select tt.id, tt.province, tt.city,tt.uname, tt.money,
LAST_VALUE(tt.money) over (PARTITION by tt.province)
from test_table tt
-- FIRST_VALUE takes the first value of the frame; since you can sort ascending or descending,
-- it can return either extreme of the partition, which is why it covers most LAST_VALUE use cases.
select tt.id, tt.province, tt.city,tt.uname, tt.money,
FIRST_VALUE(tt.money) over (PARTITION by tt.province order by tt.money)
from test_table tt
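One caveat worth seeing run: LAST_VALUE can in fact take an ORDER BY, but the default frame then stops at the current row, so it just echoes the current row's value; an explicit UNBOUNDED FOLLOWING frame is needed to get the true last value. A sqlite3 sketch (same simplified data as before):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (province TEXT, money INTEGER)")
conn.executemany("INSERT INTO t VALUES (?, ?)",
                 [("south", 100), ("south", 39), ("south", 56)])

rows = conn.execute("""
    SELECT money,
           -- default frame: UNBOUNDED PRECEDING .. CURRENT ROW,
           -- so this just returns the current row's money
           LAST_VALUE(money) OVER (PARTITION BY province ORDER BY money) AS frame_end,
           -- explicit frame covering the whole partition: the true last value
           LAST_VALUE(money) OVER (PARTITION BY province ORDER BY money
               ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS true_last
    FROM t ORDER BY money
""").fetchall()
print(rows)
```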
Example source:
https://www.cnblogs.com/zmoumou/p/10222127.html
select * from test_table tsd where
tsd.province in(select tt.province from test_table tt
GROUP BY tt.province desc)
and
tsd.money in (select MAX(tt.money)
from test_table tt
GROUP BY tt.province desc)
ORDER BY money desc
# Several ways to find the person with the largest amount per region; region, city, name, and amount must all be returned.
# MySQL 8.0+ rejects the query above: GROUP BY ... DESC is no longer supported, so the desc keyword
# must be dropped from the subqueries, as below. (Also beware: the IN-based approach can cross-match,
# since a row qualifies whenever its money equals the maximum of ANY region.)
select * from test_table tsd where
tsd.province in(select tt.province from test_table tt
GROUP BY tt.province )
and
tsd.money in (select MAX(tt.money)
from test_table tt
GROUP BY tt.province )
ORDER BY money desc;
# a row-constructor condition, similar to tuple comparison in Python
select * from test_table tsd where
(tsd.province, tsd.money) = ("南方地区", "100")
Query the MySQL installation path and version with SQL
MySQL installation path:
SELECT @@basedir AS basePath FROM DUAL
Version:
SELECT VERSION() FROM DUAL
-- Jump straight to the Nth row by row number
-- Method 1
select ta.*
from
(select (ROW_NUMBER() over ()) as rn,t.*
from test_table t) ta
where ta.rn = 9;
-- Method 2 (the offset is zero-based: 8 means the 9th row; 1 is how many rows to return)
select t.*
from test_table t
limit 8,1
Interview question examples:
1、Question 1
-- query the table
select * from test1;
-- create the table
CREATE TABLE test1 (
userId varchar(50) CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_as_ci NOT NULL,
visitDate varchar(50) CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_as_ci NOT NULL,
visitCount int(0) NOT NULL
)
-- insert data
INSERT INTO test1
(userId, visitDate, visitCount)VALUES
( 'u01', '2017/1/21', 5 );
INSERT INTO test1
(userId, visitDate, visitCount)VALUES
( 'u02', '2017/1/23', 6 );
INSERT INTO test1
(userId, visitDate, visitCount)VALUES
( 'u03', '2017/1/22', 8 );
INSERT INTO test1
(userId, visitDate, visitCount)VALUES
( 'u04', '2017/1/20', 3 );
INSERT INTO test1
(userId, visitDate, visitCount)VALUES
( 'u01', '2017/1/23', 6 );
INSERT INTO test1
(userId, visitDate, visitCount)VALUES
( 'u01', '2017/2/21', 8 );
INSERT INTO test1
(userId, visitDate, visitCount)VALUES
( 'u02', '2017/1/23', 6 );
INSERT INTO test1
(userId, visitDate, visitCount)VALUES
( 'u01', '2017/2/22', 4 );
-- ---- Requirement:
-- Use SQL to compute each user's cumulative visit count, as below:
-- user id   month     subtotal   cumulative
-- u01       2017-01   11         11
-- u01       2017-02   12         23
-- u02       2017-01   12         12
-- u03       2017-01   8          8
-- u04       2017-01   3          3
-- Method 1 (note: the outer running SUM orders by the subtotal nums; ordering by the month column would be more robust when monthly counts are not increasing)
select abc.abauid, abc.abavisitmonth, abc.nums,
sum(abc.nums) over (PARTITION by abc.abauid ORDER BY abc.nums) as c_total
from (
select CONCAT(ab.u_time,ab.auid) as abu_time , ab.auid as abauid, ab.avisitmonth as abavisitmonth, SUM(ab.anum_total) as nums from
(select CONCAT(a.uid,a.visitmonth) as u_time,
a.uid as auid, a.visitmonth as avisitmonth, a.num_total as anum_total
from (
SELECT userid as uid,
REPLACE(STR_TO_DATE(visitDate,"%Y/%m"), "-00", "")
AS visitmonth,
visitcount as num_total
FROM test1
ORDER BY uid,visitmonth) a) ab
GROUP BY ab.u_time
ORDER BY ab.auid ) abc
-- Method 2 (same caveat: the outer running SUM orders by the monthly subtotal rather than the month)
-- for testing
-- select * from test1;
-- for testing
-- select date_format(t.visitDate, '%Y-%m') as t_time from test1 t;
select tab.*,
SUM(tab.tatvisitCount) over (PARTITION by tab.tatuserId ORDER BY tab.tatvisitCount) as total_num
FROM
(SELECT
ta.tuserId AS tatuserId,
ta.t_time AS tat_time,
SUM( ta.tvisitCount ) AS tatvisitCount
FROM
( SELECT t.userId AS tuserId, date_format( t.visitDate, '%Y-%m' ) AS t_time,
t.visitCount AS tvisitCount FROM test1 t ) ta
GROUP BY
CONCAT( ta.tuserId, ta.t_time )
ORDER BY
tatuserId) tab
# -------------------------- practice notes of my own below
-- year/month/day extraction (any part, in any combination)
select date_format(t.visitDate, '%Y') as t_time from test1 t;
select date_format(t.visitDate, '%m') as t_time from test1 t;
select date_format(t.visitDate, '%d') as t_time from test1 t;
-- any combination works
select date_format(t.visitDate, '%Y-%m') as t_time from test1 t;
select date_format(t.visitDate, '%m-%d') as t_time from test1 t;
-- YEAR, MONTH, DAY usage
select YEAR(date_format(t.visitDate, '%Y-%m-%d')) as t_time from test1 t;
select MONTH((date_format(t.visitDate, '%Y-%m-%d'))) as t_time from test1 t;
select DAY((date_format(t.visitDate, '%Y-%m-%d'))) as t_time from test1 t;
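Outside MySQL, the same extractions are available in plain Python, which is handy for sanity-checking format strings (STR_TO_DATE roughly corresponds to strptime, DATE_FORMAT to strftime):

```python
from datetime import datetime

# parse the raw '2017/1/21' style used in the visitDate column,
# like STR_TO_DATE(visitDate, '%Y/%m/%d')
d = datetime.strptime("2017/1/21", "%Y/%m/%d")

month_key = d.strftime("%Y-%m")   # like DATE_FORMAT(visitDate, '%Y-%m')
parts = (d.year, d.month, d.day)  # like YEAR(), MONTH(), DAY()
print(month_key, parts)
```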
Question 2:
-- create the table
CREATE TABLE test2 (
user_id varchar(50) CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_as_ci NOT NULL,
shop varchar(50) CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_as_ci NOT NULL
)
-- insert data
INSERT INTO test2 VALUES
( 'u1', 'a' ),
( 'u2', 'b' ),
( 'u1', 'b' ),
( 'u1', 'a' ),
( 'u3', 'c' ),
( 'u4', 'b' ),
( 'u1', 'a' ),
( 'u2', 'c' ),
( 'u5', 'b' ),
( 'u4', 'b' ),
( 'u6', 'c' ),
( 'u2', 'c' ),
( 'u1', 'b' ),
( 'u2', 'a' ),
( 'u2', 'a' ),
( 'u3', 'a' ),
( 'u5', 'a' ),
( 'u5', 'a' ),
( 'u5', 'a' );
-- query the table
select * from test2;
Compute:
(1) each shop's UV (number of unique visitors)
-- Method 1
select t.shop, count(DISTINCT t.user_id)
from test2 t
group by t.shop;
-- Method 2
select ta.tshop, count(ta.tuserid)
from
(select DISTINCT CONCAT(t.user_id,t.shop), t.shop as tshop,
t.user_id as tuserid from test2 t) ta
GROUP BY ta.tshop
-- Method 3
SELECT
t.shop,
count(*)
FROM
( SELECT user_id, shop FROM test2 GROUP BY user_id, shop ) t
GROUP BY
t.shop;
Compute:
(2) each shop's top-3 visitors by visit count; output shop name, visitor id, and visit count
-- 1、my own answer:
select
tab.tatshop, tab.tatuser_id, tab.tasf
from
(select ta.tshop as tatshop, ta.tuser_id as tatuser_id, ta.sf as tasf,
ROW_NUMBER() over (PARTITION by ta.tshop ORDER BY ta.sf DESC) as R_NUM
from (
SELECT t.shop as tshop, t.user_id as tuser_id, count(user_id) as sf
FROM test2 t
GROUP BY tshop, tuser_id
order by tshop, sf DESC) ta)tab
where tab.R_NUM <= 3;
-- the textbook answer below
SELECT t2.shop,
t2.user_id,
t2.cnt
FROM
(SELECT t1.*,
row_number() over (partition BY t1.shop
ORDER BY t1.cnt DESC) as ranks
FROM
(SELECT user_id,
shop,
count(*) AS cnt
FROM test2
GROUP BY user_id,
shop) t1)t2
WHERE ranks <= 3;
Question 3:
-- Question 3
-- Requirement
--
-- Given a table STG.ORDER with fields Date, Order_id, User_id, amount.
-- Sample row: 2017-01-01,10029028,1000003251,33.57.
-- Write SQL to compute:
-- (1) for each month of 2017: order count, user count, and total transaction amount.
-- (2) the number of new customers in November 2017 (users whose first order was in November).
DROP table test3;
-- create the table
CREATE TABLE test3 (
dt varchar(50),
order_id varchar(50),
user_id varchar(50),
amount FLOAT ( 10, 2 ) );
-- insert data
INSERT INTO test3 VALUES ('2017-01-01','10029028','1000003251',33.57);
INSERT INTO test3 VALUES ('2017-01-01','10029029','1000003251',33.57);
INSERT INTO test3 VALUES ('2017-01-01','100290288','1000003252',33.57);
INSERT INTO test3 VALUES ('2017-02-02','10029088','1000003251',33.57);
INSERT INTO test3 VALUES ('2017-02-02','100290281','1000003251',33.57);
INSERT INTO test3 VALUES ('2017-02-02','100290282','1000003253',33.57);
INSERT INTO test3 VALUES ('2017-11-02','10290282','100003253',234);
INSERT INTO test3 VALUES ('2018-11-02','10290284','100003243',234);
-- query the table
select * from test3;
-- (1) For each month of 2017, compute order count, user count, and total amount.
-- my own answer:
-- Method 1
select date_format(t.dt, '%Y-%m') as t_time, count(t.order_id) as order_num,count(DISTINCT t.user_id) as user_num
,sum(t.amount) as amount_total
from
test3 t
GROUP BY t_time
HAVING left(t_time, 4) = '2017';
-- A LIKE filter would also narrow it down:
-- HAVING t_time LIKE '%2017%';
-- the answer found online:
-- Method 2
SELECT t1.mon,
count(t1.order_id) AS order_cnt,
count(DISTINCT t1.user_id) AS user_cnt,
sum(amount) AS total_amount
FROM
(SELECT order_id,
user_id,
amount,
date_format(dt,'%Y-%m') mon
FROM test3
WHERE date_format(dt,'%Y') = '2017') t1
GROUP BY t1.mon;
-- (2) The number of new customers in November 2017 (users whose first order was in that month)
SELECT count(user_id)
FROM test3
GROUP BY user_id
HAVING date_format(min(dt),'%Y-%m')='2017-11';
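The HAVING-on-MIN trick above can be replayed with sqlite3 (strftime standing in for MySQL's date_format); the data is trimmed to three users, one per case:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test3 (dt TEXT, order_id TEXT, user_id TEXT, amount REAL)")
conn.executemany("INSERT INTO test3 VALUES (?, ?, ?, ?)",
                 [("2017-01-01", "10029028", "1000003251", 33.57),   # old customer
                  ("2017-11-02", "10290282", "100003253", 234.0),    # first order in 2017-11
                  ("2018-11-02", "10290284", "100003243", 234.0)])   # first order in 2018

# A "new customer of 2017-11" is a user whose EARLIEST order month is 2017-11;
# grouping per user and filtering on MIN(dt) expresses exactly that.
new_users = [r[0] for r in conn.execute("""
    SELECT user_id FROM test3
    GROUP BY user_id
    HAVING strftime('%Y-%m', MIN(dt)) = '2017-11'
""")]
print(new_users)
```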
Question 4:
-- Question 4
-- Requirement
-- 1、Given a user file of 50 million rows (user_id, name, age) and a 200-million-row movie-viewing log (user_id, url), rank viewing counts by age bracket.
-- create the tables
CREATE TABLE test4user
(user_id varchar(50),
name varchar(50),
age int);
select * from test4user;
CREATE TABLE test4log
(user_id varchar(50),
url varchar(50));
select * from test4log;
-- insert data
INSERT INTO test4user VALUES('001','u1',10);
INSERT INTO test4user VALUES('002','u2',15);
INSERT INTO test4user VALUES('003','u3',15);
INSERT INTO test4user VALUES('004','u4',20);
INSERT INTO test4user VALUES('005','u5',25);
INSERT INTO test4user VALUES('006','u6',35);
INSERT INTO test4user VALUES('007','u7',40);
INSERT INTO test4user VALUES('008','u8',45);
INSERT INTO test4user VALUES('009','u9',50);
INSERT INTO test4user VALUES('0010','u10',65);
INSERT INTO test4log VALUES('001','url1');
INSERT INTO test4log VALUES('002','url1');
INSERT INTO test4log VALUES('003','url2');
INSERT INTO test4log VALUES('004','url3');
INSERT INTO test4log VALUES('005','url3');
INSERT INTO test4log VALUES('006','url1');
INSERT INTO test4log VALUES('007','url5');
INSERT INTO test4log VALUES('008','url7');
INSERT INTO test4log VALUES('009','url5');
INSERT INTO test4log VALUES('0010','url1');
-- query
-- 1、rank movie-viewing counts by age bracket (requirement repeated from above)
select * from test4user;
select * from test4log;
-- my own answer
select tab.taage_phase, count(tab.taage_phase) as view_num
from
(
select t2.user_id as t2user_id,t2.url as t2url,
ta.tage as tatage,ta.age_phase as taage_phase
from
(SELECT t.user_id as tuser_id, t.age as tage,
CASE WHEN age <= 10 AND age > 0 THEN '0-10'
WHEN age <= 20 AND age > 10 THEN '10-20'
WHEN age >20 AND age <=30 THEN '20-30'
WHEN age >30 AND age <=40 THEN '30-40'
WHEN age >40 AND age <=50 THEN '40-50'
WHEN age >50 AND age <=60 THEN '50-60'
WHEN age >60 AND age <=70 THEN '60-70'
ELSE '70以上' END as age_phase
FROM test4user t) ta
RIGHT JOIN test4log t2 on ta.tuser_id = t2.user_id) tab
GROUP BY tab.taage_phase;
-- the answer found online
SELECT
t2.age_phase,
sum(t1.cnt) as view_cnt
FROM
(SELECT user_id,
count(*) cnt
FROM test4log
GROUP BY user_id) t1
JOIN
(SELECT user_id,
CASE WHEN age <= 10 AND age > 0 THEN '0-10'
WHEN age <= 20 AND age > 10 THEN '10-20'
WHEN age >20 AND age <=30 THEN '20-30'
WHEN age >30 AND age <=40 THEN '30-40'
WHEN age >40 AND age <=50 THEN '40-50'
WHEN age >50 AND age <=60 THEN '50-60'
WHEN age >60 AND age <=70 THEN '60-70'
ELSE '70以上' END as age_phase
FROM test4user) t2 ON t1.user_id = t2.user_id
GROUP BY t2.age_phase;
Question 5:
Requirements:
Given the log below, write code that returns the total count and average age of all users,
and of the active users (an active user is one with visit records on two consecutive days).
date, user, age
2019-02-11,test_1,23
2019-02-11,test_2,19
2019-02-11,test_3,39
2019-02-11,test_1,23
2019-02-11,test_3,39
2019-02-11,test_1,23
2019-02-12,test_2,19
2019-02-13,test_1,23
2019-02-15,test_2,19
2019-02-16,test_2,19
-------- create the table
CREATE TABLE test5(
dt varchar(50),
user_id varchar(50),
age int)
-- ------- insert data
INSERT INTO test5 VALUES ('2019-02-11','test_1',23);
INSERT INTO test5 VALUES ('2019-02-11','test_2',19);
INSERT INTO test5 VALUES ('2019-02-11','test_3',39);
INSERT INTO test5 VALUES ('2019-02-11','test_1',23);
INSERT INTO test5 VALUES ('2019-02-11','test_3',39);
INSERT INTO test5 VALUES ('2019-02-11','test_1',23);
INSERT INTO test5 VALUES ('2019-02-12','test_2',19);
INSERT INTO test5 VALUES ('2019-02-13','test_1',23);
INSERT INTO test5 VALUES ('2019-02-15','test_2',19);
INSERT INTO test5 VALUES ('2019-02-16','test_2',19);
-- query the table
select * from test5
-- my own answer:
select
count(tosuccess.tyatyuser_id),
sum(tosuccess.tyatyage)/count(tosuccess.tyatyuser_id),
count(tosuccess.ttaabtat1user_id),
sum(tosuccess.ttaabtat1age)/count(tosuccess.ttaabtat1user_id)
from
(select
tya.tyuser_id as tyatyuser_id, ttaab.tat1user_id as ttaabtat1user_id,
tya.tyage as tyatyage,ttaab.tat1age as ttaabtat1age
from
( select ty.dt as tydt, ty.user_id as tyuser_id,ty.age as tyage
from test5 ty
ORDER BY ty.user_id) tya
LEFT JOIN
(select DISTINCT ta.t1user_id as tat1user_id, ta.t1age as tat1age
from
(select t1.age as t1age, t1.dt as t1dt, t1.user_id as t1user_id
from test5 t1
ORDER BY t1.user_id) ta
INNER JOIN
(SELECT DATE_SUB(t2.dt,INTERVAL -1 DAY) AS t2dt,
t2.user_id as t2user_id
FROM test5 t2
ORDER BY t2.user_id) tb
on ta.t1user_id = tb.t2user_id and ta.t1dt = tb.t2dt) ttaab
on
tya.tyuser_id = ttaab.tat1user_id
GROUP BY tya.tyuser_id,tya.tyage) tosuccess
-- the answer found online
SELECT sum(total_user_cnt) total_user_cnt,
sum(total_user_avg_age) total_user_avg_age,
sum(two_days_cnt) two_days_cnt,
sum(avg_age) avg_age
FROM
(SELECT 0 total_user_cnt,
0 total_user_avg_age,
count(*) AS two_days_cnt,
cast(sum(age) / count(*) AS decimal(5,2)) AS avg_age
FROM
(SELECT user_id,
max(age) age
FROM
(SELECT user_id,
max(age) age
FROM
(SELECT user_id,
age,
DATE_SUB(dt,INTERVAL rank_num DAY) as flagcc
FROM
(SELECT dt,
user_id,
max(age) age,
row_number() over (PARTITION BY user_id
ORDER BY dt) as rank_num
FROM test5
GROUP BY dt,
user_id) t1) t2
GROUP BY user_id,
flagcc
HAVING count(*) >=2) t3
GROUP BY user_id) t4
UNION ALL SELECT count(*) total_user_cnt,
cast(sum(age) /count(*) AS decimal(5,2)) total_user_avg_age,
0 two_days_cnt,
0 avg_age
FROM
(SELECT user_id,
max(age) age
FROM test5
GROUP BY user_id) t5) t6
-- a colleague's version (users active on 2 consecutive days)
select distinct user_id from (
SELECT * ,
case when @name=user_id then (
CASE
WHEN DATE_SUB(str_to_date(dt,'%Y-%m-%d'),INTERVAL 1 DAY) = str_to_date(@old,'%Y-%m-%d') and @old:=dt THEN @size:=@size+1
WHEN @old:=dt then @size:=1
END
)
when @name:=user_id and @old:=dt then @size:=1
end
AS tt
FROM test5,(SELECT @old:=null,@size:=1,@name:=null)r
ORDER BY user_id,dt
) t_ where tt = 2;
-- ----------- another method of my own (users active on 2 consecutive days)
select DISTINCT ta.t1user_id from
(select t1.dt as t1dt, t1.user_id as t1user_id
from test5 t1
ORDER BY t1.user_id) ta
INNER JOIN
(SELECT DATE_SUB(t2.dt,INTERVAL -1 DAY) AS t2dt,
t2.user_id as t2user_id
FROM test5 t2
ORDER BY t2.user_id) tb
on ta.t1user_id = tb.t2user_id and ta.t1dt = tb.t2dt
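The online answer above relies on the classic "gaps and islands" idea: subtract a per-user row number (taken in date order) from the date, so that a run of consecutive days collapses onto one constant anchor date; any anchor with at least 2 rows marks a streak. A sqlite3 sketch of that trick on a slice of test5:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test5 (dt TEXT, user_id TEXT, age INTEGER)")
conn.executemany("INSERT INTO test5 VALUES (?, ?, ?)",
                 [("2019-02-11", "test_1", 23), ("2019-02-13", "test_1", 23),
                  ("2019-02-11", "test_2", 19), ("2019-02-12", "test_2", 19),
                  ("2019-02-15", "test_2", 19), ("2019-02-16", "test_2", 19),
                  ("2019-02-11", "test_3", 39)])

active = [r[0] for r in conn.execute("""
    SELECT DISTINCT user_id FROM (
        SELECT user_id,
               -- dt minus its per-user row number: constant across a streak
               DATE(dt, '-' || ROW_NUMBER() OVER (
                   PARTITION BY user_id ORDER BY dt) || ' days') AS anchor
        FROM (SELECT DISTINCT dt, user_id FROM test5))
    GROUP BY user_id, anchor
    HAVING COUNT(*) >= 2
""")]
print(active)
```

Only test_2 has consecutive days (11→12 and 15→16); test_1 visited on the 11th and 13th, which do not collapse onto a shared anchor.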
Question 6
Requirement
Write SQL to find, for all users, the amount of each user's first purchase in October of this year.
Table ordertable fields:
(buyer: userid, amount: money,
purchase time: paymenttime (format: 2017-10-01), order id: orderid)
Implementation
Data preparation
select * from test6;
CREATE TABLE test6 (
userid varchar(50),
money FLOAT(10,2),
paymenttime varchar(50),
orderid varchar(50));
INSERT INTO test6 VALUES('007',100,'2017-09-01','133');
INSERT INTO test6 VALUES('007',200,'2017-10-02','134');
INSERT INTO test6 VALUES('010',500,'2017-10-01','135');
INSERT INTO test6 VALUES('011',100,'2017-08-01','136');
INSERT INTO test6 VALUES('011',100,'2018-10-11','137');
select * from test6;
-- my own query
select ta.*
from
(select t.userid as tuserid, t.money as tmoney, t.orderid as torderid,
t.paymenttime as tpaymenttime,
date_format(t.paymenttime, '%Y-%m') as t_y_m,
DENSE_RANK() over (PARTITION by t.userid ORDER BY t.paymenttime) as d_r_num
from test6 t)ta
where ta.t_y_m = '2017-10' and ta.d_r_num = 1
Question 10
Requirement
Given the account table below, write SQL to find the top-10 accounts by gold in each district group (top 10 per group):
dist_id string 'district group id',
account string 'account',
gold int 'gold coins'
Implementation
Data preparation
Create the table and insert data
CREATE TABLE test10(
dist_id varchar(50) COMMENT '区组id',
account varchar(50) COMMENT '账号',
gold int COMMENT '金币'
);
select * from test10;
INSERT INTO test10 VALUES ('1','77',18);
INSERT INTO test10 VALUES ('1','88',106);
INSERT INTO test10 VALUES ('1','99',10);
INSERT INTO test10 VALUES ('1','12',13);
INSERT INTO test10 VALUES ('1','13',14);
INSERT INTO test10 VALUES ('1','14',25);
INSERT INTO test10 VALUES ('1','15',36);
INSERT INTO test10 VALUES ('1','16',12);
INSERT INTO test10 VALUES ('1','17',158);
INSERT INTO test10 VALUES ('2','18',12);
INSERT INTO test10 VALUES ('2','19',44);
INSERT INTO test10 VALUES ('2','10',66);
INSERT INTO test10 VALUES ('2','45',80);
INSERT INTO test10 VALUES ('2','78',98);
select * from test10;
select ta.*
from
(select t.dist_id,t.account,t.gold,
DENSE_RANK() over (PARTITION by t.dist_id order by t.gold DESC) as d_r_n
from test10 t)ta
where ta.d_r_n<11
Question 9
Requirement
Given a top-up log table credit_log with the fields:
`dist_id` int 'district group id',
`account` string 'account',
`money` int 'top-up amount',
`create_time` string 'order time'
Write SQL to find, for 2019-01-02, the account with the largest top-up amount in each district group. Required output:
district group id, account, amount, top-up time
-- create the table and insert data
CREATE TABLE test9(
dist_id varchar(50) COMMENT '区组id',
account varchar(50) COMMENT '账号',
money FLOAT(10,2) COMMENT '充值金额',
create_time varchar(50) COMMENT '订单时间');
select * from test9;
-- DELETE from test9;
INSERT INTO test9 VALUES ('1','11',100006,'2019-01-02 13:00:01');
INSERT INTO test9 VALUES ('1','11',100066,'2019-01-02 13:13:13');
INSERT INTO test9 VALUES ('1','11',100666,'2019-01-02 15:55:55');
INSERT INTO test9 VALUES ('1','22',110000,'2019-01-02 13:00:02');
INSERT INTO test9 VALUES ('1','22',118888,'2019-01-02 18:58:58');
INSERT INTO test9 VALUES ('1','33',102000,'2019-01-02 13:00:03');
INSERT INTO test9 VALUES ('1','44',100300,'2019-01-02 13:00:04');
INSERT INTO test9 VALUES ('1','55',100040,'2019-01-02 13:00:05');
INSERT INTO test9 VALUES ('1','66',100005,'2019-01-02 13:00:06');
INSERT INTO test9 VALUES ('1','77',180000,'2019-01-03 13:00:07');
INSERT INTO test9 VALUES ('1','88',106000,'2019-01-02 13:00:08');
INSERT INTO test9 VALUES ('1','99',100400,'2019-01-02 13:00:09');
INSERT INTO test9 VALUES ('1','12',100030,'2019-01-02 13:00:10');
INSERT INTO test9 VALUES ('1','13',100003,'2019-01-02 13:00:20');
INSERT INTO test9 VALUES ('1','14',100020,'2019-01-02 13:00:30');
INSERT INTO test9 VALUES ('1','15',100500,'2019-01-02 13:00:40');
INSERT INTO test9 VALUES ('1','16',106000,'2019-01-02 13:00:50');
INSERT INTO test9 VALUES ('1','17',100800,'2019-01-02 13:00:59');
INSERT INTO test9 VALUES ('2','18',100800,'2019-01-02 13:00:11');
INSERT INTO test9 VALUES ('2','19',100030,'2019-01-02 13:00:12');
INSERT INTO test9 VALUES ('2','10',100000,'2019-01-02 13:00:13');
INSERT INTO test9 VALUES ('2','45',100010,'2019-01-02 13:00:14');
INSERT INTO test9 VALUES ('2','78',100070,'2019-01-02 13:00:15');
INSERT INTO test9 VALUES ('3','78',100080,'2019-01-02 16:00:56');
select * from test9;
-- my own method
select tab.*
from
(select ta.tadist_id,ta.taaccount,ta.sum_money,
DENSE_RANK() over (PARTITION by ta.tadist_id ORDER BY ta.sum_money DESC) as d_r_n
from
(select t.dist_id as tadist_id,t.account as taaccount,
SUM(t.money) as sum_money
from test9 t
where LEFT(t.create_time,10) = '2019-01-02'
GROUP BY t.dist_id,t.account)ta)tab
where d_r_n = 1
-- Notes (alternative date filters):
-- WHERE, variant 2
-- where t.create_time like '2019-01-02%'
-- WHERE, variant 3
-- where DATE_FORMAT(t.create_time,'%Y-%m-%d') = '2019-01-02'
-- the method found online
WITH TEMP AS
(SELECT dist_id,
account,
sum(money) sum_money
FROM test9
WHERE date_format(create_time,'%Y-%m-%d') = '2019-01-02'
GROUP BY dist_id,
account)
SELECT t1.dist_id,
t1.account,
t1.sum_money
FROM
(SELECT temp.dist_id,
temp.account,
temp.sum_money,
rank() over(partition BY temp.dist_id
ORDER BY temp.sum_money DESC) ranks
FROM TEMP) t1
WHERE ranks = 1;
Question 8
Requirement
Given an online server access log in the following format (answer with SQL):
time                 interface         ip address
2016-11-09 14:22:05  /api/user/login   110.23.5.33
2016-11-09 14:23:10  /api/user/detail  57.3.2.16
2016-11-09 15:59:40  /api/user/login   200.6.5.166
… …
Find the top-10 IP addresses that hit /api/user/login between 14:00 and 15:00 on November 9.
Create the table and insert data
CREATE TABLE test8
(`date` varchar(50),
interface varchar(50),
ip varchar(50));
select * from test8;
INSERT INTO test8 VALUES ('2016-11-09 11:22:05','/api/user/login','110.23.5.23');
INSERT INTO test8 VALUES ('2016-11-09 11:23:10','/api/user/detail','57.3.2.16');
INSERT INTO test8 VALUES ('2016-11-09 23:59:40','/api/user/login','200.6.5.166');
INSERT INTO test8 VALUES('2016-11-09 11:14:23','/api/user/login','136.79.47.70');
INSERT INTO test8 VALUES('2016-11-09 11:15:23','/api/user/detail','94.144.143.141');
INSERT INTO test8 VALUES('2016-11-09 11:16:23','/api/user/login','197.161.8.206');
INSERT INTO test8 VALUES('2016-11-09 12:14:23','/api/user/detail','240.227.107.145');
INSERT INTO test8 VALUES('2016-11-09 13:14:23','/api/user/login','79.130.122.205');
INSERT INTO test8 VALUES('2016-11-09 14:14:23','/api/user/detail','65.228.251.189');
INSERT INTO test8 VALUES('2016-11-09 14:15:23','/api/user/detail','245.23.122.44');
INSERT INTO test8 VALUES('2016-11-09 14:17:23','/api/user/detail','22.74.142.137');
INSERT INTO test8 VALUES('2016-11-09 14:19:23','/api/user/detail','54.93.212.87');
INSERT INTO test8 VALUES('2016-11-09 14:20:23','/api/user/detail','218.15.167.248');
INSERT INTO test8 VALUES('2016-11-09 14:24:23','/api/user/detail','20.117.19.75');
INSERT INTO test8 VALUES('2016-11-09 15:14:23','/api/user/login','183.162.66.97');
INSERT INTO test8 VALUES('2016-11-09 16:14:23','/api/user/login','108.181.245.147');
INSERT INTO test8 VALUES('2016-11-09 14:17:23','/api/user/login','22.74.142.137');
INSERT INTO test8 VALUES('2016-11-09 14:19:23','/api/user/login','22.74.142.137');
-- select * from test8;
-- my own method (note: without an ORDER BY on the count, LIMIT 10 does not actually return the top 10)
select ta.ip, COUNT(ta.ip)
from
(select t.*
from test8 t
where LEFT(t.date,13) = '2016-11-09 14' and t.interface = '/api/user/login'
ORDER BY t.date) ta
GROUP BY ta.ip
LIMIT 10
-- the method found online
SELECT ip,
count(*) AS cnt
FROM test8
WHERE date_format(date,'%Y-%m-%d %H') >= '2016-11-09 14'
AND date_format(date,'%Y-%m-%d %H') < '2016-11-09 15'
AND interface='/api/user/login'
GROUP BY ip
ORDER BY cnt desc
LIMIT 10;
-- Note: how string dates display after formatting
select date_format(date,'%Y-%m-%d %H:%i:%s')
from test8;
-- displays as: 2016-11-09 11:22:05
select DATE_FORMAT(date,'%Y-%m-%e %H:%i:%s')
from test8;
-- displays as: 2016-11-9 11:22:05
Question 7
Requirement
A library management database has the following three data models:
Books (table name: BOOK)
No.  Field      Description            Type
1    BOOK_ID    catalog number         text
2    SORT       classification number  text
3    BOOK_NAME  title                  text
4    WRITER     author                 text
5    OUTPUT     publisher              text
6    PRICE      unit price             numeric (2 decimal places)
Readers (table name: READER)
No.  Field      Description       Type
1    READER_ID  library card no.  text
2    COMPANY    employer          text
3    NAME       name              text
4    SEX        sex               text
5    GRADE      job title         text
6    ADDR       address           text
Borrow log (table name: BORROW_LOG)
No.  Field        Description       Type
1    READER_ID    library card no.  text
2    BOOK_ID      catalog number    text
3    BORROW_DATE  borrow date       date
(1) Create the table structures of the three base tables (books, readers, borrow log). Write the DDL.
(2) Find the names (NAME) and employers (COMPANY) of readers whose surname is 李 (Li).
(3) Find all titles (BOOK_NAME) and prices (PRICE) of books published by 高等教育出版社, sorted by price descending.
(4) Find the classification (SORT), publisher (OUTPUT), and price (PRICE) of books priced between 10 and 20 yuan, sorted by publisher (OUTPUT) and price (PRICE) ascending.
(5) Find the names (NAME) and employers (COMPANY) of all readers who have borrowed books.
(6) Find the highest, lowest, and average unit price of books from 科学出版社.
(7) Find the names and employers of readers who currently have at least 2 books (>= 2) on loan.
(8) For data safety, the borrow log must be backed up periodically. Write a single SQL statement that creates a table BORROW_LOG_BAK under the backup user bak with a structure identical to the borrow log, and copies all current borrow-log data into BORROW_LOG_BAK.
(9) The data is being migrated from the original Oracle database into a Hive warehouse. Write the Hive DDL for the books table (Hive; hints: column delimiter |; table data loaded externally; partitions named month_part and day_part).
(10) Hive has a table A. In month partition 201505, update user_dinner to bonc8920 for user_id 20000, leaving every other user's user_dinner unchanged. List the steps (Hive has no UPDATE syntax; use another approach).
Create the tables and insert data
(1)
-- create the books table book
CREATE TABLE book(book_id varchar(50),
`SORT` varchar(50),
book_name varchar(50),
writer varchar(50),
OUTPUT varchar(50),
price FLOAT(10,2));
select * from book;
INSERT INTO book VALUES ('001','TP391','信息处理','author1','机械工业出版社','20');
INSERT INTO book VALUES ('002','TP392','数据库','author12','科学出版社','15');
INSERT INTO book VALUES ('003','TP393','计算机网络','author3','机械工业出版社','29');
INSERT INTO book VALUES ('004','TP399','微机原理','author4','科学出版社','39');
INSERT INTO book VALUES ('005','C931','管理信息系统','author5','机械工业出版社','40');
INSERT INTO book VALUES ('006','C932','运筹学','author6','科学出版社','55');
-- create the readers table reader
CREATE TABLE reader (reader_id VARCHAR(200),
company VARCHAR(200),
name VARCHAR(200),
sex VARCHAR(200),
grade VARCHAR(200),
addr VARCHAR(200));
select * from reader;
INSERT INTO reader VALUES ('0001','阿里巴巴','jack','男','vp','addr1');
INSERT INTO reader VALUES ('0002','百度','robin','男','vp','addr2');
INSERT INTO reader VALUES ('0003','腾讯','tony','男','vp','addr3');
INSERT INTO reader VALUES ('0004','京东','jasper','男','cfo','addr4');
INSERT INTO reader VALUES ('0005','网易','zhangsan','女','ceo','addr5');
INSERT INTO reader VALUES ('0006','搜狐','lisi','女','ceo','addr6');
-- create the borrow-log table borrow_log
CREATE TABLE borrow_log(reader_id VARCHAR(200),
book_id VARCHAR(200),
borrow_date VARCHAR(200));
select * from borrow_log;
INSERT INTO borrow_log VALUES ('0001','002','2019-10-14');
INSERT INTO borrow_log VALUES ('0002','001','2019-10-13');
INSERT INTO borrow_log VALUES ('0003','005','2019-09-14');
INSERT INTO borrow_log VALUES ('0004','006','2019-08-15');
INSERT INTO borrow_log VALUES ('0005','003','2019-10-10');
INSERT INTO borrow_log VALUES ('0006','004','2019-17-13');
(2)
SELECT name,
company
FROM reader
WHERE name LIKE 'j%';
(3)
SELECT book_name,
price
FROM book
WHERE OUTPUT = "高等教育出版社"
ORDER BY price DESC;
(4)
SELECT sort,
output,
price
FROM book
WHERE price >= 10 and price <= 20
ORDER BY output,price ;
(5)
SELECT b.name,
b.company
FROM borrow_log a
JOIN reader b ON a.reader_id = b.reader_id;
(6)
SELECT max(price),
min(price),
avg(price)
FROM book
WHERE OUTPUT = '科学出版社';
(7)
SELECT b.name,
b.company
FROM
(SELECT reader_id
FROM borrow_log
GROUP BY reader_id
HAVING count(*) >= 2) a
JOIN reader b ON a.reader_id = b.reader_id;
(8)
CREATE TABLE borrow_log_bak AS
SELECT *
FROM borrow_log;
select * from borrow_log_bak;
(9)
CREATE TABLE book_hive (
book_id VARCHAR(200),
SORT VARCHAR(200),
book_name VARCHAR(200),
writer VARCHAR(200),
OUTPUT VARCHAR(200),
price DECIMAL(10, 2))
partitioned BY (month_part VARCHAR(200), day_part VARCHAR(200));
-- In Hive, also add: ROW FORMAT DELIMITED FIELDS TERMINATED BY '\|' STORED AS TEXTFILE;
(10)
Method 1: enable transaction (ACID) support in Hive and use a bucketed table stored as ORC, so UPDATE works directly.
Method 2: rebuild instead of updating in place. Step 1: select the rows that need updating, with the changed column replaced by its new value. Step 2: select the rows that do not need updating. Step 3: insert the results of both steps into a new table.
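Method 2 (rebuild instead of in-place UPDATE) works on any engine. A minimal sketch using Python's sqlite3 as a stand-in for Hive; the target row ('002') and the new price (99.0) are made-up example values:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE book (book_id TEXT, price REAL)")
cur.executemany("INSERT INTO book VALUES (?, ?)",
                [("001", 20.0), ("002", 25.0), ("003", 29.0)])

# Steps 1-3 collapsed into one statement: rows matching the update condition
# get the new value, every other row is copied through unchanged.
# (book_id '002' and price 99.0 are hypothetical example values.)
cur.execute("""
    CREATE TABLE book_new AS
    SELECT book_id,
           CASE WHEN book_id = '002' THEN 99.0 ELSE price END AS price
    FROM book
""")
rows = dict(cur.execute("SELECT book_id, price FROM book_new").fetchall())
print(rows)  # {'001': 20.0, '002': 99.0, '003': 29.0}
```

In Hive the last step would typically be an INSERT OVERWRITE back into the original table rather than creating a new one.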
1. A note on where column aliases may be used in MySQL:
An example makes this clear (the table test_table was created earlier):
select tt.province as ttprovince
from test_table tt
-- where ttprovince = '北方地区'; -- (not allowed: WHERE cannot reference the alias)
-- group by ttprovince; -- (allowed)
-- HAVING ttprovince = '北方地区'; -- (allowed)
-- ORDER BY ttprovince DESC; -- (allowed)
MySQL extends the SQL standard here: a column alias may be referenced in GROUP BY, HAVING, and ORDER BY, but not in WHERE.
https://blog.csdn.net/qq_26442553/article/details/80867076
-- Hive window functions: LAG and LEAD
create table windows_ss
(
polno varchar(300),
eff_date varchar(300),
userno varchar(300)
);
select * from windows_ss;
INSERT INTO windows_ss VALUES ("P066666666666","2016-04-02 09:00:02","user01");
INSERT INTO windows_ss VALUES ("P066666666666","2016-04-02 09:00:00","user02");
INSERT INTO windows_ss VALUES ("P066666666666","2016-04-02 09:03:04","user11");
INSERT INTO windows_ss VALUES ("P066666666666","2016-04-02 09:50:05","user03");
INSERT INTO windows_ss VALUES ("P066666666666","2016-04-02 10:00:00","user51");
INSERT INTO windows_ss VALUES ("P066666666666","2016-04-02 09:10:00","user09");
INSERT INTO windows_ss VALUES ("P066666666666","2016-04-02 09:50:01","user32");
INSERT INTO windows_ss VALUES ("P088888888888","2016-04-02 09:00:02","user41");
INSERT INTO windows_ss VALUES ("P088888888888","2016-04-02 09:00:00","user55");
INSERT INTO windows_ss VALUES ("P088888888888","2016-04-02 09:03:04","user23");
INSERT INTO windows_ss VALUES ("P088888888888","2016-04-02 09:50:05","user80");
INSERT INTO windows_ss VALUES ("P088888888888","2016-04-02 10:00:00","user08");
INSERT INTO windows_ss VALUES ("P088888888888","2016-04-02 09:10:00","user22");
INSERT INTO windows_ss VALUES ("P088888888888","2016-04-02 09:50:01","user31");
select * from windows_ss;
-- 1. LAG
-- LAG(col, n, DEFAULT) returns the value of col from the nth row above
-- the current row within the window. The second argument (optional,
-- default 1) is how many rows to look back; the third is the default
-- value returned when that row does not exist (NULL if not specified).
SELECT
polno,
eff_date,
userno,
ROW_NUMBER() OVER(PARTITION BY polno ORDER BY eff_date) AS rn,
LAG(eff_date,1,'1970-01-01 00:00:00') OVER(PARTITION BY polno ORDER BY eff_date) AS last_1_time,
LAG(eff_date,2) OVER(PARTITION BY polno ORDER BY eff_date) AS last_2_time
FROM windows_ss;
-- 2. LEAD
-- The mirror image of LAG: LEAD(col, n, DEFAULT) returns the value of col
-- from the nth row below the current row within the window. The second
-- argument (optional, default 1) is how many rows to look ahead; the third
-- is the default returned when that row does not exist (NULL if not specified).
SELECT
polno,
eff_date,
userno,
ROW_NUMBER() OVER(PARTITION BY polno ORDER BY eff_date) AS rn,
LEAD(eff_date,1,'1970-01-01 00:00:00') OVER(PARTITION BY polno ORDER BY eff_date) AS next_1_time,
LEAD(eff_date,2) OVER(PARTITION BY polno ORDER BY eff_date) AS next_2_time
FROM windows_ss;
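The LAG defaults described above can be checked without MySQL; the same syntax runs on SQLite 3.25+ (bundled with recent Python builds). A cut-down three-row windows_ss:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE windows_ss (polno TEXT, eff_date TEXT)")
cur.executemany("INSERT INTO windows_ss VALUES (?, ?)", [
    ("P0666", "2016-04-02 09:00:00"),
    ("P0666", "2016-04-02 09:00:02"),
    ("P0666", "2016-04-02 09:03:04"),
])
rows = cur.execute("""
    SELECT eff_date,
           -- one row back, with a default for the first row
           LAG(eff_date, 1, '1970-01-01 00:00:00')
               OVER (PARTITION BY polno ORDER BY eff_date) AS last_1_time,
           -- two rows back, no default -> NULL when that row is missing
           LAG(eff_date, 2)
               OVER (PARTITION BY polno ORDER BY eff_date) AS last_2_time
    FROM windows_ss
    ORDER BY eff_date
""").fetchall()
print(rows[0])  # ('2016-04-02 09:00:00', '1970-01-01 00:00:00', None)
```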
The frame (window) clause
partition by alone groups the data; if we want finer-grained control over which rows each aggregate sees, we add a frame clause.
Two defaults to understand first:
- With only partition by and no order by, the aggregate is computed over the whole partition.
- With order by but no frame clause, the default frame runs from the partition start to the current row.
When a single select contains several window functions, they do not affect one another; each applies its own rules.
Frame clause keywords:
- PRECEDING: rows before the current row
- FOLLOWING: rows after the current row
- CURRENT ROW: the current row
- UNBOUNDED: the partition boundary; UNBOUNDED PRECEDING means from the first row, UNBOUNDED FOLLOWING means to the last row
Below we partition by uname, order by the purchase date, and accumulate cost, combining these frame specifications in one query.
-- Create the table
CREATE TABLE t_window(
uname VARCHAR(200),
orderdate VARCHAR(200),
cost INT
);
select * from t_window;
INSERT INTO t_window VALUES ("jack","2015-01-01",10);
INSERT INTO t_window VALUES ("tony","2015-01-02",15);
INSERT INTO t_window VALUES ("jack","2015-02-03",23);
INSERT INTO t_window VALUES ("tony","2015-01-04",29);
INSERT INTO t_window VALUES ("jack","2015-01-05",46);
INSERT INTO t_window VALUES ("jack","2015-04-06",42);
INSERT INTO t_window VALUES ("tony","2015-01-07",50);
INSERT INTO t_window VALUES ("jack","2015-01-08",55);
INSERT INTO t_window VALUES ("mart","2015-04-08",62);
INSERT INTO t_window VALUES ("mart","2015-04-09",68);
INSERT INTO t_window VALUES ("neil","2015-05-10",12);
INSERT INTO t_window VALUES ("mart","2015-04-11",75);
INSERT INTO t_window VALUES ("neil","2015-06-12",80);
INSERT INTO t_window VALUES ("mart","2015-04-13",94);
select * from t_window;
-- Example query
select uname,orderdate,cost,
-- sum over all rows
sum(cost) over() as sample1,
-- sum within each uname group
sum(cost) over(partition by uname) as sample2,
-- running total within each group
sum(cost) over(partition by uname order by orderdate) as sample3,
-- same as sample3: aggregate from the partition start to the current row
sum(cost) over(partition by uname order by orderdate rows between UNBOUNDED PRECEDING and current row ) as sample4,
-- current row plus the preceding row
sum(cost) over(partition by uname order by orderdate rows between 1 PRECEDING and current row) as sample5,
-- preceding row, current row, and following row
sum(cost) over(partition by uname order by orderdate rows between 1 PRECEDING AND 1 FOLLOWING ) as sample6,
-- current row through the end of the partition
sum(cost) over(partition by uname order by orderdate rows between current row and UNBOUNDED FOLLOWING ) as sample7
from t_window;
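As a quick sanity check on the frame clauses, here is a minimal run of sample3 and sample5 on a three-row slice of jack's data, using Python's sqlite3 (SQLite 3.25+) in place of MySQL:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE t_window (uname TEXT, orderdate TEXT, cost INT)")
cur.executemany("INSERT INTO t_window VALUES (?, ?, ?)", [
    ("jack", "2015-01-01", 10),
    ("jack", "2015-01-05", 46),
    ("jack", "2015-01-08", 55),
])
rows = cur.execute("""
    SELECT orderdate, cost,
           -- default frame with ORDER BY: running total
           SUM(cost) OVER (PARTITION BY uname ORDER BY orderdate) AS sample3,
           -- previous row + current row only
           SUM(cost) OVER (PARTITION BY uname ORDER BY orderdate
                           ROWS BETWEEN 1 PRECEDING AND CURRENT ROW) AS sample5
    FROM t_window
    ORDER BY orderdate
""").fetchall()
print([r[2] for r in rows], [r[3] for r in rows])  # [10, 56, 111] [10, 56, 101]
```

The last row shows the difference: the running total reaches 111, while the two-row frame only sums 46 + 55 = 101.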
-- Current-time functions
select NOW() a1;
select SYSDATE() a2;
select CURDATE() a3;
-- String/date conversion functions
SELECT STR_TO_DATE("2021-09-27", "%Y-%m-%d") as a1;
SELECT DATE_FORMAT("2021-09-27", "%Y-%m-%d") as a2;
MySQL date and time functions:
-- 取当前时间的函数
select NOW() a1;
select SYSDATE() a2;
select CURDATE() a3;
-- 字符串时间转换的函数
SELECT STR_TO_DATE("2021-09-27", "%Y-%m-%d") as a1;
SELECT DATE_FORMAT("2021-09-27", "%Y-%m-%d") as a2;
-- Interval (time difference) functions
select DATEDIFF("2021-09-26","2021-09-25");
SELECT TIMEDIFF("13:10:11", "13:10:10");
1. Two MySQL functions for comparing two points in time:
TIMESTAMPDIFF counts whole elapsed units, including the time-of-day part: TIMESTAMPDIFF(DAY, a, b) returns 1 only once a full 24 hours separate a and b, and 0 before that.
select TIMESTAMPDIFF(DAY,'2016-11-16 10:13:42',NOW());
DATEDIFF compares only the date parts (here just '2016-11-16'), ignoring hours and minutes entirely.
select DATEDIFF(NOW(),'2016-11-16 17:10:52');
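The distinction is easy to verify with plain Python, no database needed: timedelta.days counts whole 24-hour periods, matching TIMESTAMPDIFF(DAY, ...), while subtracting only the date parts matches DATEDIFF (the timestamps below are arbitrary examples):

```python
from datetime import datetime

# Two timestamps less than 24 hours apart but on different calendar days.
a = datetime(2016, 11, 16, 17, 10, 52)
b = datetime(2016, 11, 17, 10, 0, 0)

timestampdiff_day = (b - a).days           # full 24h periods elapsed -> 0
datediff = (b.date() - a.date()).days      # date parts only          -> 1
print(timestampdiff_day, datediff)  # 0 1
```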
2. MySQL window functions: LEAD in practice
Table and data:
/*
Navicat Premium Data Transfer
Source Server : jack
Source Server Type : MySQL
Source Server Version : 80021
Source Host : localhost:3306
Source Schema : mysql8
Target Server Type : MySQL
Target Server Version : 80021
File Encoding : 65001
Date: 10/11/2021 14:02:08
*/
SET NAMES utf8mb4;
SET FOREIGN_KEY_CHECKS = 0;
-- ----------------------------
-- Table structure for user_order
-- ----------------------------
DROP TABLE IF EXISTS `user_order`;
CREATE TABLE `user_order` (
`user_id` int(0) NOT NULL,
`user_name` varchar(555) CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_as_ci NULL DEFAULT NULL,
`product_name` varchar(555) CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_as_ci NULL DEFAULT NULL,
`buy_time` datetime(0) NULL DEFAULT NULL,
`login_time` datetime(0) NULL DEFAULT NULL,
PRIMARY KEY (`user_id`) USING BTREE
) ENGINE = InnoDB CHARACTER SET = utf8mb4 COLLATE = utf8mb4_0900_as_ci ROW_FORMAT = Dynamic;
-- ----------------------------
-- Records of user_order
-- ----------------------------
INSERT INTO `user_order` VALUES (1, 'jack', '华为手机', '2021-11-09 20:07:25', '2021-11-21 20:00:36');
INSERT INTO `user_order` VALUES (2, 'jack', '华为手机', '2021-10-09 21:17:56', '2021-11-23 20:13:15');
INSERT INTO `user_order` VALUES (3, 'lucy', '小米手机', '2021-08-09 10:07:24', '2021-11-09 20:27:06');
INSERT INTO `user_order` VALUES (4, 'lucy', '小米手机', '2021-09-09 09:07:33', '2021-05-09 03:07:44');
INSERT INTO `user_order` VALUES (5, 'mark', 'oppo手机', '2020-08-08 08:07:28', '2019-06-09 02:19:16');
INSERT INTO `user_order` VALUES (6, 'mark', 'oppo手机', '2017-07-07 07:07:18', '2016-05-09 02:10:16');
INSERT INTO `user_order` VALUES (7, 'jack', '华为手机', '2010-11-09 16:17:15', '2018-10-09 10:17:55');
SET FOREIGN_KEY_CHECKS = 1;
-- End of table setup
-- select * from user_order;
-- 1. Who has the longest gap between two phone purchases?
-- 2. For that person, how many days is the gap?
-- 3. The two shortest gaps between two purchases.
-- 4. Who logged in on three consecutive days?
select * from(
select *,TIMESTAMPDIFF(day,buy_time,b_time) as t_day
from
(select *,
lead(tb.buy_time,1) over(PARTITION by tb.user_name ORDER BY tb.buy_time ) as 'b_time'
from
(select *,
ROW_NUMBER() over(PARTITION by t.user_name ORDER BY t.buy_time ) as 'r_n'
from user_order t) tb)tbc)tbcd
where t_day = (
select MAX(t_day) as max_time
from(
select *,TIMESTAMPDIFF(day,buy_time,b_time) as 't_day'
from
(select *,
lead(tb.buy_time,1) over(PARTITION by tb.user_name ORDER BY tb.buy_time ) as 'b_time'
from
(select *,
ROW_NUMBER() over(PARTITION by t.user_name ORDER BY t.buy_time ) as 'r_n'
from user_order t) tb)tbc)tbcd);
-- 3. The two shortest gaps between two purchases.
select * from
(select *,ROW_NUMBER() over(PARTITION by user_name ORDER BY t_day) as rr_nn
from(
select *,TIMESTAMPDIFF(day,buy_time,b_time) as 't_day'
from
(select *,
lead(tb.buy_time,1) over(PARTITION by tb.user_name ORDER BY tb.buy_time ) as 'b_time'
from
(select *,
ROW_NUMBER() over(PARTITION by t.user_name ORDER BY t.buy_time ) as 'r_n'
from user_order t) tb)tbc)tbcd
where t_day is NOT null ) tbcde
where rr_nn = 1
ORDER BY t_day
LIMIT 2;
-- 4. Who logged in on three consecutive days?
select * from(
select *,
lead(t.login_time,1) over(partition by t.user_name ORDER BY t.login_time) as lead_t1,
lead(t.login_time,2) over(partition by t.user_name ORDER BY t.login_time) as lead_t2
from user_order t) tb
where lead_t2 is not null
and
datediff(login_time,lead_t1) = -1
and
datediff(login_time,lead_t2) = -2;
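The consecutive-login pattern above can be smoke-tested with Python's sqlite3 (SQLite 3.25+), emulating MySQL's DATEDIFF with julianday(); the toy rows below are made up, with only jack logging in three days in a row:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE user_order (user_name TEXT, login_time TEXT)")
cur.executemany("INSERT INTO user_order VALUES (?, ?)", [
    ("jack", "2021-11-01"), ("jack", "2021-11-02"), ("jack", "2021-11-03"),
    ("lucy", "2021-11-01"), ("lucy", "2021-11-05"), ("lucy", "2021-11-06"),
])
names = [r[0] for r in cur.execute("""
    SELECT DISTINCT user_name FROM (
        SELECT user_name, login_time,
               LEAD(login_time, 1) OVER (PARTITION BY user_name
                                         ORDER BY login_time) AS lead_t1,
               LEAD(login_time, 2) OVER (PARTITION BY user_name
                                         ORDER BY login_time) AS lead_t2
        FROM user_order
    )
    WHERE lead_t2 IS NOT NULL
      -- julianday() differences stand in for MySQL's DATEDIFF
      AND julianday(lead_t1) - julianday(login_time) = 1
      AND julianday(lead_t2) - julianday(login_time) = 2
""").fetchall()]
print(names)  # ['jack']
```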
SELECT * FROM `user_order`;
-- NTILE(n) splits the ordered rows of each partition into n buckets as evenly as possible and returns each row's bucket number.
select *,
ROW_NUMBER() over (PARTITION by t.user_name order by t.buy_time) as r_n,
NTILE(2) over (PARTITION by t.user_name order by t.buy_time) as n_e
from user_order t;
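One detail worth verifying: when the rows don't divide evenly, NTILE puts the extra rows into the earlier buckets. A three-row check with sqlite3 (SQLite 3.25+), using toy data rather than user_order:

```python
import sqlite3

# NTILE(2) over three rows: bucket sizes 2 and 1, extra row in the first bucket.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE t (v INT)")
cur.executemany("INSERT INTO t VALUES (?)", [(1,), (2,), (3,)])
buckets = [r[0] for r in cur.execute(
    "SELECT NTILE(2) OVER (ORDER BY v) FROM t ORDER BY v").fetchall()]
print(buckets)  # [1, 1, 2]
```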
-- NTH_VALUE(expr, n) returns the value of expr from the nth row of the window frame (NULL while the frame holds fewer than n rows).
select *,
ROW_NUMBER() over (PARTITION by t.user_name order by t.buy_time) as r_n,
NTH_VALUE(t.login_time,2) over (PARTITION by t.user_name order by t.buy_time) as n_v
from user_order t;
-- Distribution functions
-- 1. PERCENT_RANK(): relative rank, PERCENT_RANK() = (rank - 1) / (rows - 1)
select *,
ROW_NUMBER() over (PARTITION by t.user_name order by t.buy_time) as r_n,
PERCENT_RANK() over (PARTITION by t.user_name order by t.buy_time) as p_r
from user_order t;
-- 2. CUME_DIST(): cumulative distribution, the number of rows ranked at or below the current row divided by the total rows in the partition
select *,
ROW_NUMBER() over (PARTITION by t.user_name order by t.buy_time) as r_n,
CUME_DIST() over (PARTITION by t.user_name order by t.buy_time) as c_d
from user_order t;
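Both formulas can be confirmed on a tiny table with sqlite3 (SQLite 3.25+): with three distinct values, PERCENT_RANK yields 0, 0.5, 1 and CUME_DIST yields 1/3, 2/3, 1:

```python
import sqlite3

# Three distinct values -> ranks 1, 2, 3.
# PERCENT_RANK = (rank - 1) / (rows - 1); CUME_DIST = rows_at_or_below / rows.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE t (v INT)")
cur.executemany("INSERT INTO t VALUES (?)", [(10,), (20,), (30,)])
rows = cur.execute("""
    SELECT v,
           PERCENT_RANK() OVER (ORDER BY v) AS p_r,
           CUME_DIST()    OVER (ORDER BY v) AS c_d
    FROM t
    ORDER BY v
""").fetchall()
print([r[1] for r in rows])  # [0.0, 0.5, 1.0]
```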
-- First/last value functions
-- FIRST_VALUE(expr): returns the value of expr from the first row of the frame
select *,
ROW_NUMBER() over (PARTITION by t.user_name order by t.buy_time) as r_n,
FIRST_VALUE(t.login_time) over (PARTITION by t.user_name order by t.buy_time) as f_v
from user_order t;
-- LAST_VALUE(expr): a common gotcha rather than a bug. When the OVER clause
-- has an ORDER BY (e.g. on buy_time or login_time), the default frame ends at
-- the current row, so LAST_VALUE returns the current row's value instead of the
-- partition's last value; extend the frame to UNBOUNDED FOLLOWING to fix it.
select *,
ROW_NUMBER() over (PARTITION by t.user_name order by t.buy_time) as r_n,
-- LAST_VALUE(expr)
LAST_VALUE(t.login_time) over (PARTITION by t.user_name ORDER BY t.user_name) as L_v
from user_order t;
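The LAST_VALUE default-frame behavior can be demonstrated and fixed with an explicit frame; a sqlite3 (SQLite 3.25+) sketch on toy data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE t (uname TEXT, login_time TEXT)")
cur.executemany("INSERT INTO t VALUES (?, ?)",
                [("jack", "2021-01-01"), ("jack", "2021-01-02"),
                 ("jack", "2021-01-03")])
rows = cur.execute("""
    SELECT login_time,
           -- default frame ends at the current row,
           -- so this returns the current row's value
           LAST_VALUE(login_time) OVER (
               PARTITION BY uname ORDER BY login_time) AS default_frame,
           -- explicit frame covering the whole partition: the true last value
           LAST_VALUE(login_time) OVER (
               PARTITION BY uname ORDER BY login_time
               ROWS BETWEEN UNBOUNDED PRECEDING
                        AND UNBOUNDED FOLLOWING) AS full_frame
    FROM t
    ORDER BY login_time
""").fetchall()
print(rows[0])  # ('2021-01-01', '2021-01-01', '2021-01-03')
```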
-- (Exercise) Using a running total of book_id, compare how far the cumulative
-- sum falls short of 100 just before crossing that threshold with how far it
-- overshoots just after.
select(
select xiao_o from (
select *,
LAST_VALUE(book_id) over (order by r_o desc),
100 - s_o as xiao_o,
ROW_NUMBER() over() as row_n
from (
select *,
sum(book_id) over (ORDER BY book_id) as s_o,
ROW_NUMBER() over() as r_o
from book t) ta
where s_o<=100) tab
where row_n=1
)
-
(
select xiao_o from (
select *,
LAST_VALUE(book_id) over (order by r_o),
s_o - 100 as xiao_o,
ROW_NUMBER() over() as row_n
from (
select *,
sum(book_id) over (ORDER BY book_id) as s_o,
ROW_NUMBER() over() as r_o
from book t) ta
where s_o>100) tab
where row_n=1
)
from dual;