Aggregate (summary) queries are very useful and may often make up most of your work
USE sql_invoicing;
SELECT
MAX(invoice_date) AS latest_date,
-- SELECT can pick not only columns but also literals, expressions over columns, and aggregate functions of columns
MIN(invoice_total) lowest,
AVG(invoice_total) average,
SUM(invoice_total * 1.1) total,
COUNT(*) total_records,
COUNT(invoice_total) number_of_invoices,
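-- same as COUNT(*) here, since invoice_total has no NULLs
COUNT(payment_date) number_of_payments,
-- aggregate functions ignore NULLs, so the number of payments is lower than the number of invoices
COUNT(DISTINCT client_id) number_of_distinct_clients
-- DISTINCT client_id drops duplicate values in the column before COUNT runs, giving the number of distinct clients
FROM invoices
WHERE invoice_date > '2019-07-01' -- only summarize the second half of the year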
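The NULL handling of the COUNT variants above can be checked on a tiny data set. The following is only a sketch: it runs the SQL through Python's built-in sqlite3 module (SQLite standing in for MySQL is an assumption), and the table contents are made up for illustration.

```python
import sqlite3

# Hypothetical miniature of the invoices table; every row is invented.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE invoices (client_id INTEGER, invoice_total REAL, payment_date TEXT)"
)
conn.executemany(
    "INSERT INTO invoices VALUES (?, ?, ?)",
    [
        (1, 100.0, "2019-08-01"),
        (1, 150.0, None),          # unpaid: payment_date is NULL
        (2, 200.0, "2019-09-15"),
        (3, 50.0, None),           # unpaid: payment_date is NULL
    ],
)

row = conn.execute(
    """
    SELECT
        COUNT(*)                  AS total_records,
        COUNT(invoice_total)      AS number_of_invoices,
        COUNT(payment_date)       AS number_of_payments,
        COUNT(DISTINCT client_id) AS number_of_distinct_clients,
        SUM(invoice_total)        AS total
    FROM invoices
    """
).fetchone()

print(row)  # (4, 4, 2, 3, 500.0)
```

COUNT(payment_date) skips the two NULL rows, while COUNT(*) counts every row regardless of NULLs.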
USE sql_invoicing;
SELECT
'1st_half_of_2019' AS date_range,
SUM(invoice_total) AS total_sales,
SUM(payment_total) AS total_payments,
SUM(invoice_total - payment_total) AS what_we_expect
FROM invoices
WHERE invoice_date BETWEEN '2019-01-01' AND '2019-06-30'
UNION
SELECT
'2nd_half_of_2019' AS date_range,
SUM(invoice_total) AS total_sales,
SUM(payment_total) AS total_payments,
SUM(invoice_total - payment_total) AS what_we_expect
FROM invoices
WHERE invoice_date BETWEEN '2019-07-01' AND '2019-12-31'
UNION
SELECT
'Total' AS date_range,
SUM(invoice_total) AS total_sales,
SUM(payment_total) AS total_payments,
SUM(invoice_total - payment_total) AS what_we_expect
FROM invoices
WHERE invoice_date BETWEEN '2019-01-01' AND '2019-12-31'
USE sql_invoicing;
SELECT
client_id,
SUM(invoice_total) AS total_sales
……
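Selecting client_id only makes sense when the aggregate function is grouped by client_id. (In a grouped query, the SELECT list is normally the grouping columns plus aggregate functions of the target columns; selecting any other column is meaningless.) Without grouping, the result is a single row containing the overall total_sales and one client_id, and that client_id is meaningless. In other words, client_id is collapsed to a single row rather than SUM being broadcast to many rows; you can think of the aggregate function as the dominant party. (With MySQL's ONLY_FULL_GROUP_BY mode enabled, such an ungrouped query is rejected with an error instead.)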
……
FROM invoices
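WHERE invoice_date >= '2019-07-01' -- row filter
GROUP BY client_id -- grouping
ORDER BY total_sales DESC
If the ORDER BY clause is omitted, the rows appear to come back sorted by the grouping column (a later example suggests this is not guaranteed, so it is best not to omit it)
Clause order matters: WHERE, GROUP BY, ORDER BY. GROUP BY must come before ORDER BY; swapping them raises an error
As mentioned, the grouping columns are usually the same as the columns selected in SELECT ……, such as state and city in the example below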
USE sql_invoicing;
SELECT
state,
city,
SUM(invoice_total) AS total_sales
FROM invoices
JOIN clients USING (client_id)
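-- don't forget the parentheses after USING; they are easy to miss
GROUP BY state, city
-- separate multiple grouping columns with commas
-- in this example, dropping state from GROUP BY gives the same result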
ORDER BY state
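In the example above a city can belong to only one state, so in the end the query still computes sales per city. Dropping state from GROUP BY …… and grouping by city alone (while keeping state in SELECT and ORDER BY) gives exactly the same result, including the state column. The next example better shows the point of grouping by multiple columns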
USE sql_invoicing;
SELECT
date,
pm.name AS payment_method,
SUM(amount) AS total_payments
FROM payments p
JOIN payment_methods pm
ON p.payment_method = pm.payment_method_id
GROUP BY date, payment_method
-- this uses the column alias defined in SELECT
ORDER BY date
USE sql_invoicing;
SELECT
client_id,
SUM(invoice_total) AS total_sales,
COUNT(*) AS number_of_invoices -- COUNT(invoice_total) or COUNT(invoice_date) would also work here
FROM invoices
GROUP BY client_id
HAVING total_sales > 500 AND number_of_invoices > 5
-- both are column aliases defined in SELECT
Writing WHERE total_sales > 500 AND number_of_invoices > 5 instead raises: Error Code: 1054. Unknown column 'total_sales' in 'where clause'
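The division of labor between WHERE (filters rows before grouping) and HAVING (filters groups after aggregation) can be seen in a small sketch. It uses Python's sqlite3 module as a stand-in for MySQL, which is an assumption, and the rows are made up. HAVING spells out the aggregate rather than the alias to keep the query portable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE invoices (client_id INTEGER, invoice_total REAL, invoice_date TEXT)"
)
conn.executemany(
    "INSERT INTO invoices VALUES (?, ?, ?)",
    [
        (1, 300.0, "2019-03-01"),
        (1, 400.0, "2019-08-01"),
        (2, 100.0, "2019-09-01"),
        (2, 150.0, "2019-10-01"),
        (3, 900.0, "2019-02-01"),
    ],
)

rows = conn.execute(
    """
    SELECT client_id,
           SUM(invoice_total) AS total_sales,
           COUNT(*)           AS number_of_invoices
    FROM invoices
    WHERE invoice_date >= '2019-07-01'   -- row filter, applied before grouping
    GROUP BY client_id
    HAVING SUM(invoice_total) > 300      -- group filter, applied after grouping
    ORDER BY total_sales DESC
    """
).fetchall()

print(rows)  # [(1, 400.0, 1)]
```

Client 3 disappears at the WHERE step (its only invoice is from February, despite the large total), while client 2 survives WHERE but is dropped by HAVING because its second-half total is only 250.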
USE sql_store;
SELECT
c.customer_id,
c.first_name,
c.last_name,
SUM(oi.quantity * oi.unit_price) AS total_sales
FROM customers c
JOIN orders o USING (customer_id) -- don't forget the parentheses; they are easy to miss
JOIN order_items oi USING (order_id)
WHERE state = 'VA'
GROUP BY
c.customer_id,
c.first_name,
c.last_name
HAVING total_sales > 100
SELECT state, SUM(points)
FROM customers
GROUP BY state
HAVING SUM(points) > 3000
or
SELECT state
FROM customers
GROUP BY state
HAVING SUM(points) > 3000
USE sql_invoicing;
SELECT
client_id,
SUM(invoice_total)
FROM invoices
GROUP BY client_id WITH ROLLUP
SELECT
state,
city,
SUM(invoice_total) AS total_sales
FROM invoices
JOIN clients USING (client_id)
GROUP BY state, city WITH ROLLUP
USE sql_invoicing;
SELECT
date,
pm.name AS payment_method,
SUM(amount) AS total_payments
FROM payments p
JOIN payment_methods pm
ON p.payment_method = pm.payment_method_id
GROUP BY date, pm.name WITH ROLLUP
SELECT
pm.name AS payment_method,
SUM(amount) AS total
FROM payments p
JOIN payment_methods pm
ON p.payment_method = pm.payment_method_id
GROUP BY pm.name WITH ROLLUP
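According to the three reference articles listed later, the standard execution order of a SQL query is reportedly as follows:
"SELECT is executed after most of the other clauses; strictly speaking, after FROM, WHERE, and GROUP BY (and HAVING). Understanding this is very important: it is the reason you cannot use a field aliased in SELECT as a condition in WHERE."
This order can be expressed through the indentation of the example below (read from right to left). Note that DISTINCT does not fit into the layout and is shown only as a comment, and SELECT is still placed before HAVING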
SELECT name, SUM(invoice_total) AS total_sales
-- DISTINCT
FROM invoices JOIN clients USING (client_id)
WHERE due_date < '2019-07-01'
GROUP BY name
HAVING total_sales > 150
UNION
SELECT name, SUM(invoice_total) AS total_sales
-- DISTINCT
FROM invoices JOIN clients USING (client_id)
WHERE due_date > '2019-07-01'
GROUP BY name
HAVING total_sales > 150
ORDER BY total_sales
LIMIT 2
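On the position of SELECT
1. As the reference articles listed later say, in the standard SQL execution order SELECT comes after HAVING
2. Judging from the material above, however, MySQL appears to execute SELECT after WHERE and GROUP BY but before HAVING. That would explain why WHERE and GROUP BY use the original column names (it later turned out that only WHERE requires original column names; GROUP BY accepts either original names or aliases, and even the positions 1, 2 referring to the SELECT columns, although Mosh advises against that), while HAVING must use the column aliases from SELECT (aggregate functions excepted)
In practice, remembering and reasoning by point 2 works, but it is still worth consulting books and documentation later to settle this execution-order question for good, because it matters.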
USE sql_store;
SELECT *
FROM products
WHERE unit_price > (
SELECT unit_price
FROM products
WHERE product_id = 3
)
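MySQL executes the subquery (inner query) in parentheses first and returns the price of lettuce to the outer query as its result
Subqueries can be used not only in WHERE …… but also in clauses such as SELECT …… or FROM ……, as covered later in this chapter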
USE sql_hr;
SELECT *
FROM employees
WHERE salary > (
SELECT AVG(salary)
FROM employees
)
Approach:
USE sql_store;
SELECT *
FROM products
WHERE product_id NOT IN (
SELECT DISTINCT product_id
FROM order_items
)
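In the previous section the subquery returned a single value (the average salary); in this section it returns a column of values (the list of product ids that have been ordered). Later, subqueries will also return a multi-column table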
USE sql_invoicing;
SELECT *
FROM clients
WHERE client_id NOT IN (
SELECT DISTINCT client_id
FROM invoices
)
First use a subquery to find the clients that have invoices, and use that list as the filter
USE sql_invoicing;
SELECT *
FROM clients
WHERE client_id NOT IN (
SELECT DISTINCT client_id
/* Adding DISTINCT here changes what the subquery returns,
but it makes no difference to the final result */
FROM invoices
)
Method 2. Join
LEFT JOIN the clients table to the invoices table, then filter the joined result directly for clients without invoices
USE sql_invoicing;
SELECT DISTINCT client_id, name -- …… (further columns as needed)
-- SELECT DISTINCT * would not work here
FROM clients
LEFT JOIN invoices USING (client_id)
-- an inner join would not work here: clients without invoices (our target) would be dropped outright
WHERE invoice_id IS NULL
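For this case the subquery is more readable, but sometimes a subquery becomes too complex (too many nesting levels) and a join is the better choice (as in the exercise below). Either way, readability is an important factor when choosing an approach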
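That the subquery form and the join form pick out the same clients can be confirmed on a small data set. This is a sketch only, run through Python's sqlite3 module (SQLite in place of MySQL is an assumption), with made-up clients and invoices.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE clients  (client_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE invoices (invoice_id INTEGER PRIMARY KEY, client_id INTEGER);
    INSERT INTO clients  VALUES (1, 'A'), (2, 'B'), (3, 'C');
    INSERT INTO invoices VALUES (10, 1), (11, 1), (12, 3);
    """
)

# Method 1: subquery with NOT IN
via_subquery = conn.execute(
    """
    SELECT client_id, name
    FROM clients
    WHERE client_id NOT IN (SELECT DISTINCT client_id FROM invoices)
    ORDER BY client_id
    """
).fetchall()

# Method 2: LEFT JOIN, then keep the rows that found no matching invoice
via_join = conn.execute(
    """
    SELECT DISTINCT client_id, name
    FROM clients
    LEFT JOIN invoices USING (client_id)
    WHERE invoice_id IS NULL
    ORDER BY client_id
    """
).fetchall()

print(via_subquery, via_join)  # [(2, 'B')] [(2, 'B')]
```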
USE sql_store;
SELECT customer_id, first_name, last_name
FROM customers
WHERE customer_id IN (
-- subquery 2: from the orders table, find the customers who bought lettuce
SELECT customer_id
FROM orders
WHERE order_id IN (
-- subquery 1: from the order_items table, find the orders that contain lettuce
SELECT DISTINCT order_id
FROM order_items
WHERE product_id = 3
)
)
Method 2. Hybrid: subquery + join
USE sql_store;
SELECT customer_id, first_name, last_name
FROM customers
WHERE customer_id IN (
-- subquery: which customers bought lettuce
SELECT customer_id
FROM orders
JOIN order_items USING (order_id)
-- join: merge orders and order_items into an order-details table
WHERE product_id = 3
)
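Method 3. Joins only
Join all three tables (customers, orders, and order_items) into an order-details table that carries customer information. That joined table contains everything we need, so WHERE can filter it directly for the customers who bought lettuce (note the use of the DISTINCT keyword).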
USE sql_store;
SELECT DISTINCT customer_id, first_name, last_name
FROM customers
LEFT JOIN orders USING (customer_id)
LEFT JOIN order_items USING (order_id)
WHERE product_id = 3
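In this case, joining the tables that hold the needed information into one wide table and then filtering it is clearly easier to follow than several levels of nested subqueries
The comparisons > (MAX (……)) and > ALL (……) are equivalent and interchangeable
USE sql_invoicing;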
SELECT *
FROM invoices
WHERE invoice_total > (
SELECT MAX(invoice_total)
FROM invoices
WHERE client_id = 3
)
Method 2. Using the ALL keyword
USE sql_invoicing;
SELECT *
FROM invoices
WHERE invoice_total > ALL (
SELECT invoice_total
FROM invoices
WHERE client_id = 3
)
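In effect, the MAX inside the inner parentheses moves to the outer level and becomes ALL:
the MAX approach uses MAX() to return client 3's largest invoice total and then checks which invoices exceed that value;
the ALL approach first returns all of client 3's invoice totals, a column of values, and then uses ALL to find the invoices larger than every one of them.
The two approaches are equivalent (as long as client 3 has at least one invoice and none of the totals is NULL)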
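The comparison > ANY/SOME (……) is equivalent to > (MIN (……))
The comparison = ANY/SOME (……) is equivalent to IN (……)
Example of > ANY (……) being equivalent to > (MIN (……)): in the sql_invoicing database, select the invoices whose total is larger than any one of client 3's invoice totals (that is, larger than the smallest of them)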
USE sql_invoicing;
SELECT *
FROM invoices
WHERE invoice_total > ANY (
SELECT invoice_total
FROM invoices
WHERE client_id = 3
)
or
WHERE invoice_total > (
SELECT MIN(invoice_total)
FROM invoices
WHERE client_id = 3
)
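The ALL/MAX and ANY/MIN equivalences described above can be modelled in plain Python (a model of the comparison logic, not SQL). The tuples are made-up data and all variable names are illustrative; the model only covers a non-empty list of totals without NULLs.

```python
# Made-up (invoice_id, client_id, invoice_total) rows, for illustration only.
invoices = [
    (1, 3, 120.0),
    (2, 3, 180.0),
    (3, 5, 150.0),
    (4, 1, 200.0),
    (5, 2, 180.0),
]

# Plays the role of: SELECT invoice_total FROM invoices WHERE client_id = 3
client3_totals = [total for _, client_id, total in invoices if client_id == 3]

# > (SELECT MAX(...))  versus  > ALL (...)
above_max = [inv_id for inv_id, _, total in invoices if total > max(client3_totals)]
above_all = [
    inv_id for inv_id, _, total in invoices
    if all(total > t for t in client3_totals)
]

# > (SELECT MIN(...))  versus  > ANY (...)
above_min = [inv_id for inv_id, _, total in invoices if total > min(client3_totals)]
above_any = [
    inv_id for inv_id, _, total in invoices
    if any(total > t for t in client3_totals)
]

print(above_all, above_any)  # [4] [2, 3, 4, 5]
```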
USE sql_invoicing;
SELECT *
FROM clients
WHERE client_id IN ( -- or = ANY (
-- subquery: clients with at least two invoices
SELECT client_id
FROM invoices
GROUP BY client_id
HAVING COUNT(*) >= 2
)
USE sql_hr;
SELECT *
FROM employees e -- key point 1
WHERE salary > (
SELECT AVG(salary)
FROM employees
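WHERE office_id = e.office_id -- key point 2
-- columns of the subquery's table need no prefix; columns of the outer query's table must be prefixed, which is how the two are told apart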
)
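Correlated subqueries can be slow, because they are evaluated once for each row of the outer query, but they are powerful and have many practical uses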
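A minimal sketch of the correlated subquery above, run through Python's sqlite3 module (SQLite standing in for MySQL is an assumption) on a made-up employees table. For each outer row, the inner query recomputes the average salary of that row's office.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE employees (
        employee_id INTEGER PRIMARY KEY,
        office_id   INTEGER,
        salary      REAL
    );
    INSERT INTO employees VALUES
        (1, 1, 50000), (2, 1, 70000),
        (3, 2, 90000), (4, 2, 110000), (5, 2, 100000);
    """
)

rows = conn.execute(
    """
    SELECT employee_id, office_id, salary
    FROM employees e
    WHERE salary > (
        SELECT AVG(salary)
        FROM employees
        WHERE office_id = e.office_id   -- correlation with the outer row
    )
    ORDER BY employee_id
    """
).fetchall()

print(rows)  # [(2, 1, 70000.0), (4, 2, 110000.0)]
```

Office 1 averages 60000 and office 2 averages 100000, so employee 5 (exactly at the office average) is not returned.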
USE sql_invoicing;
SELECT *
FROM invoices i
WHERE invoice_total > (
-- subquery: the average invoice total of the current client
SELECT AVG(invoice_total)
FROM invoices
WHERE client_id = i.client_id
)
USE sql_invoicing;
SELECT *
FROM clients
WHERE client_id IN (
SELECT DISTINCT client_id
FROM invoices
)
Method 2. Join
USE sql_invoicing;
SELECT DISTINCT client_id, name -- …… (further columns as needed)
FROM clients
JOIN invoices USING (client_id)
-- inner join: keeps only the clients that have invoices
The third method uses the EXISTS operator
USE sql_invoicing;
SELECT *
FROM clients c
WHERE EXISTS (
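SELECT *
/* SELECT client_id would work equally well: for this subquery the select list
does not affect the result, because EXISTS only checks whether any row
is returned and yields TRUE or FALSE */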
FROM invoices
WHERE client_id = c.client_id
)
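This is again a correlated subquery, because it references the clients table of the outer query, and it is likewise evaluated row by row against the outer query. Specifically, for each client in the clients table (aliased c), the subquery looks up that client's invoices in the invoices table (those with client_id = c.client_id) and returns them if any exist, or an empty set otherwise. EXISTS then yields TRUE or FALSE depending on whether that set is empty (that is, whether the client has any invoices), and the outer query uses this to decide whether to keep the row.
By comparison, method 1 has the subquery return a list of ids of clients with invoices, such as (1, 3, 8, ……), and then tests membership with the IN operator. If the subquery's table is very large, that list may hold millions or even hundreds of millions of ids, which takes a lot of memory and hurts performance. When the subquery would return such a large result set, EXISTS with a correlated subquery, filtering row by row, is more efficient
Also, because EXISTS returns TRUE/FALSE, it can naturally be negated with NOT; see the exercise below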
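A small sketch comparing IN, EXISTS, and NOT EXISTS on made-up data, run through Python's sqlite3 module (SQLite in place of MySQL is an assumption). It only shows that the forms agree on the result, not how their performance differs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE clients  (client_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE invoices (invoice_id INTEGER PRIMARY KEY, client_id INTEGER);
    INSERT INTO clients  VALUES (1, 'A'), (2, 'B'), (3, 'C');
    INSERT INTO invoices VALUES (10, 1), (11, 1), (12, 3);
    """
)

with_in = conn.execute(
    """
    SELECT client_id, name
    FROM clients
    WHERE client_id IN (SELECT client_id FROM invoices)
    ORDER BY client_id
    """
).fetchall()

with_exists = conn.execute(
    """
    SELECT client_id, name
    FROM clients c
    WHERE EXISTS (
        SELECT 1 FROM invoices WHERE client_id = c.client_id
    )
    ORDER BY client_id
    """
).fetchall()

without_invoices = conn.execute(
    """
    SELECT client_id, name
    FROM clients c
    WHERE NOT EXISTS (
        SELECT 1 FROM invoices WHERE client_id = c.client_id
    )
    ORDER BY client_id
    """
).fetchall()

print(with_in, with_exists, without_invoices)
```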
USE sql_store;
SELECT *
FROM products
WHERE product_id NOT IN (
SELECT product_id
-- adding DISTINCT makes no difference to the final result
FROM order_items
)
or
SELECT *
FROM products p
WHERE NOT EXISTS (
SELECT *
FROM order_items
WHERE product_id = p.product_id
)
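For a large retailer like Amazon, the IN + subquery approach could have the subquery return a list of millions of product ids; in that situation EXISTS with a correlated subquery, checking row by row, is more efficient
Subqueries can be used not only in the WHERE clause but also in the SELECT clause and the FROM clause. This lesson covers subqueries in the SELECT clause
Put simply, the SELECT clause determines which fields the query result contains. Each field can be an expression, and the elements of that expression can be original columns, literal values, or equally the results of all kinds of subqueries
Every subquery is just a nesting of simple queries; there is nothing new, only one more level. Working through it layer by layer from the inside out makes it clear
Pay particular attention to the technique of using a subquery to reference a column alias defined at the same level of the SELECT list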
USE sql_invoicing;
SELECT
invoice_id,
invoice_total,
(SELECT AVG(invoice_total) FROM invoices) AS invoice_average,
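/* AVG cannot be used directly here: an aggregate is "dominant" and would collapse the result into a single row.
Instead, the parenthesized subquery (SELECT AVG(invoice_total) FROM invoices)
is added to the main query as a single numeric value, 152.388235 */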
invoice_total - (SELECT invoice_average) AS difference
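/* A SELECT expression must use original column names; the alias invoice_average cannot be referenced directly.
To use the alias, wrap it in a subquery: (SELECT alias defined at the same level).
This subquery is admittedly a little hard to grasp, but knowing how to use it is enough */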
FROM invoices
USE sql_invoicing;
SELECT
client_id,
name,
(SELECT SUM(invoice_total) FROM invoices WHERE client_id = c.client_id) AS total_sales,
-- to get the invoice total of the matching client, use a correlated subquery: WHERE client_id = c.client_id
(SELECT AVG(invoice_total) FROM invoices) AS average,
(SELECT total_sales - average) AS difference
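/* As noted earlier, referencing a column alias from the same level requires parentheses and SELECT.
Unlike the subqueries on the two lines above, a reference to a same-level alias needs no source,
so there is no FROM …… */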
FROM clients c
Note that for the fourth client both total_sales and difference are NULL
USE sql_invoicing;
SELECT *
FROM (
SELECT
client_id,
name,
(SELECT SUM(invoice_total) FROM invoices WHERE client_id = c.client_id) AS total_sales,
(SELECT AVG(invoice_total) FROM invoices) AS average,
(SELECT total_sales - average) AS difference
FROM clients c
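) AS sales_summary
/* When a subquery is used in FROM, that is, as a "derived table",
the derived table must be given an alias (whether or not it is used). This is mandatory; omitting it raises:
Error Code: 1248. Every derived table
must have its own alias */
WHERE total_sales IS NOT NULL
Nesting a complex subquery inside FROM makes the whole query look overly complicated. It would be better to store the subquery result as a view named sales_summary and then use that view directly as the source table; views are covered later.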
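One thing a derived table buys is that its column aliases become ordinary columns for the outer query, so they can be used in WHERE and in further expressions. A minimal sketch, assuming SQLite (via Python's sqlite3) as a stand-in for MySQL and using made-up rows; here the difference column is computed in the outer query rather than with the same-level alias technique shown earlier.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE clients  (client_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE invoices (
        invoice_id INTEGER PRIMARY KEY, client_id INTEGER, invoice_total REAL
    );
    INSERT INTO clients  VALUES (1, 'A'), (2, 'B'), (3, 'C');
    INSERT INTO invoices VALUES (10, 1, 100), (11, 1, 300), (12, 3, 200);
    """
)

rows = conn.execute(
    """
    SELECT client_id, name, total_sales, average,
           total_sales - average AS difference   -- aliases usable out here
    FROM (
        SELECT
            client_id,
            name,
            (SELECT SUM(invoice_total) FROM invoices
             WHERE client_id = c.client_id)           AS total_sales,
            (SELECT AVG(invoice_total) FROM invoices) AS average
        FROM clients c
    ) AS summary            -- the derived table needs an alias
    WHERE total_sales IS NOT NULL
    ORDER BY client_id
    """
).fetchall()

print(rows)  # [(1, 'A', 400.0, 200.0, 200.0), (3, 'C', 200.0, 200.0, 0.0)]
```

Client 2 has no invoices, so its total_sales is NULL and the outer WHERE removes it.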
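Built-in functions for working with numbers, text, dates, and so on
SELECT ROUND(5.7365, 2) -- round to 2 decimal places
SELECT TRUNCATE(5.7365, 2) -- truncate to 2 decimal places
SELECT CEILING(5.2) -- ceiling: the smallest integer greater than or equal to the number
SELECT FLOOR(5.6) -- floor: the largest integer less than or equal to the number
SELECT ABS(-5.2) -- absolute value
SELECT RAND() -- random value between 0 and 1
SELECT LENGTH('sky') -- length of the string (LENGTH counts bytes; CHAR_LENGTH counts characters)
SELECT UPPER('sky') -- convert to upper case
SELECT LOWER('Sky') -- convert to lower case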
SELECT LTRIM(' Sky')
SELECT RTRIM('Sky ')
SELECT TRIM(' Sky ')
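-- take the left part, the right part, the middle part
SELECT LEFT('Kindergarden', 4) -- the leftmost 4 characters
SELECT RIGHT('Kindergarden', 6) -- the rightmost 6 characters
SELECT SUBSTRING('Kindergarden', 7, 6)
-- the substring of length 6 starting at position 7
-- note that positions are counted from 1, not 0
-- if the third argument (the length) is omitted, the substring runs to the end of the string
SELECT LOCATE('gar', 'Kindergarden') -- position of the first occurrence
-- returns 0 if not found (most programming languages return -1, probably because their indexes start at 0)
-- this search is case-insensitive as well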
SELECT REPLACE('Kindergarten', 'garten', 'garden')
USE sql_store;
SELECT CONCAT(first_name, ' ', last_name) AS full_name
-- CONCAT is short for "concatenate"
FROM customers
SELECT NOW() -- 2020-09-12 08:50:46
SELECT CURDATE() -- current date, 2020-09-12
SELECT CURTIME() -- current time, 08:50:46
SELECT YEAR(NOW()) -- 2020
There are also MONTH, DAY, HOUR, MINUTE, and SECOND.
SELECT DAYNAME(NOW()) -- Saturday
SELECT MONTHNAME(NOW()) -- September
Standard SQL has a similar function, EXTRACT(). If the code needs to run on different DBMSs, EXTRACT() is the better choice:
SELECT EXTRACT(YEAR FROM NOW())
The first argument can of course also be MONTH, DAY, HOUR, ……
In general: EXTRACT(unit FROM date-or-datetime value)
USE sql_store;
SELECT *
FROM orders
WHERE YEAR(order_date) = YEAR(now())
SELECT DATE_FORMAT(NOW(), '%M %d, %Y') -- September 12, 2020
-- in format specifiers, upper and lower case mean different things; this is the first place so far where case matters in SQL
SELECT TIME_FORMAT(NOW(), '%H:%i %p') -- 11:07 AM
SELECT DATE_ADD(NOW(), INTERVAL -1 DAY)
SELECT DATE_SUB(NOW(), INTERVAL 1 YEAR)
SELECT NOW() - INTERVAL 1 DAY
SELECT NOW() - INTERVAL 1 YEAR
SELECT DATEDIFF('2019-01-01 09:00', '2019-01-05') -- -4
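-- the time part is ignored; only the difference in dates is computed
Use the TIME_TO_SEC function to compute differences between times
TIME_TO_SEC: the number of seconds elapsed from 00:00 to the given time
```sql
SELECT TIME_TO_SEC('09:00') -- 32400
SELECT TIME_TO_SEC('09:00') - TIME_TO_SEC('09:02') -- -120
```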
USE sql_store;
SELECT
order_id,
IFNULL(shipper_id, 'Not Assigned') AS shipper
/* If expr1 is not NULL, IFNULL() returns expr1;
otherwise it returns expr2. */
FROM orders
USE sql_store;
SELECT
order_id,
COALESCE(shipper_id, comments, 'Not Assigned') AS shipper
/* Returns the first non-NULL value in the list,
or NULL if there are no non-NULL values. */
FROM orders
COALESCE returns the first non-NULL value in a list of values, which makes it more flexible
(to coalesce: to merge, combine, unite)
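The difference between the two functions shows up when a second fallback column exists. A minimal sketch, run through Python's sqlite3 module (SQLite also provides IFNULL and COALESCE; using it in place of MySQL is an assumption), with made-up orders:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE orders (
        order_id INTEGER PRIMARY KEY, shipper_id INTEGER, comments TEXT
    );
    INSERT INTO orders VALUES
        (1, 4, NULL),
        (2, NULL, 'Leave at door'),
        (3, NULL, NULL);
    """
)

rows = conn.execute(
    """
    SELECT
        order_id,
        IFNULL(shipper_id, 'Not Assigned')             AS shipper_ifnull,
        COALESCE(shipper_id, comments, 'Not Assigned') AS shipper_coalesce
    FROM orders
    ORDER BY order_id
    """
).fetchall()

for row in rows:
    print(row)
```

For order 2, IFNULL falls straight through to 'Not Assigned', while COALESCE picks up the comment first; for order 3 both end at the final default.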
USE sql_store;
SELECT
CONCAT(first_name, ' ', last_name) AS customer,
IFNULL(phone, 'Unknown') AS phone -- COALESCE(phone, 'Unknown') gives the same result
FROM customers
USE sql_store;
SELECT
*,
IF(YEAR(order_date) = YEAR(NOW()),
'Active',
'Archived') AS category
FROM orders
USE sql_store;
SELECT
product_id,
name,
COUNT(*) AS orders,
IF(COUNT(*) = 1, 'Once', 'Many times') AS frequency
/* The inner join below already filters out products with no orders,
so a count of 0 does not need to be considered here */
FROM products
JOIN order_items USING(product_id)
GROUP BY product_id
Also, referencing the same-level column alias orders does not work in any form:
Writing IF(orders = 1, 'Once', 'Many times') AS frequency
raises: Error Code: 1054. Unknown column 'orders' in 'field list'
Writing IF((SELECT orders) = 1, 'Once', 'Many times') AS frequency
raises: Error Code: 1247. Reference 'orders' not supported (reference to group function)
CASE
WHEN …… THEN ……
WHEN …… THEN ……
WHEN …… THEN ……
……
[ELSE ……] (the ELSE clause is optional)
END
USE sql_store;
SELECT
order_id,
CASE
WHEN YEAR(order_date) = YEAR(NOW()) THEN 'Active'
WHEN YEAR(order_date) = YEAR(NOW()) - 1 THEN 'Last Year'
WHEN YEAR(order_date) < YEAR(NOW()) - 1 THEN 'Archived'
ELSE 'Future'
END AS 'category'
FROM orders
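ELSE 'Future' is optional. Experiment shows that if the categories are incomplete, for example when only the conditions for this year and last year are written, the category field of records that fall outside those two categories will be NULL.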
USE sql_store;
SELECT
CONCAT(first_name, ' ', last_name) AS customer,
points,
CASE
WHEN points < 2000 THEN 'Bronze'
WHEN points BETWEEN 2000 AND 3000 THEN 'Silver'
WHEN points > 3000 THEN 'Gold'
-- ELSE null
END AS category
FROM customers
ORDER BY points DESC
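The earlier remark that a CASE without a matching branch yields NULL can be checked on the points tiers. A minimal sketch, run through Python's sqlite3 module (SQLite in place of MySQL is an assumption), with a made-up customers table; NULL arrives in Python as None.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, points INTEGER);
    INSERT INTO customers VALUES (1, 500), (2, 2500), (3, 3500);
    """
)

rows = conn.execute(
    """
    SELECT
        customer_id,
        CASE
            WHEN points < 2000 THEN 'Bronze'
            WHEN points BETWEEN 2000 AND 3000 THEN 'Silver'
        END AS category_no_else,        -- no branch for points > 3000
        CASE
            WHEN points < 2000 THEN 'Bronze'
            WHEN points BETWEEN 2000 AND 3000 THEN 'Silver'
            ELSE 'Gold'
        END AS category_with_else
    FROM customers
    ORDER BY customer_id
    """
).fetchall()

print(rows)
# [(1, 'Bronze', 'Bronze'), (2, 'Silver', 'Silver'), (3, None, 'Gold')]
```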
The same can be done with nested IFs, with even less code, but the structure is not as clear or as readable as a CASE expression
SELECT
CONCAT(first_name, ' ', last_name) AS customer,
points,
IF(points < 2000, 'Bronze',
IF(points BETWEEN 2000 AND 3000, 'Silver',
-- the second-level condition can also be simplified to <= 3000
IF(points > 3000, 'Gold', null))) AS category
FROM customers
ORDER BY points DESC