请列出所有电影的ID,名称和出版国
SELECT title,id,'美国' as country FROM movies
INNER JOIN
只会保留两个表都存在的数据(还记得之前的交集吗),这看起来意味着一些数据的丢失,在某些场景下会有问题.
真实世界中两个表存在差异很正常,所以我们需要更多的连表方式,也就是本节要介绍的左连接LEFT JOIN
,右连接RIGHT JOIN
和 全连接FULL JOIN
. 这几个 连接方式都会保留不能匹配的行。
用LEFT/RIGHT/FULL JOINs 做多表查询
SELECT column, another_column, … FROM mytable **INNER/LEFT/RIGHT/FULL JOIN another_table ON mytable.id = another_table.matching_id** WHERE *condition(s)* ORDER BY column, … ASC/DESC LIMIT num_limit OFFSET num_offset;
和INNER JOIN
语法几乎是一样的. 我们看看这三个连接方法的工作原理:
在表A 连接 B, LEFT JOIN
保留A的所有行,不管有没有能匹配上B 反过来 RIGHT JOIN
则保留所有B里的行。最后FULL JOIN
不管有没有匹配上,同时保留A和B里的所有行
我们还是可以用集合的图示来描述:
LEFT JOIN
RIGHT JOIN
反过来
FULL JOIN
将两个表数据1-1连接,保留A或B的原有行,如果某一行在另一个表不存在,会用 NULL来填充结果数据。所有在用这三个JOIN时,你需要单独处理 NULL
. 关于 NULL
下一节会做更详细的说明
小贴士?
这些Join也可以写作 LEFT OUTER JOIN
, RIGHT OUTER JOIN
, 或 FULL OUTER JOIN
, 和 LEFT JOIN
, RIGHT JOIN
, and FULL JOIN
等价.
我们会用两个新的表雇员表 Employees 和 办公室表 Buildings. 有一些办公室是新的,还没有雇员.
注意:因为我们这个练习DB的限制,只可以用 LEFT JOIN
来解决问题.
Table: Employees (Read-Only)
Role | Name | Building | Years_employed |
---|---|---|---|
Engineer | Becky A. | 1e | 4 |
Engineer | Dan B. | 1e | 2 |
Engineer | Sharon F. | 1e | 6 |
Engineer | Dan M. | 1e | 4 |
Engineer | Malcom S. | 1e | 1 |
Artist | Tylar S. | 2w | 2 |
Artist | Sherman D. | 2w | 8 |
Artist | Jakob J. | 2w | 6 |
Artist | Lillia A. | 2w | 7 |
Artist | Brandon J. | 2w | 7 |
Manager | Scott K. | 1e | 9 |
Manager | Shirlee M. | 1e | 3 |
Manager | Daria O. | 2w | 6 |
Engineer | Yancy I. | null | 0 |
Artist | Oliver P. | null | 0 |
Table: Buildings (Read-Only)
Building_name | Capacity |
---|---|
1e | 24 |
1w | 32 |
2e | 16 |
2w | 20 |
练习 do it — 请完成如下任务
buildings
)名字两种都可以通过。居然没用left,奇怪
① SELECT DISTINCT building FROM employees where building is not null
② SELECT distinct e.Building FROM employees e
join Buildings b
on e.Building= b.Building_name
SELECT * FROM Buildings;
DISTINCT
)开始没想到角色是什么…
SELECT distinct b.building_name, e.Role
FROM buildings b
left join employees e
on b.Building_name = e.Building
buildings
)和对应的容量在上一题的基础上:
SELECT distinct b.*
FROM buildings b
left join employees e
on b.Building_name = e.Building
where e.Role is not null
之前我们已经接触过NULL
. 在数据库中,NULL
表达的是 "无"的概念,或者说没有东西。因为 NULL的存在,我们需要在编写SQL时考虑到某个属性列可能是 NULL的情况, 这种特殊性会造成编写SQL的复杂性,所以没有必要的情况下,我们应该尽量减少 NULL
的使用,让数据中尽可能少出现 NULL
的情况。
如果某个字段你没有填写到数据库,很可能就会出现NULL
。所有一个常见的方式就是为字段设置默认值,比如 数字的默认值设置为0,字符串设置为 ""字符串. 但是在一些NULL
表示它本来含义的场景,需要注意是否设置默认值还是保持NULL
。 (比如, 当你计算一些行的平均值的时候,如果是0会参与计算导致平均值差错,是NULL
则不会参与计算).
还有一些情况很难避免 NULL
的出现, 比如之前说的 outer-joining 多表连接,A和B有数据差异时,必须用 NULL
来填充。这种情况,可以用IS NULL
和 IS NOT NULL
来选在某个字段是否等于 NULL
.
在查询条件中处理 NULL
SELECT column, another_column, …
FROM mytable **
WHERE column IS/IS NOT NULL**
AND/OR *another_condition* AND/OR …;
还是会用到 Employees 和 Buildings 两个表,这一次我们雇佣了一些雇员但还没有分配办公室. 你需要综合之前的知识来解决这些问题!
练习 do it — 请完成如下任务
SELECT name, Role FROM employees
where Building is null;
SELECT b.building_name FROM buildings b
left join employees e
on e.Building = b.Building_name
where e.Role is null;
Table: movies (Read-only)
Id | Title | Director | Year | Length_minutes |
---|---|---|---|---|
1 | Toy Story | John Lasseter | 1995 | 81 |
2 | A Bug’s Life | John Lasseter | 1998 | 95 |
3 | Toy Story 2 | John Lasseter | 1999 | 93 |
4 | Monsters, Inc. | Pete Docter | 2001 | 92 |
5 | Finding Nemo | Finding Nemo | 2003 | 107 |
6 | The Incredibles | Brad Bird | 2004 | 116 |
7 | Cars | John Lasseter | 2006 | 117 |
8 | Ratatouille | Brad Bird | 2007 | 115 |
9 | WALL-E | Andrew Stanton | 2008 | 104 |
10 | Up | Pete Docter | 2009 | 101 |
11 | Toy Story 3 | Lee Unkrich | 2010 | 103 |
12 | Cars 2 | John Lasseter | 2011 | 120 |
13 | Brave | Brenda Chapman | 2012 | 102 |
14 | Monsters University | Dan Scanlon | 2013 | 110 |
Table: boxoffice (Read-only)
Movie_id | Rating | Domestic_sales | International_sales |
---|---|---|---|
5 | 8.2 | 380843261 | 555900000 |
14 | 7.4 | 268492764 | 475066843 |
8 | 8 | 206445654 | 417277164 |
12 | 6.4 | 191452396 | 368400000 |
3 | 7.9 | 245852179 | 239163000 |
6 | 8 | 261441092 | 370001000 |
9 | 8.5 | 223808164 | 297503696 |
11 | 8.4 | 415004880 | 648167031 |
1 | 8.3 | 191796233 | 170162503 |
7 | 7.2 | 244082982 | 217900167 |
10 | 8.3 | 293004164 | 438338580 |
4 | 8.1 | 289916256 | 272900000 |
2 | 7.2 | 162798565 | 200600000 |
13 | 7.2 | 237283207 | 301700000 |
SELECT id,title,(Domestic_sales+International_sales)/1000000 FROM movies m
join boxoffice b
on m.id = b.Movie_id;
Rating
的10倍为市场指数)SELECT id,title,Rating*10 FROM movies m
join boxoffice b
on m.id = b.Movie_id;
SELECT id,title,year FROM movies m
join boxoffice b
on m.id = b.Movie_id
where year%2=0
SELECT title,(Domestic_sales+International_sales)/Length_minutes FROM movies m
join boxoffice b
on b.Movie_id = m.id
where director = 'John Lasseter'
order by (Domestic_sales+International_sales)/Length_minutes desc
limit 3 ;
没想到还要把长度打印出来… length():求字符串长度
SELECT title,length(m.Title),(Domestic_sales+
International_sales) FROM movies m
join boxoffice b
on b.Movie_id = m.id
order by length(m.Title) desc
limit 3
下面介绍几个常用统计函数:
Function | Description |
---|---|
COUNT(*), COUNT(column) | 计数!COUNT(*) 统计数据行数,COUNT(column) 统计column非NULL的行数. |
MIN(column) | 找column最小的一行. |
MAX(column) | 找column最大的一行. |
**AVG(**column) | 对column所有行取平均值. |
SUM(column) | 对column所有行求和. |
分组函数自动忽略null
分组统计
GROUP BY
数据分组语法可以按某个col_name对数据进行分组,如:GROUP BY Year
指对数据按年份分组, 相同年份的分到一个组里。如果把统计函数和GROUP BY
结合,那统计结果就是对分组内的数据统计了.
GROUP BY
分组结果的数据条数,就是分组数量,比如:GROUP BY Year
,全部数据里有几年,就返回几条数据, 不管是否应用了统计函数.
找出就职年份最高的雇员(列出雇员名字+年份)
SELECT name,Years_employed FROM employees
order by Years_employed desc
limit 1
按角色(Role
)统计一下每个角色的平均就职年份
SELECT role,avg(Years_employed) FROM employees
group by role
按办公室名字总计一下就职年份总和
SELECT building,sum(Years_employed) FROM employees
group by building
每栋办公室按人数排名,不要统计无办公室的雇员
观察Building里有两个为null的,去掉它们。分组后用count统计行数
select Building,count(Building) from Employees
where Building is not null
group by Building
就职1,3,5,7年的人分别占总人数的百分比率是多少(给出年份和比率"50%" 记为 50)
①求出总人数select count(*)from employees
②选Years_employed,与其对应的人数
这里第一个count(*)算的是分组之后的对应人数
经过评论区指正,我一开始写的这个是错误的:×
select Years_employed,count(*)*100/(select count(*)from employees) as Rating
from Employees
where Years_employed in (1,3,5,7)
group by Years_employed
正确SQL:
select Years_employed, ROUND(count(*)*1.0*100/(select count(*)from employees) ) as Rating
from Employees
where Years_employed in (1,3,5,7)
group by Years_employed
我们将更加深入的分析 Employee 数据,综合运用不同的SQL语法来解决这些问题.
Role | Name | Building | Years_employed |
---|---|---|---|
Engineer | Becky A. | 1e | 4 |
Engineer | Dan B. | 1e | 2 |
Engineer | Sharon F. | 1e | 6 |
Engineer | Dan M. | 1e | 4 |
Engineer | Malcom S. | 1e | 1 |
Artist | Tylar S. | 2w | 2 |
Artist | Sherman D. | 2w | 8 |
Artist | Jakob J. | 2w | 6 |
Artist | Lillia A. | 2w | 7 |
Artist | Brandon J. | 2w | 7 |
Manager | Scott K. | 1e | 9 |
Manager | Shirlee M. | 1e | 3 |
Manager | Daria O. | 2w | 6 |
Engineer | Yancy I. | null | 0 |
Artist | Oliver P. | null | 0 |
方法①:把Artist的查询结果当作临时表
SELECT count(*) FROM (select * from employees where role = 'Artist');
方法②:按角色role先分组,再求和
select count(role)
from employees
where role = 'Artist'
group by role;
SELECT role ,count(*) FROM employees
group by role;
SELECT sum(Years_employed) FROM employees
where role = 'Engineer'
group by role;
①按照题目要求,按角色、是否有办公室来分组 ②统计case when,如果是null要填0,有则填1。这里第二种方法is not null为1就有办公室
SELECT role,count(*),
case when building is null then'0' else '1' end as have_Building
FROM employees
group by role, building is null
SELECT count(*) as count,Role,building is not null as have_building
FROM employees
group by Role,building is not null
年份分组就是“除以3”
SELECT Role,Years_employed/3 as year_3,count(*) as count
FROM employees
group by Role,year_3
order by count esc
SELECT role,count(*),Years_employed/3
FROM employees
group by role,Years_employed/3 ;
这才是完整的SELECT查询
要记住书写的顺序
SELECT DISTINCT column, AGG_FUNC(column_or_expression), …
FROM mytable
JOIN another_table
ON mytable.column = another_table.column
WHERE constraint_expression
GROUP BY column
HAVING constraint_expression
ORDER BY column ASC/DESC
LIMIT count OFFSET COUNT;
1.
FROM
和JOIN
s
FROM
或JOIN
会第一个执行,确定一个整体的数据范围. 如果要JOIN不同表,可能会生成一个临时Table来用于 下面的过程。总之第一步可以简单理解为确定一个数据源表(含临时表)2.
WHERE
我们确定了数据来源
WHERE
语句就将在这个数据源中按要求进行数据筛选,并丢弃不符合要求的数据行,所有的筛选col属性 只能来自FROM
圈定的表. AS别名还不能在这个阶段使用,因为可能别名是一个还没执行的表达式3.
GROUP BY
如果你用了
GROUP BY
分组,那GROUP BY
将对之前的数据进行分组,统计等,并将是结果集缩小为分组数.这意味着 其他的数据在分组后丢弃.4.
HAVING
如果你用了
GROUP BY
分组,HAVING
会在分组完成后对结果集再次筛选。AS别名也不能在这个阶段使用.5.
SELECT
确定结果之后,
SELECT
用来对结果col简单筛选或计算,决定输出什么数据.6.
DISTINCT
如果数据行有重复
DISTINCT
将负责排重.7.
ORDER BY
在结果集确定的情况下,
ORDER BY
对结果做排序。因为SELECT
中的表达式已经执行完了。此时可以用AS别名.8.
LIMIT
/OFFSET
最后
LIMIT
和OFFSET
从排序的结果中截取部分数据
练习 do it — 请完成如下任务
SELECT Director,count(*) FROM movies m
join boxoffice b
on m.id = b.Movie_id
group by m.Director;
SELECT Director,sum(Domestic_sales+International_sales) FROM movies m
join boxoffice b
on m.id = b.Movie_id
group by m.Director;
①分组group by director;销售总额用sum(Domestic_sales+International_sales)
②**平均销售额**:总销售额/总数
③过滤再排序
SELECT director,count(*),sum(Domestic_sales+International_sales) as sum_sale,
sum(Domestic_sales+International_sales)/count(*) as avg_sale
FROM movies m
left join boxoffice b
on m.id = b.movie_id
group by director
having count(*) > 1
order by avg_sale desc limit 1
①找 单部电影销售冠军值:1063171911
②比较销售差(相减)
SELECT title,
(SELECT (Domestic_sales+
International_sales) FROM movies m
join boxoffice b
on m.id = b.movie_id
order by(Domestic_sales+
International_sales) desc limit 1)
- (Domestic_sales+International_sales) as sale_diff
FROM movies m
left join boxoffice b
on m.id = b.movie_id
order by sale_diff desc //可以不写