2.6. Joins Between Tables
Thus far, our queries have only accessed one table at a time. Queries can access multiple tables at once, or access the same table in such a way that multiple rows of the table are being processed at the same time. A query that accesses multiple rows of the same or different tables at one time is called a join query. As an example, say you wish to list all the weather records together with the location of the associated city. To do that, we need to compare the city column of each row of the weather table with the name column of all rows in the cities table, and select the pairs of rows where these values match.
Note
This is only a conceptual model. The join is usually performed in a more efficient manner than actually comparing each possible pair of rows, but this is invisible to the user.
This would be accomplished by the following query:
SELECT *
FROM weather, cities
WHERE city = name;
     city      | temp_lo | temp_hi | prcp |    date    |     name      | location
---------------+---------+---------+------+------------+---------------+-----------
 San Francisco |      46 |      50 | 0.25 | 1994-11-27 | San Francisco | (-194,53)
 San Francisco |      43 |      57 |    0 | 1994-11-29 | San Francisco | (-194,53)
(2 rows)
Observe two things about the result set:
• There is no result row for the city of Hayward. This is because there is no matching entry in the cities table for Hayward, so the join ignores the unmatched rows in the weather table. We will see shortly how this can be fixed.
• There are two columns containing the city name. This is correct because the lists of columns from the weather and cities tables are concatenated. In practice this is undesirable, though, so you will probably want to list the output columns explicitly rather than using *:
SELECT city, temp_lo, temp_hi, prcp, date, location
FROM weather, cities
WHERE city = name;
Exercise: Attempt to determine the semantics of this query when the WHERE clause is omitted.
Since the columns all had different names, the parser automatically found which table they belong to. If there were duplicate column names in the two tables you'd need to qualify the column names to show which one you meant, as in:
SELECT weather.city, weather.temp_lo, weather.temp_hi,
weather.prcp, weather.date, cities.location
FROM weather, cities
WHERE cities.name = weather.city;
It is widely considered good style to qualify all column names in a join query, so that the query won't fail if a duplicate column name is later added to one of the tables.
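To see why qualification matters, here is a minimal sketch using two hypothetical temporary tables, weather_demo and cities_demo, that are not part of the tutorial schema and that both have a column named city. The error text in the comment is what PostgreSQL reports for an ambiguous reference; other systems word it differently.

```sql
-- Hypothetical demo tables (not from the tutorial); both have a "city" column.
CREATE TEMP TABLE weather_demo (city varchar(80), temp_lo int, temp_hi int);
CREATE TEMP TABLE cities_demo (city varchar(80), location point);

INSERT INTO weather_demo VALUES ('San Francisco', 46, 50);
INSERT INTO cities_demo VALUES ('San Francisco', '(-194.0, 53.0)');

-- An unqualified reference would be rejected, because "city" exists in both tables:
--   SELECT city FROM weather_demo, cities_demo;
--   ERROR:  column reference "city" is ambiguous

-- Qualifying each column with its table name removes the ambiguity.
SELECT weather_demo.city, weather_demo.temp_lo, cities_demo.location
FROM weather_demo, cities_demo
WHERE weather_demo.city = cities_demo.city;
```

Temporary tables disappear at the end of the session, so the sketch leaves the tutorial's own weather and cities tables untouched.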
Join queries of the kind seen thus far can also be written in this alternative form:
SELECT *
FROM weather INNER JOIN cities ON (weather.city = cities.name);
This syntax is not as commonly used as the one above, but we show it here to help you understand the following topics.
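One way to check that the two spellings describe the same join, and to see how the server actually carries it out rather than comparing every pair of rows, is to ask for the query plan. The sketch below assumes a PostgreSQL server and the weather and cities tables created earlier in this tutorial. EXPLAIN only displays the plan and does not run the query.

```sql
-- Assumes the tutorial's weather and cities tables already exist.
-- Plan for the comma-separated form with the join condition in WHERE:
EXPLAIN
SELECT *
FROM weather, cities
WHERE weather.city = cities.name;

-- Plan for the explicit INNER JOIN form:
EXPLAIN
SELECT *
FROM weather INNER JOIN cities ON (weather.city = cities.name);
```

For a simple inner join like this one, both statements should generally yield the same plan. Which join method the planner picks (for example a hash join, a merge join, or a nested loop) depends on the table sizes and statistics, so no particular output is shown here.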
Now we will figure out how we can get the Hayward records back in. What we want the query to do is to scan the weather table and for each row to find the matching cities row(s). If no matching row is found we want some “empty values” to be substituted for the cities table's columns. This kind of query is called an outer join. (The joins we have seen so far are inner joins.) The command looks like this:
SELECT *
FROM weather LEFT OUTER JOIN cities ON (weather.city = cities.name);
     city      | temp_lo | temp_hi | prcp |    date    |     name      | location
---------------+---------+---------+------+------------+---------------+-----------
 Hayward       |      37 |      54 |      | 1994-11-29 |               |
 San Francisco |      46 |      50 | 0.25 | 1994-11-27 | San Francisco | (-194,53)
 San Francisco |      43 |      57 |    0 | 1994-11-29 | San Francisco | (-194,53)
(3 rows)
This query is called a left outer join because the table mentioned on the left of the join operator will have each of its rows in the output at least once, whereas the table on the right will only have those rows output that match some row of the left table. When outputting a left-table row for which there is no right-table match, empty (null) values are substituted for the right-table columns.
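Because unmatched left-table rows carry nulls in the right-table columns, the same left outer join can be filtered to list only the rows that have no match. The following is a sketch of that idea, assuming the tutorial's weather and cities tables hold the rows shown above.

```sql
-- Assumes the tutorial's weather and cities tables with the rows shown above.
-- Keep only the weather rows for which the join found no cities row:
-- in those rows cities.name was filled in with null.
SELECT weather.city, weather.date
FROM weather LEFT OUTER JOIN cities ON (weather.city = cities.name)
WHERE cities.name IS NULL;
```

With the three weather rows above, this should return only the Hayward record. The test uses IS NULL rather than = NULL, since a comparison with null by = never evaluates to true.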
Exercise: There are also right outer joins and full outer joins. Try to find out what those do.
We can also join a table against itself. This is called a self join. As an example, suppose we wish to find all the weather records that are in the temperature range of other weather records. So we need to compare the temp_lo and temp_hi columns of each weather row to the temp_lo and temp_hi columns of all other weather rows. We can do this with the following query:
SELECT W1.city, W1.temp_lo AS low, W1.temp_hi AS high,
W2.city, W2.temp_lo AS low, W2.temp_hi AS high
FROM weather W1, weather W2
WHERE W1.temp_lo < W2.temp_lo
AND W1.temp_hi > W2.temp_hi;
city | low | high | city | low | high
---------------+-----+------+---------------+-----+------
San Francisco | 43 | 57 | San Francisco | 46 | 50
Hayward | 37 | 54 | San Francisco | 46 | 50
(2 rows)
Here we have relabeled the weather table as W1 and W2 to be able to distinguish the left and right side of the join. You can also use these kinds of aliases in other queries to save some typing, e.g.:
SELECT *
FROM weather w, cities c
WHERE w.city = c.name;
You will encounter this style of abbreviating quite frequently.
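Aliases combine with the explicit join syntax and with qualified column names in the same way. A short sketch, again assuming the tutorial's weather and cities tables. The optional AS keyword is written out here for readability.

```sql
-- Assumes the tutorial's weather and cities tables already exist.
-- w and c are aliases; every column is qualified through its alias.
SELECT w.city, w.temp_lo, w.temp_hi, w.prcp, w.date, c.location
FROM weather AS w
    LEFT OUTER JOIN cities AS c ON (w.city = c.name);
```

Once a table has been given an alias, PostgreSQL expects the alias to be used throughout that query. Writing weather.city after declaring weather AS w is rejected.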
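2.6. Table Joins
So far, our examples have been limited to queries on a single table. A query can, however, involve several tables, or relate a table to itself. A query that involves several tables, or a single table related to itself, is called a join query. For example, to list the coordinates of each city together with its weather, we need to compare the city column of the weather table with the name column of the cities table and return the pairs of rows where the values match.
Note
This is only a conceptual model. In practice the database executes a join far more efficiently than the description above suggests, but this is transparent to the user.
The idea described above can be implemented with the following statement: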
SELECT *
FROM weather, cities
WHERE city = name;
     city      | temp_lo | temp_hi | prcp |    date    |     name      | location
---------------+---------+---------+------+------------+---------------+-----------
 San Francisco |      46 |      50 | 0.25 | 1994-11-27 | San Francisco | (-194,53)
 San Francisco |      43 |      57 |    0 | 1994-11-29 | San Francisco | (-194,53)
(2 rows)
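Two points about the result set are worth noting:
• No row is returned for Hayward. The cities table has no entry for Hayward, so the query filters out the rows of the weather table that have no match. This is dealt with later in the section.
• Two columns contain the city name. This is correct, because the column lists of the weather and cities tables are concatenated. In real use it is undesirable, so it is better to list the output columns explicitly instead of using *: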
SELECT city, temp_lo, temp_hi, prcp, date, location
FROM weather, cities
WHERE city = name;
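Exercise: Try to work out what this statement does when the WHERE clause is omitted.
Because the column names in the two tables of the example above are all different, the parser works out automatically which table each column belongs to. When the two tables share a column name, you need to qualify the column names to state explicitly which table's column you mean, for example: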
SELECT weather.city, weather.temp_lo, weather.temp_hi,
weather.prcp, weather.date, cities.location
FROM weather, cities
WHERE cities.name = weather.city;
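Qualifying every column with its table in a join query is a good habit, because otherwise the query will fail if a duplicate column name is later added to one of the tables.
A join query can also be written like this: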
SELECT *
FROM weather INNER JOIN cities ON (weather.city = cities.name);
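This form is less commonly used. It is shown here only because the material that follows needs it.
Now we turn to how to get the Hayward records back. The logic is: scan the weather table and find the matching rows in the cities table. If no matching row is found in cities, fill the cities columns with empty values. This kind of query is called an outer join. (The earlier queries were all inner joins.) An example command: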
SELECT *
FROM weather LEFT OUTER JOIN cities ON (weather.city = cities.name);
     city      | temp_lo | temp_hi | prcp |    date    |     name      | location
---------------+---------+---------+------+------------+---------------+-----------
 Hayward       |      37 |      54 |      | 1994-11-29 |               |
 San Francisco |      46 |      50 | 0.25 | 1994-11-27 | San Francisco | (-194,53)
 San Francisco |      43 |      57 |    0 | 1994-11-29 | San Francisco | (-194,53)
(3 rows)
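An outer join of this kind is called a left outer join, because every row of the table to the left of the join operator appears in the result at least once, whether or not it has a match, while the table on the right contributes only its matching rows. When a left-table row has no matching row in the right table, null values are used in place of the right-table columns.
Exercise: Try to explore how right outer joins and full outer joins work.
A table can also be joined with itself, which is called a self join. For example, suppose we want to find all the weather records that fall within the temperature range of other weather records. To do so, we compare the temp_lo and temp_hi of each weather record with the temp_lo and temp_hi of every other weather row. The statement is: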
SELECT W1.city, W1.temp_lo AS low, W1.temp_hi AS high,
W2.city, W2.temp_lo AS low, W2.temp_hi AS high
FROM weather W1, weather W2
WHERE W1.temp_lo < W2.temp_lo
AND W1.temp_hi > W2.temp_hi;
city | low | high | city | low | high
---------------+-----+------+---------------+-----+------
San Francisco | 43 | 57 | San Francisco | 46 | 50
Hayward | 37 | 54 | San Francisco | 46 | 50
(2 rows)
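The example uses W1 and W2 as aliases for the weather table, to distinguish the left and right sides of the join. Aliases of this kind can be used in other queries as well: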
SELECT *
FROM weather w, cities c
WHERE w.city = c.name;
You will find that this kind of abbreviating alias is used very often.