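Original article: http://dev.mysql.com/tech-resources/articles/hierarchical-data.html
Author: Mike Hillyer
Translator: 陈建平
Abstract: Tree structures of unlimited depth are often hard to handle. The author recommends the "nested set model", which lets simple SQL operate on tree data and avoids the heavy performance cost of the repeated join queries that the common adjacency list model requires.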
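Most developers have at some point had to deal with hierarchical data in a SQL database, and have learned that relational databases are not well suited to it. Tables in a relational database are not hierarchical (as XML is); they are flat lists. The parent-child relationship in hierarchical data has no natural representation in a relational table.
Hierarchical data is a collection in which each item has a single parent and zero or more children (except the root item, which has no parent). It appears in many database applications, including forums, mailing lists, business organization charts, content management categories, and product categories. To illustrate, we will use the product category hierarchy of a fictional electronics store.
These categories form a hierarchy much like those in the other applications mentioned above. In this article we look at two ways of handling hierarchical data in MySQL, starting with the traditional adjacency list model.
Typically, the example categories would be stored in a table with the following structure (full CREATE and INSERT statements are included for the reader's convenience).
In the adjacency list model, each row in the table contains a pointer to its parent. The topmost element, in this case ELECTRONICS, has NULL as its parent. The advantage of the adjacency list model is that it is simple, and the parent-child relationships are easy to see: FLASH is a child of MP3 PLAYERS, which is a child of PORTABLE ELECTRONICS, which in turn is a child of ELECTRONICS. Adjacency lists are fairly easy to handle in client-side code, but working with them in pure SQL on the server is problematic.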
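The CREATE and INSERT statements mentioned above do not appear in this copy of the article. As a stand-in, the following is a minimal sketch of an adjacency list table, built in SQLite through Python's sqlite3 module rather than in MySQL. The table and column names (category, category_id, name, parent) come from the queries in this article; the column types and the id values (chosen to match the nested_category table shown later) are assumptions. The helper children_of is hypothetical.

```python
import sqlite3

# Assumed schema: names come from the article's queries, types are a guess.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE category ("
    " category_id INTEGER PRIMARY KEY,"
    " name TEXT NOT NULL,"
    " parent INTEGER)"
)

# Each row points at its parent's category_id; the root has no parent.
rows = [
    (1, "ELECTRONICS", None),
    (2, "TELEVISIONS", 1),
    (3, "TUBE", 2),
    (4, "LCD", 2),
    (5, "PLASMA", 2),
    (6, "PORTABLE ELECTRONICS", 1),
    (7, "MP3 PLAYERS", 6),
    (8, "FLASH", 7),
    (9, "CD PLAYERS", 6),
    (10, "2 WAY RADIOS", 6),
]
conn.executemany("INSERT INTO category VALUES (?, ?, ?)", rows)


def children_of(name):
    """Return the names of the direct children of the named category."""
    cur = conn.execute(
        "SELECT c.name FROM category AS c"
        " JOIN category AS p ON c.parent = p.category_id"
        " WHERE p.name = ? ORDER BY c.category_id",
        (name,),
    )
    return [row[0] for row in cur]


print(children_of("PORTABLE ELECTRONICS"))
```

Finding direct children takes a single join here; the difficulty described in the text only appears once a query has to span an unknown number of levels.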
The first common task when dealing with hierarchical data is displaying the entire tree, usually with some form of indentation. The usual way to do this in pure SQL is with self-joins:
SELECT t1.name AS lev1, t2.name AS lev2, t3.name AS lev3, t4.name AS lev4
FROM category AS t1
LEFT JOIN category AS t2 ON t2.parent = t1.category_id
LEFT JOIN category AS t3 ON t3.parent = t2.category_id
LEFT JOIN category AS t4 ON t4.parent = t3.category_id
WHERE t1.name = 'ELECTRONICS';
+-------------+----------------------+--------------+-------+
| lev1        | lev2                 | lev3         | lev4  |
+-------------+----------------------+--------------+-------+
| ELECTRONICS | TELEVISIONS          | TUBE         | NULL  |
| ELECTRONICS | TELEVISIONS          | LCD          | NULL  |
| ELECTRONICS | TELEVISIONS          | PLASMA       | NULL  |
| ELECTRONICS | PORTABLE ELECTRONICS | MP3 PLAYERS  | FLASH |
| ELECTRONICS | PORTABLE ELECTRONICS | CD PLAYERS   | NULL  |
| ELECTRONICS | PORTABLE ELECTRONICS | 2 WAY RADIOS | NULL  |
+-------------+----------------------+--------------+-------+
6 rows in set (0.00 sec)
We can find all the leaf nodes (nodes with no children) with a LEFT JOIN query:
SELECT t1.name FROM
category AS t1 LEFT JOIN category as t2
ON t1.category_id = t2.parent
WHERE t2.category_id IS NULL;
+--------------+
| name |
+--------------+
| TUBE |
| LCD |
| PLASMA |
| FLASH |
| CD PLAYERS |
| 2 WAY RADIOS |
+--------------+
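Self-joins can also retrieve the full path from the root to a given node.
The main limitation of this approach is that you need one self-join for every level in the hierarchy, so performance naturally degrades as the joins grow more complex.
Working with the adjacency list model in pure SQL is difficult at best. To retrieve the full path of a node, we first have to know the level at which it resides. Special care must also be taken when deleting nodes, because an entire subtree can be orphaned in the process (delete the portable electronics node, for example, and all of its children become orphans). Such problems can be handled in client-side code or in stored procedures. With a procedural language, we can start at the bottom of the tree and loop upward to return the full tree or a single path. When deleting, we can avoid creating orphans by promoting the child nodes and re-pointing them at a new parent.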
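The client-side handling described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the adjacency list is held in a dict that maps each category name to its parent's name, and the function names path_to and delete_and_promote are hypothetical.

```python
# Adjacency list as a dict: child name -> parent name (None marks the root).
parent_of = {
    "ELECTRONICS": None,
    "TELEVISIONS": "ELECTRONICS",
    "TUBE": "TELEVISIONS",
    "LCD": "TELEVISIONS",
    "PLASMA": "TELEVISIONS",
    "PORTABLE ELECTRONICS": "ELECTRONICS",
    "MP3 PLAYERS": "PORTABLE ELECTRONICS",
    "FLASH": "MP3 PLAYERS",
    "CD PLAYERS": "PORTABLE ELECTRONICS",
    "2 WAY RADIOS": "PORTABLE ELECTRONICS",
}


def path_to(node, parents):
    """Walk from node up to the root, then return the path root-first."""
    path = []
    while node is not None:
        path.append(node)
        node = parents[node]
    return path[::-1]


def delete_and_promote(node, parents):
    """Remove node and re-point its children at node's former parent."""
    new_parent = parents.pop(node)
    for child, parent in parents.items():
        if parent == node:
            parents[child] = new_parent  # value update only, keys unchanged


print(path_to("FLASH", parent_of))
```

The loop in path_to runs once per level, which is the same per-level cost that the self-join approach pays, only moved into application code.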
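The model I recommend in this article takes a different approach, commonly referred to as the nested set model. In this model we look at the hierarchy in a new way: not as nodes and lines, but as nested containers. Try picturing our electronics categories this way:
Notice how the hierarchy is preserved, with parent categories enveloping their children. We store this structure in a table by giving each node left and right values that represent its nesting.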
CREATE TABLE nested_category (
category_id INT AUTO_INCREMENT PRIMARY KEY,
name VARCHAR(20) NOT NULL,
lft INT NOT NULL,
rgt INT NOT NULL
);
INSERT INTO nested_category
VALUES(1,'ELECTRONICS',1,20),(2,'TELEVISIONS',2,9),(3,'TUBE',3,4),
(4,'LCD',5,6),(5,'PLASMA',7,8),(6,'PORTABLE ELECTRONICS',10,19),
(7,'MP3 PLAYERS',11,14),(8,'FLASH',12,13),
(9,'CD PLAYERS',15,16),(10,'2 WAY RADIOS',17,18);
SELECT * FROM nested_category ORDER BY category_id;
+-------------+----------------------+-----+-----+
| category_id | name | lft | rgt |
+-------------+----------------------+-----+-----+
| 1 | ELECTRONICS | 1 | 20 |
| 2 | TELEVISIONS | 2 | 9 |
| 3 | TUBE | 3 | 4 |
| 4 | LCD | 5 | 6 |
| 5 | PLASMA | 7 | 8 |
| 6 | PORTABLE ELECTRONICS | 10 | 19 |
| 7 | MP3 PLAYERS | 11 | 14 |
| 8 | FLASH | 12 | 13 |
| 9 | CD PLAYERS | 15 | 16 |
| 10 | 2 WAY RADIOS | 17 | 18 |
+-------------+----------------------+-----+-----+
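Because left and right are reserved words in MySQL, we use lft and rgt instead. (For the full list of MySQL reserved words, see http://dev.mysql.com/doc/mysql/en/reserved-words.html )
So how do we determine the left and right values? We start at the far left and move to the right, numbering the boundary of each set as we go, as in the following figure:
Applied to a conventional tree diagram, the numbering looks like this:
When numbering a tree, we work from left to right, one layer at a time: after assigning a node its left-hand number we descend to its children and number them, and only then assign the node's right-hand number. This approach is called the modified preorder tree traversal algorithm.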
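The numbering procedure above can be sketched as a short recursive function. This is an illustrative sketch, not code from the article: the children dict describes the sample category tree, and the function name number_tree is hypothetical.

```python
# Children of each category, listed left to right as in the sample tree.
children = {
    "ELECTRONICS": ["TELEVISIONS", "PORTABLE ELECTRONICS"],
    "TELEVISIONS": ["TUBE", "LCD", "PLASMA"],
    "PORTABLE ELECTRONICS": ["MP3 PLAYERS", "CD PLAYERS", "2 WAY RADIOS"],
    "MP3 PLAYERS": ["FLASH"],
}


def number_tree(root, children):
    """Return {name: (lft, rgt)} using a modified preorder traversal."""
    bounds = {}
    counter = 0

    def visit(node):
        nonlocal counter
        counter += 1
        lft = counter  # number the left side on the way down
        for child in children.get(node, []):
            visit(child)
        counter += 1
        bounds[node] = (lft, counter)  # number the right side on the way up

    visit(root)
    return bounds


bounds = number_tree("ELECTRONICS", children)
for name, (lft, rgt) in sorted(bounds.items(), key=lambda item: item[1][0]):
    print(name, lft, rgt)
```

For the sample tree this yields the same lft and rgt values as the nested_category table, for example (1, 20) for ELECTRONICS and (12, 13) for FLASH.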
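Because a node's lft value always falls between its parent's lft and rgt values, a single self-join that links nodes to parents can retrieve the whole tree: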
SELECT node.name
FROM nested_category AS node,
nested_category AS parent
WHERE node.lft BETWEEN parent.lft AND parent.rgt
AND parent.name = 'ELECTRONICS'
ORDER BY node.lft;
+---------------------------+
| name |
+---------------------------+
| ELECTRONICS |
| TELEVISIONS |
| TUBE |
| LCD |
| PLASMA |
| PORTABLE ELECTRONICS |
| MP3 PLAYERS |
| FLASH |
| CD PLAYERS |
| 2 WAY RADIOS |
+--------------------------+
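Unlike the earlier adjacency list examples, this query works regardless of the depth of the tree. We do not need to check the node's rgt value in the BETWEEN clause, because the rgt value always falls within the same parent as the lft value.
Finding leaf nodes is also simpler in this model than with the LEFT JOIN used for the adjacency list. If you look at the nested_category table, you will notice that the lft and rgt values of leaf nodes are consecutive numbers, so we only need to look for nodes where rgt = lft + 1:
SELECT name
FROM nested_category
WHERE rgt = lft + 1;
+--------------+
| name         |
+--------------+
| TUBE         |
| LCD          |
| PLASMA       |
| FLASH        |
| CD PLAYERS   |
| 2 WAY RADIOS |
+--------------+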
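The same arithmetic answers a related question without touching any other rows. As long as the numbering has no gaps, every descendant uses exactly two numbers strictly between its ancestor's lft and rgt, so the number of descendants is (rgt - lft - 1) / 2. The sketch below checks this against the sample data; the function names is_leaf and descendant_count are hypothetical.

```python
# lft/rgt values copied from the nested_category table in this article.
nested = {
    "ELECTRONICS": (1, 20),
    "TELEVISIONS": (2, 9),
    "TUBE": (3, 4),
    "LCD": (5, 6),
    "PLASMA": (7, 8),
    "PORTABLE ELECTRONICS": (10, 19),
    "MP3 PLAYERS": (11, 14),
    "FLASH": (12, 13),
    "CD PLAYERS": (15, 16),
    "2 WAY RADIOS": (17, 18),
}


def is_leaf(lft, rgt):
    """A leaf encloses nothing, so its two numbers are consecutive."""
    return rgt == lft + 1


def descendant_count(lft, rgt):
    """Each descendant uses two numbers strictly between lft and rgt."""
    return (rgt - lft - 1) // 2


leaves = [name for name, (lft, rgt) in nested.items() if is_leaf(lft, rgt)]
print(leaves)
print(descendant_count(*nested["PORTABLE ELECTRONICS"]))
```

Note that the descendant count relies on gap-free numbering, which holds for the sample table but is an assumption for any other data.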
With the nested set model, we can retrieve a single path without multiple self-joins:
SELECT parent.name
FROM nested_category AS node,
nested_category AS parent
WHERE node.lft BETWEEN parent.lft AND parent.rgt
AND node.name = 'FLASH'
ORDER BY parent.lft;
+--------------------------+
| name |
+--------------------------+
| ELECTRONICS |
| PORTABLE ELECTRONICS |
| MP3 PLAYERS |
| FLASH |
+--------------------------+
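We can already display the entire tree, but what if we also want the depth of each node, to better identify where it sits in the hierarchy? This can be done by adding a COUNT function and a GROUP BY clause: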
SELECT node.name, (COUNT(parent.name) - 1) AS depth
FROM nested_category AS node,
nested_category AS parent
WHERE node.lft BETWEEN parent.lft AND parent.rgt
GROUP BY node.name
ORDER BY node.lft;
+---------------------------+-------+
| name | depth |
+---------------------------+-------+
| ELECTRONICS | 0 |
| TELEVISIONS | 1 |
| TUBE | 2 |
| LCD | 2 |
| PLASMA | 2 |
| PORTABLE ELECTRONICS | 1 |
| MP3 PLAYERS | 2 |
| FLASH | 3 |
| CD PLAYERS | 2 |
| 2 WAY RADIOS | 2 |
+---------------------------+-------+
We can use the depth value with the CONCAT and REPEAT string functions to indent the category names, producing a tree-like display:
SELECT CONCAT( REPEAT(' ', COUNT(parent.name) - 1), node.name) AS name
FROM nested_category AS node,
nested_category AS parent
WHERE node.lft BETWEEN parent.lft AND parent.rgt
GROUP BY node.name
ORDER BY node.lft;
+-----------------------+
| name                  |
+-----------------------+
| ELECTRONICS           |
|  TELEVISIONS          |
|   TUBE                |
|   LCD                 |
|   PLASMA              |
|  PORTABLE ELECTRONICS |
|   MP3 PLAYERS         |
|    FLASH              |
|   CD PLAYERS          |
|   2 WAY RADIOS        |
+-----------------------+
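Of course, in a client-side application you would more likely use the depth value directly to display the hierarchy. Web developers could loop through the result set, adding <li></li> and <ul></ul> tags as the depth value rises and falls.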
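A minimal sketch of such a loop in Python follows. It assumes the rows arrive as (name, depth) pairs in tree order, as the depth query returns them, and that depth never increases by more than one from one row to the next. The function name render_tree is hypothetical.

```python
from html import escape


def render_tree(rows):
    """Turn (name, depth) rows in tree order into nested <ul>/<li> markup."""
    parts = []
    open_depth = -1  # depth of the most recently opened <li>
    for name, depth in rows:
        if depth > open_depth:
            parts.append("<ul>")  # going one level deeper
        else:
            parts.append("</li>")  # close the previous item
            for _ in range(open_depth - depth):
                parts.append("</ul></li>")  # climb back up one level
        parts.append("<li>" + escape(name))
        open_depth = depth
    for _ in range(open_depth + 1):
        parts.append("</li></ul>")  # close everything still open
    return "".join(parts)


rows = [
    ("ELECTRONICS", 0),
    ("TELEVISIONS", 1),
    ("TUBE", 2),
    ("PORTABLE ELECTRONICS", 1),
]
print(render_tree(rows))
```

Because the depth column already encodes the structure, the loop needs no lookups into the tree itself; it only compares each row's depth with the previous one.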
When we need depth information for a subtree, we cannot restrict either the node or the parent table in the self-join, because that would corrupt the results. Instead, we add a third self-join, along with a subquery that determines the depth that becomes the new starting point for our subtree:
SELECT node.name, (COUNT(parent.name) - (sub_tree.depth + 1)) AS depth
FROM nested_category AS node,
nested_category AS parent,
nested_category AS sub_parent,
(
SELECT node.name, (COUNT(parent.name) - 1) AS depth
FROM nested_category AS node,
nested_category AS parent
WHERE node.lft BETWEEN parent.lft AND parent.rgt
AND node.name = 'PORTABLE ELECTRONICS'
GROUP BY node.name
ORDER BY node.lft
)AS sub_tree
WHERE node.lft BETWEEN parent.lft AND parent.rgt
AND node.lft BETWEEN sub_parent.lft AND sub_parent.rgt
AND sub_parent.name = sub_tree.name
GROUP BY node.name
ORDER BY node.lft;
+----------------------+-------+
| name                 | depth |
+----------------------+-------+
| PORTABLE ELECTRONICS |     0 |
| MP3 PLAYERS          |     1 |
| FLASH                |     2 |
| CD PLAYERS           |     1 |
| 2 WAY RADIOS         |     1 |
+----------------------+-------+
This query can be used with any node name, including the root node. The depth values are always relative to the named node.
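Imagine you are showing a category of electronics products on a retailer's web site. When a user clicks on a category, you want to show the products in that category and list its immediate subcategories, but not the entire subtree beneath it, so there is no need to search all of its descendants. For example, when PORTABLE ELECTRONICS is clicked, we want to show MP3 PLAYERS, CD PLAYERS, and 2 WAY RADIOS, but not FLASH.
This can be done by adding a HAVING clause to the previous query: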
SELECT node.name, (COUNT(parent.name) - (sub_tree.depth + 1)) AS depth
FROM nested_category AS node,
nested_category AS parent,
nested_category AS sub_parent,
(
SELECT node.name, (COUNT(parent.name) - 1) AS depth
FROM nested_category AS node,
nested_category AS parent
WHERE node.lft BETWEEN parent.lft AND parent.rgt
AND node.name = 'PORTABLE ELECTRONICS'
GROUP BY node.name
ORDER BY node.lft
)AS sub_tree
WHERE node.lft BETWEEN parent.lft AND parent.rgt
AND node.lft BETWEEN sub_parent.lft AND sub_parent.rgt
AND sub_parent.name = sub_tree.name
GROUP BY node.name
HAVING depth <= 1
ORDER BY node.lft;
+---------------------------+-------+
| name | depth |
+---------------------------+-------+
| PORTABLE ELECTRONICS | 0 |
| MP3 PLAYERS | 1 |
| CD PLAYERS | 1 |
| 2 WAY RADIOS | 1 |
+---------------------------+-------+
If you do not want to display the parent node, change HAVING depth <= 1 to HAVING depth = 1.
Let's add a product table so that we can demonstrate aggregate functions:
CREATE TABLE product(
product_id INT AUTO_INCREMENT PRIMARY KEY,
name VARCHAR(40),
category_id INT NOT NULL
);
INSERT INTO product(name, category_id) VALUES('20" TV',3),('36" TV',3),
('Super-LCD 42"',4),('Ultra-Plasma 62"',5),('Value Plasma 38"',5),
('Power-MP3 5gb',7),('Super-Player 1gb',8),('Porta CD',9),('CD To go!',9),
('Family Talk 360',10);
SELECT * FROM product;
+------------+-------------------+-------------+
| product_id | name | category_id |
+------------+-------------------+-------------+
| 1 | 20" TV | 3 |
| 2 | 36" TV | 3 |
| 3 | Super-LCD 42" | 4 |
| 4 | Ultra-Plasma 62" | 5 |
| 5 | Value Plasma 38" | 5 |
| 6 | Power-MP3 5gb | 7 |
| 7 | Super-Player 1gb | 8 |
| 8 | Porta CD | 9 |
| 9 | CD To go! | 9 |
| 10 | Family Talk 360 | 10 |
+------------+-------------------+-------------+
Now let's write a query that retrieves the category tree along with a product count for each category:
SELECT parent.name, COUNT(product.name)
FROM nested_category AS node,
nested_category AS parent,
product
WHERE node.lft BETWEEN parent.lft AND parent.rgt
AND node.category_id = product.category_id
GROUP BY parent.name
ORDER BY parent.lft;
+----------------------+---------------------+
| name                 | COUNT(product.name) |
+----------------------+---------------------+
| ELECTRONICS          |                  10 |
| TELEVISIONS          |                   5 |
| TUBE                 |                   2 |
| LCD                  |                   1 |
| PLASMA               |                   2 |
| PORTABLE ELECTRONICS |                   5 |
| MP3 PLAYERS          |                   2 |
| FLASH                |                   1 |
| CD PLAYERS           |                   2 |
| 2 WAY RADIOS         |                   1 |
+----------------------+---------------------+
This is our typical whole-tree query with a COUNT and GROUP BY added, along with a reference to the product table and a join between the node and product tables in the WHERE clause. Because the rows are grouped by parent, the result is ordered by parent.lft. As you can see, there is a count for each category, and each count includes the products of its subcategories.
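To check the counts independently of MySQL, the query can be run against SQLite from Python. This is a sketch under assumptions: the column types are adapted for SQLite, and the query groups by parent.category_id, parent.name, and parent.lft and orders by parent.lft so that it does not depend on how a particular database treats non-grouped columns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE nested_category (
    category_id INTEGER PRIMARY KEY, name TEXT NOT NULL,
    lft INTEGER NOT NULL, rgt INTEGER NOT NULL);
CREATE TABLE product (
    product_id INTEGER PRIMARY KEY, name TEXT,
    category_id INTEGER NOT NULL);
""")
conn.executemany("INSERT INTO nested_category VALUES (?, ?, ?, ?)", [
    (1, "ELECTRONICS", 1, 20), (2, "TELEVISIONS", 2, 9), (3, "TUBE", 3, 4),
    (4, "LCD", 5, 6), (5, "PLASMA", 7, 8),
    (6, "PORTABLE ELECTRONICS", 10, 19), (7, "MP3 PLAYERS", 11, 14),
    (8, "FLASH", 12, 13), (9, "CD PLAYERS", 15, 16),
    (10, "2 WAY RADIOS", 17, 18),
])
conn.executemany("INSERT INTO product(name, category_id) VALUES (?, ?)", [
    ('20" TV', 3), ('36" TV', 3), ('Super-LCD 42"', 4),
    ('Ultra-Plasma 62"', 5), ('Value Plasma 38"', 5),
    ("Power-MP3 5gb", 7), ("Super-Player 1gb", 8),
    ("Porta CD", 9), ("CD To go!", 9), ("Family Talk 360", 10),
])

# Every product is counted once for each category that encloses its own.
counts = conn.execute("""
    SELECT parent.name, COUNT(product.name)
    FROM nested_category AS node, nested_category AS parent, product
    WHERE node.lft BETWEEN parent.lft AND parent.rgt
      AND node.category_id = product.category_id
    GROUP BY parent.category_id, parent.name, parent.lft
    ORDER BY parent.lft
""").fetchall()

for name, total in counts:
    print(name, total)
```

The MP3 PLAYERS total of 2 shows the roll-up at work: one product is assigned to MP3 PLAYERS itself and one to its child FLASH.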
If we wanted to add a new node between the TELEVISIONS and PORTABLE ELECTRONICS nodes, the new node would have lft and rgt values of 10 and 11, and every node to its right would have its lft and rgt values increased by two. This can be done with a stored procedure in MySQL 5. I will assume that most readers are still using 4.1, as it is the latest stable version (translator's note: this article is old; versions 5.0 and later have long been stable), so I will isolate my queries with a LOCK TABLES statement instead:
LOCK TABLE nested_category WRITE;
SELECT @myRight := rgt FROM nested_category
WHERE name = 'TELEVISIONS';
UPDATE nested_category SET rgt = rgt + 2 WHERE rgt > @myRight;
UPDATE nested_category SET lft = lft + 2 WHERE lft > @myRight;
INSERT INTO nested_category(name, lft, rgt) VALUES('GAME CONSOLES', @myRight + 1, @myRight + 2);
UNLOCK TABLES;
We can then check our nesting with our indented tree query:
SELECT CONCAT( REPEAT( ' ', (COUNT(parent.name) - 1) ), node.name) AS name
FROM nested_category AS node,
nested_category AS parent
WHERE node.lft BETWEEN parent.lft AND parent.rgt
GROUP BY node.name
ORDER BY node.lft;
+-----------------------+
| name                  |
+-----------------------+
| ELECTRONICS           |
|  TELEVISIONS          |
|   TUBE                |
|   LCD                 |
|   PLASMA              |
|  GAME CONSOLES        |
|  PORTABLE ELECTRONICS |
|   MP3 PLAYERS         |
|    FLASH              |
|   CD PLAYERS          |
|   2 WAY RADIOS        |
+-----------------------+
If we instead want to add a node as the child of a node that has no children yet, the procedure needs a slight change. Here we add a new FRS node beneath the 2 WAY RADIOS node:
LOCK TABLE nested_category WRITE;
SELECT @myLeft := lft FROM nested_category
WHERE name = '2 WAY RADIOS';
UPDATE nested_category SET rgt = rgt + 2 WHERE rgt > @myLeft;
UPDATE nested_category SET lft = lft + 2 WHERE lft > @myLeft;
INSERT INTO nested_category(name, lft, rgt) VALUES('FRS', @myLeft + 1, @myLeft + 2);
UNLOCK TABLES;
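In this example we increase every lft and rgt value to the right of the new parent's left-hand number, then place the new node immediately to the right of that left-hand value. Our new node is now properly nested: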
SELECT CONCAT( REPEAT( ' ', (COUNT(parent.name) - 1) ), node.name) AS name
FROM nested_category AS node,
nested_category AS parent
WHERE node.lft BETWEEN parent.lft AND parent.rgt
GROUP BY node.name
ORDER BY node.lft;
+-----------------------+
| name                  |
+-----------------------+
| ELECTRONICS           |
|  TELEVISIONS          |
|   TUBE                |
|   LCD                 |
|   PLASMA              |
|  GAME CONSOLES        |
|  PORTABLE ELECTRONICS |
|   MP3 PLAYERS         |
|    FLASH              |
|   CD PLAYERS          |
|   2 WAY RADIOS        |
|    FRS                |
+-----------------------+
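Both insertions follow one pattern: open a gap of two just to the right of a boundary value, then place the new node in that gap. For a new sibling the boundary is the rgt value of the node on its left; for the first child of a leaf it is the parent's lft value. A minimal Python sketch of the pattern on in-memory rows follows; the helper names find and insert_after are hypothetical, and the table locking from the SQL version is not modelled.

```python
# Sample tree as (name, lft, rgt), copied from the nested_category table.
rows = [
    {"name": name, "lft": lft, "rgt": rgt}
    for name, lft, rgt in [
        ("ELECTRONICS", 1, 20), ("TELEVISIONS", 2, 9), ("TUBE", 3, 4),
        ("LCD", 5, 6), ("PLASMA", 7, 8), ("PORTABLE ELECTRONICS", 10, 19),
        ("MP3 PLAYERS", 11, 14), ("FLASH", 12, 13),
        ("CD PLAYERS", 15, 16), ("2 WAY RADIOS", 17, 18),
    ]
]


def find(rows, name):
    """Return the row dict for the named category."""
    return next(row for row in rows if row["name"] == name)


def insert_after(rows, name, boundary):
    """Shift every number greater than boundary by 2, then add the node."""
    for row in rows:
        if row["rgt"] > boundary:
            row["rgt"] += 2
        if row["lft"] > boundary:
            row["lft"] += 2
    rows.append({"name": name, "lft": boundary + 1, "rgt": boundary + 2})


# New sibling to the right of TELEVISIONS: the boundary is its rgt value.
insert_after(rows, "GAME CONSOLES", find(rows, "TELEVISIONS")["rgt"])
# New child of the leaf 2 WAY RADIOS: the boundary is the parent's lft value.
insert_after(rows, "FRS", find(rows, "2 WAY RADIOS")["lft"])

for row in sorted(rows, key=lambda row: row["lft"]):
    print(row["name"], row["lft"], row["rgt"])
```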
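The last basic task is removing nodes. The course of action depends on the node's position in the hierarchy: deleting a leaf node is easier than deleting a node with children, because in the latter case we have to deal with the orphaned nodes.
Deleting a leaf node is just the reverse of adding one: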
LOCK TABLE nested_category WRITE;
SELECT @myLeft := lft, @myRight := rgt, @myWidth := rgt - lft + 1
FROM nested_category
WHERE name = 'GAME CONSOLES';
DELETE FROM nested_category WHERE lft BETWEEN @myLeft AND @myRight;
UPDATE nested_category SET rgt = rgt - @myWidth WHERE rgt > @myRight;
UPDATE nested_category SET lft = lft - @myWidth WHERE lft > @myRight;
UNLOCK TABLES;
Once this has run, we check the indented tree again to confirm that the deletion has not corrupted the hierarchy:
SELECT CONCAT( REPEAT( ' ', (COUNT(parent.name) - 1) ), node.name) AS name
FROM nested_category AS node,
nested_category AS parent
WHERE node.lft BETWEEN parent.lft AND parent.rgt
GROUP BY node.name
ORDER BY node.lft;
+-----------------------+
| name                  |
+-----------------------+
| ELECTRONICS           |
|  TELEVISIONS          |
|   TUBE                |
|   LCD                 |
|   PLASMA              |
|  PORTABLE ELECTRONICS |
|   MP3 PLAYERS         |
|    FLASH              |
|   CD PLAYERS          |
|   2 WAY RADIOS        |
|    FRS                |
+-----------------------+
The same approach deletes a node together with all of its children:
LOCK TABLE nested_category WRITE;
SELECT @myLeft := lft, @myRight := rgt, @myWidth := rgt - lft + 1
FROM nested_category
WHERE name = 'MP3 PLAYERS';
DELETE FROM nested_category WHERE lft BETWEEN @myLeft AND @myRight;
UPDATE nested_category SET rgt = rgt - @myWidth WHERE rgt > @myRight;
UPDATE nested_category SET lft = lft - @myWidth WHERE lft > @myRight;
UNLOCK TABLES;
Again, we check that the subtree was deleted successfully:
SELECT CONCAT( REPEAT( ' ', (COUNT(parent.name) - 1) ), node.name) AS name
FROM nested_category AS node,
nested_category AS parent
WHERE node.lft BETWEEN parent.lft AND parent.rgt
GROUP BY node.name
ORDER BY node.lft;
+-----------------------+
| name                  |
+-----------------------+
| ELECTRONICS           |
|  TELEVISIONS          |
|   TUBE                |
|   LCD                 |
|   PLASMA              |
|  PORTABLE ELECTRONICS |
|   CD PLAYERS          |
|   2 WAY RADIOS        |
|    FRS                |
+-----------------------+
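The deletion steps can be mirrored on in-memory rows to see why the width matters: every number to the right of the removed range moves left by exactly the size of the gap. The sketch below starts from the original sample tree rather than the modified one, and the function name delete_subtree is hypothetical.

```python
# Original sample tree as (name, lft, rgt) from the nested_category table.
rows = [
    {"name": name, "lft": lft, "rgt": rgt}
    for name, lft, rgt in [
        ("ELECTRONICS", 1, 20), ("TELEVISIONS", 2, 9), ("TUBE", 3, 4),
        ("LCD", 5, 6), ("PLASMA", 7, 8), ("PORTABLE ELECTRONICS", 10, 19),
        ("MP3 PLAYERS", 11, 14), ("FLASH", 12, 13),
        ("CD PLAYERS", 15, 16), ("2 WAY RADIOS", 17, 18),
    ]
]


def delete_subtree(rows, name):
    """Remove the named node and its descendants, then close the gap."""
    target = next(row for row in rows if row["name"] == name)
    lft, rgt = target["lft"], target["rgt"]
    width = rgt - lft + 1
    # Keep only rows whose lft lies outside the deleted range.
    kept = [row for row in rows if not (lft <= row["lft"] <= rgt)]
    for row in kept:
        if row["rgt"] > rgt:
            row["rgt"] -= width
        if row["lft"] > rgt:
            row["lft"] -= width
    return kept


rows = delete_subtree(rows, "MP3 PLAYERS")
for row in rows:
    print(row["name"], row["lft"], row["rgt"])
```

For a leaf the width is 2, so the same function covers both the leaf case and the subtree case.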
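In another scenario, we want to delete only the parent node and keep its children. Sometimes it is enough to change the node's name to a placeholder until a replacement is available, for example when a supervisor is fired.
In other cases, the parent node is deleted and its children are moved up to the level of the deleted parent: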
LOCK TABLE nested_category WRITE;
SELECT @myLeft := lft, @myRight := rgt, @myWidth := rgt - lft + 1
FROM nested_category
WHERE name = 'PORTABLE ELECTRONICS';
DELETE FROM nested_category WHERE lft = @myLeft;
UPDATE nested_category SET rgt = rgt - 1, lft = lft - 1 WHERE lft BETWEEN @myLeft AND @myRight;
UPDATE nested_category SET rgt = rgt - 2 WHERE rgt > @myRight;
UPDATE nested_category SET lft = lft - 2 WHERE lft > @myRight;
UNLOCK TABLES;
We check once more to confirm that the nodes were promoted:
SELECT CONCAT( REPEAT( ' ', (COUNT(parent.name) - 1) ), node.name) AS name
FROM nested_category AS node,
nested_category AS parent
WHERE node.lft BETWEEN parent.lft AND parent.rgt
GROUP BY node.name
ORDER BY node.lft;
+---------------+
| name          |
+---------------+
| ELECTRONICS   |
|  TELEVISIONS  |
|   TUBE        |
|   LCD         |
|   PLASMA      |
|  CD PLAYERS   |
|  2 WAY RADIOS |
|   FRS         |
+---------------+
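The promotion logic can be sketched the same way. Nodes inside the deleted parent move left by one, because only the parent's left boundary disappears in front of them, while everything to the right of the parent's rgt value moves left by two. This sketch starts from the original sample tree, and the function name delete_and_promote is hypothetical.

```python
# Original sample tree as (name, lft, rgt) from the nested_category table.
rows = [
    {"name": name, "lft": lft, "rgt": rgt}
    for name, lft, rgt in [
        ("ELECTRONICS", 1, 20), ("TELEVISIONS", 2, 9), ("TUBE", 3, 4),
        ("LCD", 5, 6), ("PLASMA", 7, 8), ("PORTABLE ELECTRONICS", 10, 19),
        ("MP3 PLAYERS", 11, 14), ("FLASH", 12, 13),
        ("CD PLAYERS", 15, 16), ("2 WAY RADIOS", 17, 18),
    ]
]


def delete_and_promote(rows, name):
    """Remove one node and lift its children to the node's former level."""
    target = next(row for row in rows if row["name"] == name)
    lft, rgt = target["lft"], target["rgt"]
    kept = [row for row in rows if row is not target]
    for row in kept:
        if lft < row["lft"] < rgt:
            # Descendant: only the parent's left boundary vanished before it.
            row["lft"] -= 1
            row["rgt"] -= 1
        else:
            # Ancestor or node to the right: both boundaries vanished.
            if row["rgt"] > rgt:
                row["rgt"] -= 2
            if row["lft"] > rgt:
                row["lft"] -= 2
    return kept


rows = delete_and_promote(rows, "PORTABLE ELECTRONICS")
for row in rows:
    print(row["name"], row["lft"], row["rgt"])
```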
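Other scenarios include deleting a parent node, promoting one of its children into its place, and moving the remaining siblings under the new parent. For the sake of space, they are not covered in this article.
I hope this article is useful to readers. The concept of nested sets in SQL has been around for more than a decade, and further information is available in books and on the Internet. In my opinion, the most valuable book on managing hierarchical data is Joe Celko's Trees and Hierarchies in SQL for Smarties, written by Joe Celko, a highly respected author in the field of advanced SQL and a prolific, much-praised technical writer. Celko's book has been invaluable in my own research and study, and I strongly recommend it. It also covers many advanced topics that this article does not, including methods for managing hierarchical data other than the adjacency list and nested set models.
In the references and resources section I list some web resources that may help readers who are researching hierarchical data, including a PHP library for handling nested sets in MySQL. Code for converting between the two models can also be found in articles such as Storing Hierarchical Data in a Database.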