原文链接:http://graphaware.com/neo4j/2015/01/16/neo4j-graph-model-design-labels-versus-indexed-properties.html
在设计Neoj图形化数据库的时候,一个通常的问题是:如何处理带标记的数据。比如,标记用户是否活跃,标记博客是否发布,标记文章是否被阅读等。
在SQL中,通常会创建一个boolean型或者tinyint型的列。在Neo4j中,可以通过两种方式达到这种效果:
在多次面临这个困境之后,我们对这两种方式的结果有了深入的了解。在此,将我们的经验分享出来,并给出一些Cypher语句上的优化方案,来帮助你对数据库进行优化。
在下面的博客中,我们会用以下场景来举例:
为了对数据库进行测试,我们利用Graphgen创建了一个小的数据集,并将其导入到图形数据库中。我们使用了下面的语句:
(user:#User *10)
(post:#BlogPost *200)
(post2:#BlogPost:ActivePost {active:{randomElement:["true"]}} *200)
(user)-[:WRITTEN *1..n]->(post)
(user)-[:WRITTEN *1..n]->(post2)
(user)-[:FOLLOWS *n..n]->(user)
当然,我们会首先对User节点中的_id属性加上一个唯一的限制
CREATE CONSTRAINT ON (user:User) ASSERT user._id IS UNIQUE;
现在,我们来查询所有状态未已发布的博客BlogPost节点
PROFILE MATCH (post:ActivePost) RETURN count(post);
会产生下面的执行结果:
neo4j-sh (?)$ PROFILE MATCH (post:ActivePost) RETURN count(post);
+-------------+
| count(post) |
+-------------+
| 200 |
+-------------+
1 row
ColumnFilter
|
+EagerAggregation
|
+NodeByLabel
+------------------+------+--------+-------------+--------------------------+
| Operator | Rows | DbHits | Identifiers | Other |
+------------------+------+--------+-------------+--------------------------+
| ColumnFilter | 1 | 0 | | keep columns count(post) |
| EagerAggregation | 1 | 0 | | |
| NodeByLabel | 200 | 201 | post, post | :ActivePost |
+------------------+------+--------+-------------+--------------------------+
Total database accesses: 201
从结果中可以看到,访问数据库的次数和ActivePost节点的个数一致,通过标签访问的效率很高。
现在,来比较一下通过索引访问的效果。
CREATE INDEX ON :BlogPost(active);
PROFILE MATCH (post:BlogPost) WHERE post.active = true RETURN count(post);
neo4j-sh (?)$ PROFILE MATCH (post:BlogPost) WHERE post.active = 'true' RETURN count(post);
+-------------+
| count(post) |
+-------------+
| 200 |
+-------------+
1 row
ColumnFilter
|
+EagerAggregation
|
+SchemaIndex
+------------------+------+--------+-------------+------------------------------------+
| Operator | Rows | DbHits | Identifiers | Other |
+------------------+------+--------+-------------+------------------------------------+
| ColumnFilter | 1 | 0 | | keep columns count(post) |
| EagerAggregation | 1 | 0 | | |
| SchemaIndex | 200 | 201 | post, post | { AUTOSTRING0}; :BlogPost(active) |
+------------------+------+--------+-------------+------------------------------------+
Total database accesses: 201
可以看出,根据索引和根据标签的查询方式效果一致
那么为什么会有这篇博客呢?
当查询节点不是语句的第一个查询的时候(the queried nodes are not in the beginning of the pattern),这两者就会产生明显的区别。
下面,让我们来查询某一个特定用户所有已发布的博客
PROFILE MATCH (user:User {_id:'c922ea0d-45d6-375b-b91a-470933592c2a'})
WITH user
MATCH (user)-[:WRITTEN]->(p:ActivePost)
RETURN count(p);
neo4j-sh (?)$ PROFILE MATCH (user:User {_id:'c922ea0d-45d6-375b-b91a-470933592c2a'})
> WITH user
> MATCH (user)-[:WRITTEN]->(p:ActivePost)
> RETURN count(p);
+----------+
| count(p) |
+----------+
| 21 |
+----------+
1 row
ColumnFilter
|
+EagerAggregation
|
+Filter
|
+SimplePatternMatcher
|
+SchemaIndex
+----------------------+------+--------+----------------------+-----------------------------+
| Operator | Rows | DbHits | Identifiers | Other |
+----------------------+------+--------+----------------------+-----------------------------+
| ColumnFilter | 1 | 0 | | keep columns count(p) |
| EagerAggregation | 1 | 0 | | |
| Filter | 21 | 21 | | hasLabel(p:ActivePost(2)) |
| SimplePatternMatcher | 21 | 35 | user, p, UNNAMED85 | |
| SchemaIndex | 1 | 2 | user, user | { AUTOSTRING0}; :User(_id) |
+----------------------+------+--------+----------------------+-----------------------------+
Total database accesses: 58
PROFILE MATCH (user:User {_id:'c922ea0d-45d6-375b-b91a-470933592c2a'})
WITH user
MATCH (user)-[:WRITTEN]->(p:BlogPost)
WHERE p.active = 'true'
RETURN count(p);
neo4j-sh (?)$ PROFILE MATCH (user:User {_id:'c922ea0d-45d6-375b-b91a-470933592c2a'})
> WITH user
> MATCH (user)-[:WRITTEN]->(p:BlogPost)
> WHERE p.active = 'true'
> RETURN count(p);
+----------+
| count(p) |
+----------+
| 21 |
+----------+
1 row
ColumnFilter
|
+EagerAggregation
|
+Filter
|
+SimplePatternMatcher
|
+SchemaIndex
+----------------------+------+--------+----------------------+------------------------------------------------------------------------+
| Operator | Rows | DbHits | Identifiers | Other |
+----------------------+------+--------+----------------------+------------------------------------------------------------------------+
| ColumnFilter | 1 | 0 | | keep columns count(p) |
| EagerAggregation | 1 | 0 | | |
| Filter | 21 | 63 | | (hasLabel(p:BlogPost(1)) AND Property(p,active(8)) == { AUTOSTRING1}) |
| SimplePatternMatcher | 21 | 105 | user, p, UNNAMED85 | |
| SchemaIndex | 1 | 2 | user, user | { AUTOSTRING0}; :User(_id) |
+----------------------+------+--------+----------------------+------------------------------------------------------------------------+
Total database accesses: 170
可以看见,在第二个查询中,Cypher并没有使用到索引,它需要遍历所有的博客节点来进行查询。这种情况下,使用标签可以有更好的性能。
下面,我们会给出一些提示来帮助你优化Cypher查询。
你也许会问:为什么不加一个草稿Draft标签来表明博客是未发布的呢?
原因在于,这么做的话,当查询已发布博客的时候,需要发起一个负向的筛选。而Cypher中,负向的操作都是很耗时的。
PROFILE MATCH (user:User {_id:'c922ea0d-45d6-375b-b91a-470933592c2a'})
WITH user
MATCH (user)-[:WRITTEN]->(p:BlogPost)
WHERE NOT p :Draft
RETURN count(p);
neo4j-sh (?)$ PROFILE MATCH (user:User {_id:'c922ea0d-45d6-375b-b91a-470933592c2a'})
> WITH user
> MATCH (user)-[:WRITTEN]->(p:BlogPost)
> WHERE NOT p :Draft
> RETURN count(p);
+----------+
| count(p) |
+----------+
| 21 |
+----------+
1 row
ColumnFilter
|
+EagerAggregation
|
+Filter
|
+SimplePatternMatcher
|
+SchemaIndex
+----------------------+------+--------+----------------------+---------------------------------------------------------+
| Operator | Rows | DbHits | Identifiers | Other |
+----------------------+------+--------+----------------------+---------------------------------------------------------+
| ColumnFilter | 1 | 0 | | keep columns count(p) |
| EagerAggregation | 1 | 0 | | |
| Filter | 21 | 42 | | (hasLabel(p:BlogPost(1)) AND NOT(hasLabel(p:Draft(3)))) |
| SimplePatternMatcher | 21 | 70 | user, p, UNNAMED85 | |
| SchemaIndex | 1 | 2 | user, user | { AUTOSTRING0}; :User(_id) |
+----------------------+------+--------+----------------------+---------------------------------------------------------+
Total database accesses: 114
在某些设计中,我们可能使用了太多的标签,导致在查询的时候,需要匹配多个标签来获取需要的节点。
而匹配多个标签,会使得Cypher发起多个hasLabel的筛选。
PROFILE MATCH (post:BlogPost:ActivePost) RETURN count(post);
neo4j-sh (?)$ PROFILE MATCH (post:BlogPost:ActivePost) RETURN count(post);
+-------------+
| count(post) |
+-------------+
| 200 |
+-------------+
1 row
ColumnFilter
|
+EagerAggregation
|
+Filter
|
+NodeByLabel
+------------------+------+--------+-------------+------------------------------+
| Operator | Rows | DbHits | Identifiers | Other |
+------------------+------+--------+-------------+------------------------------+
| ColumnFilter | 1 | 0 | | keep columns count(post) |
| EagerAggregation | 1 | 0 | | |
| Filter | 200 | 400 | | hasLabel(post:ActivePost(2)) |
| NodeByLabel | 400 | 401 | post, post | :BlogPost |
+------------------+------+--------+-------------+------------------------------+
Total database accesses: 801
可以发现,数据库访问的次数明显上升了。可以想像,当我们有百万个博客节点的时候,性能开销会很高。
在查询语句的时候,给尾节点定义标签会强制Cypher使用标签查询。
设置不同的关系可以避免你匹配过多的标签,从而更方便快速的遍历整个图,提高整体性能。
下面将模型进行调整,加入一个PUBLISHED关系,表明用户已经发布了某个博客。加入一个DRAFTED关系,表明用户未发布的博客。
MATCH (n:ActivePost)
WITH n
MATCH (n)<-[:WRITTEN]-(u)
MERGE (u)-[:PUBLISHED]->(n);
MATCH (n:BlogPost)
WHERE NOT n :ActivePost
WITH n
MATCH (n)<-[:WRITTEN]-(u)
MERGE (u)-[:DRAFTED]->(n);
Neo4j的一个优势在于,节点包含了所有关联于自己的关系。这意味着,只要我们有一个出发点,通过关系找到下一个节点几乎不需要任何开销。
PROFILE MATCH (user:User {_id:'c922ea0d-45d6-375b-b91a-470933592c2a'})
WITH user
MATCH (user)-[:PUBLISHED]->(p)
RETURN count(p);
neo4j-sh (?)$ PROFILE MATCH (user:User {_id:'c922ea0d-45d6-375b-b91a-470933592c2a'})
> WITH user
> MATCH (user)-[:PUBLISHED]->(p)
> RETURN count(p);
+----------+
| count(p) |
+----------+
| 21 |
+----------+
1 row
ColumnFilter
|
+EagerAggregation
|
+SimplePatternMatcher
|
+SchemaIndex
+----------------------+------+--------+----------------------+-----------------------------+
| Operator | Rows | DbHits | Identifiers | Other |
+----------------------+------+--------+----------------------+-----------------------------+
| ColumnFilter | 1 | 0 | | keep columns count(p) |
| EagerAggregation | 1 | 0 | | |
| SimplePatternMatcher | 21 | 0 | user, p, UNNAMED85 | |
| SchemaIndex | 1 | 2 | user, user | { AUTOSTRING0}; :User(_id) |
+----------------------+------+--------+----------------------+-----------------------------+
Total database accesses: 2
可以看到,我们没有定义任何标签,因为:PUBLISH总是指向已经发布的博客。而通过这种方式,查询上千个已发布的博客,相比与之前使用的索引或标签,开销非常小。
一个精心设计的数据库模型可以大大加快你的查询速度,提供一个良好的用户体验。通过一些测试以及使用PORFILE命令,可以帮你发现性能的瓶颈。在真正投入使用一个大型的数据库之前,这些调优的工作是必不可少的。