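Disclaimer: these are study notes. The main body follows the lecture notes of the open course "Knowledge Graph" by Professor Huajun Chen (陈华钧) of Zhejiang University, with my own annotations and thoughts added in a few places. Please respect Professor Chen's intellectual property and cite the source when you use this material. Thanks to Professor Chen.
Study date: 2023-04-18 17:05:32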
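(1) Knowledge graph storage has to take into account the structure of the knowledge, the characteristics of the graph, indexing, and query optimization.
(2) Typical knowledge graph storage engines fall into two groups: storage based on relational databases and storage based on native graph stores.
(3) A graph database is not mandatory. For example, the backend of the Wikidata project is implemented on MySQL.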
Graph data models
Both the property graph model and the RDF graph model are directed labeled graphs.
The simplest storage: Triple Store
Property Tables: property-table storage
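To make the two layouts above concrete, here is a minimal in-memory sketch. The sample facts, the `match` helper, and the `to_property_table` helper are all hypothetical names chosen for illustration; a real triple store would add indexes over the subject, predicate, and object positions.

```python
from collections import defaultdict

# Hypothetical sample facts, written as (subject, predicate, object) triples.
triples = [
    ("Alice", "type", "Person"),
    ("Alice", "age", 30),
    ("Alice", "knows", "Bob"),
    ("Bob", "type", "Person"),
    ("Bob", "age", 25),
]


def match(triples, s=None, p=None, o=None):
    """Return every triple matching the pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


def to_property_table(triples, columns):
    """Pivot the chosen predicates into one row per subject."""
    rows = defaultdict(dict)
    for s, p, o in triples:
        if p in columns:
            rows[s][p] = o
    # Missing values become None, which is where property tables get sparse.
    return {s: {c: vals.get(c) for c in columns} for s, vals in rows.items()}


alice_facts = match(triples, s="Alice")
person_table = to_property_table(triples, ["type", "age"])
print(alice_facts)
print(person_table)
```

The triple list answers any pattern with one generic scan, while the property table gives one row per entity at the cost of fixing the columns up front.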
(1) The relational model is not good at handling "relationships"
SELECT p1.Person
FROM Person p1 JOIN PersonFriend
ON PersonFriend.FriendID = p1.ID
JOIN Person p2
ON PersonFriend.PersonID = p2.ID
WHERE p2.Person = 'Bob'
SELECT p1.Person
FROM Person p1 JOIN PersonFriend
ON PersonFriend.PersonID = p1.ID
JOIN Person p2
ON PersonFriend.FriendID = p2.ID
WHERE p2.Person = 'Bob'
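Editor's note:
Given the tables shown, starting from a person and finding that person's friends is convenient and fast. Listing everyone who has Bob as a friend, however, means scanning the friend table and checking row by row whether FriendID equals Bob's ID. If that table is very large, this kind of query on traditional structured tables becomes very wasteful.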
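The two queries above can be tried out with Python's built-in `sqlite3` module. This is only a sketch: the table contents are made-up sample rows, and the table and column names simply mirror the ones used in the queries.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Person (ID INTEGER PRIMARY KEY, Person TEXT);
CREATE TABLE PersonFriend (PersonID INTEGER, FriendID INTEGER);
-- Hypothetical sample data: each row means "PersonID has FriendID as a friend".
INSERT INTO Person VALUES (1, 'Alice'), (2, 'Bob'), (99, 'Zach');
INSERT INTO PersonFriend VALUES (1, 2), (2, 1), (2, 99), (99, 1);
""")

# Bob's friends: rows where Bob is the PersonID side.
bobs_friends = [r[0] for r in conn.execute("""
    SELECT p1.Person
    FROM Person p1 JOIN PersonFriend
      ON PersonFriend.FriendID = p1.ID
    JOIN Person p2
      ON PersonFriend.PersonID = p2.ID
    WHERE p2.Person = 'Bob'
    ORDER BY p1.Person
""")]

# Who is friends with Bob: rows where Bob is the FriendID side.
friends_with_bob = [r[0] for r in conn.execute("""
    SELECT p1.Person
    FROM Person p1 JOIN PersonFriend
      ON PersonFriend.PersonID = p1.ID
    JOIN Person p2
      ON PersonFriend.FriendID = p2.ID
    WHERE p2.Person = 'Bob'
    ORDER BY p1.Person
""")]

print(bobs_friends)
print(friends_with_bob)
```

In a relational database, an index on `PersonFriend(FriendID)` avoids the full scan for the reverse direction, but each direction then needs its own index to be maintained.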
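3. Multi-hop queries over chain relationships require even more complex join computation. Some databases, such as Oracle, offer syntactic sugar such as CONNECT BY, but this does not reduce the underlying computational complexity.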
SELECT p1.Person AS PERSON, p2.Person AS FRIEND_OF_FRIEND
FROM PersonFriend pf1 JOIN Person p1
ON pf1.PersonID = p1.ID
JOIN PersonFriend pf2
ON pf2.PersonID = pf1.FriendID
JOIN Person p2
ON pf2.FriendID = p2.ID
WHERE p1.Person = 'Alice' AND pf2.FriendID <> p1.ID
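CONNECT BY is Oracle-specific. As a rough stand-in, the sketch below uses a recursive common table expression (`WITH RECURSIVE`), which SQLite supports, to expand friendships hop by hop up to a fixed depth. The sample rows and the `reach` CTE name are assumptions made for illustration. The syntax is shorter than chaining joins by hand, but the engine still performs a join against `PersonFriend` for every hop.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Person (ID INTEGER PRIMARY KEY, Person TEXT);
CREATE TABLE PersonFriend (PersonID INTEGER, FriendID INTEGER);
-- Hypothetical sample data.
INSERT INTO Person VALUES (1, 'Alice'), (2, 'Bob'), (99, 'Zach');
INSERT INTO PersonFriend VALUES (1, 2), (2, 1), (2, 99), (99, 1);
""")

MAX_DEPTH = 2  # friends (1 hop) and friends of friends (2 hops)

rows = conn.execute("""
    WITH RECURSIVE reach(ID, depth) AS (
        SELECT ID, 0 FROM Person WHERE Person = 'Alice'
        UNION
        SELECT pf.FriendID, reach.depth + 1
        FROM reach JOIN PersonFriend pf ON pf.PersonID = reach.ID
        WHERE reach.depth < ?
    )
    SELECT p.Person, MIN(reach.depth) AS hops
    FROM reach JOIN Person p ON p.ID = reach.ID
    GROUP BY p.Person
    ORDER BY hops, p.Person
""", (MAX_DEPTH,)).fetchall()

print(rows)
```

The depth bound in the recursive step is what keeps the expansion finite on a cyclic friendship graph.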
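(2) Handling connected (relationship-centric) queries
Example 1: find the shortest path between two bank accounts through their transaction records
Example 2: find the shortest path between two social accounts through the messages they sent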
1,000,000 people, each with approximately 50 friends, query to a maximum depth of five
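Shortest-path questions like the two examples above map naturally onto breadth-first search over an adjacency list. The following is a minimal sketch with a depth cap; the `accounts` graph and the `shortest_path` name are hypothetical, and a production graph database would use its own traversal engine rather than this code.

```python
from collections import deque


def shortest_path(graph, start, goal, max_depth=5):
    """Breadth-first search; returns the shortest path as a list, or None."""
    if start == goal:
        return [start]
    parents = {start: None}
    frontier = deque([(start, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_depth:
            continue  # do not expand beyond the depth cap
        for neighbor in graph.get(node, ()):
            if neighbor in parents:
                continue  # already reached by a path that is no longer
            parents[neighbor] = node
            if neighbor == goal:
                # Walk the parent links back to the start.
                path = [goal]
                while parents[path[-1]] is not None:
                    path.append(parents[path[-1]])
                return path[::-1]
            frontier.append((neighbor, depth + 1))
    return None


# Hypothetical transfers: account -> accounts it has sent money to.
accounts = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": ["E"], "E": []}
print(shortest_path(accounts, "A", "E"))
print(shortest_path(accounts, "A", "E", max_depth=2))
```

Each hop only touches the neighbors of the nodes already reached, so the cost depends on the part of the graph explored rather than on the total number of stored relationships.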
(3) Knowledge graphs need richer relationship semantics and stronger relational reasoning capabilities
Relations are first-class citizens
(1) Native graph databases
Build indexes that exploit the structural features of the graph
(2) Benefits of graph data modeling
(3) Neo4j
MATCH (a:Person {name:'Jim'})-[:KNOWS]->(b)-[:KNOWS]->(c),
(a)-[:KNOWS]->(c)
RETURN b, c
SPARQL of W3C:
PREFIX : <http://example.org/> # hypothetical namespace, not given in the source
SELECT ?a ?b ?c WHERE {
?a :KNOWS ?b . ?b :KNOWS ?c .
?a :KNOWS ?c .
?a a :Person .
?a :name "Jim" .
}
Gremlin of Apache TinkerPop:
g.V().has("name","Jim").
out("KNOWS").
out("KNOWS").
values("name")
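For readers unfamiliar with these query languages, the pattern in the Cypher query above (a KNOWS b, b KNOWS c, and a also KNOWS c) can be spelled out as plain set operations. This is a sketch over a made-up `knows` adjacency map, not how any of the three engines evaluates the query.

```python
# Hypothetical directed KNOWS edges: person -> people they know.
knows = {
    "Jim": {"Ian", "Emil"},
    "Ian": {"Emil"},
    "Emil": {"Jim"},
}


def friends_in_triangle(knows, a):
    """Return sorted (b, c) pairs where a KNOWS b, b KNOWS c, and a KNOWS c."""
    direct = knows.get(a, set())
    return sorted(
        (b, c)
        for b in direct
        for c in knows.get(b, set())
        if c in direct
    )


pairs = friends_in_triangle(knows, "Jim")
print(pairs)
```

The declarative queries state only the shape of the subgraph; the nested loops here are one possible way to find it.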
(5) Cross-domain graph modeling and querying
CREATE (shakespeare:Author {firstname:'William', lastname:'Shakespeare'}),
(juliusCaesar:Play {title:'Julius Caesar'}),
(shakespeare)-[:WROTE_PLAY {year:1599}]->(juliusCaesar),
(theTempest:Play {title:'The Tempest'}),
(shakespeare)-[:WROTE_PLAY {year:1610}]->(theTempest),
(rsc:Company {name:'RSC'}),
(production1:Production {name:'Julius Caesar'}),
(rsc)-[:PRODUCED]->(production1),
(production1)-[:PRODUCTION_OF]->(juliusCaesar),
(performance1:Performance {date:20120729}),
(performance1)-[:PERFORMANCE_OF]->(production1),
(production2:Production {name:'The Tempest'}),
(rsc)-[:PRODUCED]->(production2),
(production2)-[:PRODUCTION_OF]->(theTempest),
(performance2:Performance {date:20061121}),
(performance2)-[:PERFORMANCE_OF]->(production2),
(performance3:Performance {date:20120730}),
(performance3)-[:PERFORMANCE_OF]->(production1),
(billy:User {name:'Billy'}),
(review:Review {rating:5, review:'This was awesome!'}),
(billy)-[:WROTE_REVIEW]->(review),
(review)-[:RATED]->(performance1),
(theatreRoyal:Venue {name:'Theatre Royal'}),
(performance1)-[:VENUE]->(theatreRoyal),
(performance2)-[:VENUE]->(theatreRoyal),
(performance3)-[:VENUE]->(theatreRoyal),
(greyStreet:Street {name:'Grey Street'}),
(theatreRoyal)-[:STREET]->(greyStreet),
(newcastle:City {name:'Newcastle'}),
(greyStreet)-[:CITY]->(newcastle),
(tyneAndWear:County {name:'Tyne and Wear'}),
(newcastle)-[:COUNTY]->(tyneAndWear),
(england:Country {name:'England'}),
(tyneAndWear)-[:COUNTRY]->(england),
(stratford:City {name:'Stratford upon Avon'}),
(stratford)-[:COUNTRY]->(england),
(rsc)-[:BASED_IN]->(stratford),
(shakespeare)-[:BORN_IN]->(stratford)
literary domain
theatrical domain
geospatial domain
(6) Cypher graph query examples
Finds all the Shakespeare performances at Newcastle’s Theatre Royal
MATCH (theater:Venue {name:'Theatre Royal'}),
(newcastle:City {name:'Newcastle'}),
(bard:Author {lastname:'Shakespeare'}),
(newcastle)<-[:STREET|CITY*1..2]-(theater)
<-[:VENUE]-()-[:PERFORMANCE_OF]->()
-[:PRODUCTION_OF]->(play)<-[w:WROTE_PLAY]-(bard)
WHERE w.year > 1608
RETURN DISTINCT play.title AS play
Identify potentially suspect behavior: retrieve all the emails that Bob has sent where he’s CC’d one of his own aliases.
MATCH (bob:User {username:'Bob'})-[:SENT]->(email)-[:CC]->(alias),
(alias)-[:ALIAS_OF]->(bob)
RETURN email.id
(1) Index-free adjacency
Nested Loop Join O(M * N):
FOR erow IN (select * from employees where X=Y) LOOP
FOR drow IN (select * from departments where erow is matched) LOOP
output values from erow and drow
END LOOP
END LOOP
Hash Join O(M + N):
FOR small_table_row IN (SELECT * FROM small_table) LOOP
slot_number := HASH(small_table_row.join_key);
INSERT_HASH_TABLE(slot_number,small_table_row);
END LOOP;
FOR large_table_row IN (SELECT * FROM large_table)
LOOP
slot_number := HASH(large_table_row.join_key);
small_table_row := LOOKUP_HASH_TABLE(slot_number,large_table_row.join_key);
IF small_table_row FOUND
THEN
output small_table_row + large_table_row;
END IF;
END LOOP;
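The hash join pseudocode above translates fairly directly into Python, with a dictionary playing the role of the hash table. This is a sketch under simple assumptions: both inputs are lists of dicts, the join is an inner equi-join on one column, and the `departments` and `employees` rows are invented sample data.

```python
from collections import defaultdict


def hash_join(small, large, key):
    """Inner-join two lists of dicts on `key`: build on `small`, probe with `large`."""
    buckets = defaultdict(list)
    for row in small:                       # build phase: one pass over M rows
        buckets[row[key]].append(row)
    joined = []
    for row in large:                       # probe phase: one pass over N rows
        for match in buckets.get(row[key], ()):
            joined.append({**match, **row})
    return joined


departments = [
    {"dept_id": 1, "dept": "R&D"},
    {"dept_id": 2, "dept": "Sales"},
]
employees = [
    {"emp": "Ann", "dept_id": 1},
    {"emp": "Ben", "dept_id": 2},
    {"emp": "Cid", "dept_id": 1},
    {"emp": "Dee", "dept_id": 3},   # no matching department, so dropped
]

result = hash_join(departments, employees, "dept_id")
for row in result:
    print(row)
```

Building on the smaller input keeps the hash table small; the O(M + N) figure assumes dictionary lookups take expected constant time.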
Merge Join O(NLog(N) + MLog(M)):
READ data_set_1 SORT BY JOIN KEY TO temp_ds1
READ data_set_2 SORT BY JOIN KEY TO temp_ds2
READ ds1_row FROM temp_ds1
READ ds2_row FROM temp_ds2
WHILE NOT eof ON temp_ds1,temp_ds2
LOOP
IF ( temp_ds1.key = temp_ds2.key ) THEN
OUTPUT JOIN ds1_row,ds2_row
READ ds2_row FROM temp_ds2 -- advance after a match; assumes the join key is unique in data_set_1
ELSIF ( temp_ds1.key < temp_ds2.key ) THEN READ ds1_row FROM temp_ds1
ELSE READ ds2_row FROM temp_ds2
END IF
END LOOP
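The merge join above can also be sketched in Python. Unlike the pseudocode, this version handles duplicate keys on both sides by emitting the cross product of each pair of matching runs; that extension, the function name, and the sample rows are choices made here for illustration.

```python
def merge_join(left, right, key):
    """Sort-merge inner join of two lists of dicts on `key`."""
    left = sorted(left, key=lambda r: r[key])      # O(N log N)
    right = sorted(right, key=lambda r: r[key])    # O(M log M)
    i = j = 0
    joined = []
    while i < len(left) and j < len(right):
        lk, rk = left[i][key], right[j][key]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the end of the run of right rows sharing this key.
            j_end = j
            while j_end < len(right) and right[j_end][key] == lk:
                j_end += 1
            # Pair every left row in the run with every right row in the run.
            while i < len(left) and left[i][key] == lk:
                for r in right[j:j_end]:
                    joined.append({**left[i], **r})
                i += 1
            j = j_end
    return joined


departments = [{"dept_id": 2, "dept": "Sales"}, {"dept_id": 1, "dept": "R&D"}]
employees = [
    {"emp": "Ann", "dept_id": 1},
    {"emp": "Ben", "dept_id": 2},
    {"emp": "Cid", "dept_id": 1},
    {"emp": "Dee", "dept_id": 3},
]

merged = merge_join(departments, employees, "dept_id")
for row in merged:
    print(row)
```

The sorting dominates the cost; the merge itself is a single pass over both sorted inputs when keys are mostly distinct.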
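Main points to optimize:
1. Looking up nodes and looking up relationships
2. From node to relationship, and from relationship to node
3. From relationship to relationship
4. From node to property, and from relationship to property
5. From relationship to relationship type.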
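The access paths listed above can be illustrated with a toy record store in which every node record points at its first relationship and every relationship record points at the next relationship of its start node and of its end node. This is a simplified sketch of the idea of index-free adjacency; the class names, fields, and layout are assumptions made here and do not reproduce the actual on-disk format of Neo4j or any other product.

```python
from dataclasses import dataclass

NONE = -1  # sentinel meaning "no record"


@dataclass
class Node:
    name: str
    first_rel: int = NONE      # node -> relationship


@dataclass
class Rel:
    start: int                 # relationship -> node
    end: int
    rel_type: str              # relationship -> relationship type
    start_next: int = NONE     # relationship -> next relationship of start node
    end_next: int = NONE       # relationship -> next relationship of end node


class Store:
    def __init__(self):
        self.nodes = []        # record id == list index, so lookup is by offset
        self.rels = []

    def add_node(self, name):
        self.nodes.append(Node(name))
        return len(self.nodes) - 1

    def add_rel(self, start, end, rel_type):
        rel_id = len(self.rels)
        # Push the new record onto the front of both nodes' chains.
        self.rels.append(Rel(start, end, rel_type,
                             start_next=self.nodes[start].first_rel,
                             end_next=self.nodes[end].first_rel))
        self.nodes[start].first_rel = rel_id
        self.nodes[end].first_rel = rel_id
        return rel_id

    def neighbors(self, node_id, rel_type):
        """Outgoing neighbors, found by walking the node's own chain only."""
        out = []
        rel_id = self.nodes[node_id].first_rel
        while rel_id != NONE:
            rel = self.rels[rel_id]
            if rel.start == node_id:
                if rel.rel_type == rel_type:
                    out.append(rel.end)
                rel_id = rel.start_next
            else:
                rel_id = rel.end_next
        return out


store = Store()
alice, bob, zach = (store.add_node(n) for n in ("Alice", "Bob", "Zach"))
store.add_rel(alice, bob, "KNOWS")
store.add_rel(bob, zach, "KNOWS")
store.add_rel(alice, zach, "KNOWS")
print([store.nodes[n].name for n in store.neighbors(alice, "KNOWS")])
```

Finding a node's neighbors costs time proportional to the number of relationships attached to that node, not to the total number of relationships in the store, which is the contrast with the join algorithms.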
(1) Inline and dynamic storage
(1) Choosing a knowledge graph storage solution