本文继续基于上一篇文章,深入研究基于图谱的各类算法,相比传统的关键词搜索,关系连接,全文检索等,基于知识图谱的算法将充分利用知识图谱的实体关系及其属性权重等信息,为大数据分析做支撑,使得数据分析和知识洞察更直观,可解释和快速解决应用诉求,并可快速落地实施。基于知识图谱的图算法主要有中心度,路径搜索、社区发现,Link检测、相似分析和图表示等,详见下图。
另外,在Neo4j 4.0之后,不再支持algo了,推出了Graph Data Science(即GDS),区别于algo算法库,GDS需要根据知识图谱先创建投影图形,投影图的目的是用一个自定义的名称存储在图形目录中,使用该名称,可以被库中的任何算法多次重复引用。这使得多个算法可以使用相同的投影图,而不必在每个算法运行时都对其进行投影。同时,本地投影通过从Neo4j存储文件中读取构建,提供了最佳的性能。建议在开发和生产阶段都使用。
如果大家觉得有帮助,欢迎关注并推荐,谢谢啦!
示例实验环境:
Neo4 5.1.0,Linux7.5,jdk19,plugin有APOC5.1.0、GDS2.2.5等。
中心度算法用于识别图中特定节点的角色及其对网络的影响。
######################################################
投影图脚本
1.创建投影图
单个节点,一种关系
CALL gds.graph.project(
'myGraph4',
'com_aaa',
'has_key',
{
relationshipProperties: 'cost'
}
)
两个节点,两种关系。
CALL gds.graph.project(
'myGraph4',
['com_aaa', 'result'],
['has_key', 'has_key']
)
2.查看投影图是否存在
CALL gds.graph.exists('myGraph4')
YIELD graphName, exists
RETURN graphName, exists
3.删除投影图
CALL gds.graph.drop('myGraph4') YIELD graphName;
#######################################################
1.创建投影图,如多个标签和关系的投影
单个节点,一种关系。
CALL gds.graph.project(
'myGraph4',
'com_aaa',
'has_key'
)
两个节点,两种关系。
CALL gds.graph.project(
'wellResult',
['com_aaa', 'result'],
['has_key', 'has_key']
)
指定一个特定的节点或起点,重点关注这个指定节点的中心度。PageRank算法最初是谷歌推出用来计算网页排名的,简单的说就是,指向这个节点的关系越多,那么这个节点就越重要。
MATCH (source:com_well {name: 'com_aaa'})
CALL gds.pageRank.stream('wellResult',{ maxIterations: 20, dampingFactor: 0.85, sourceNodes: [source] })
YIELD nodeId, score
RETURN gds.util.asNode(nodeId) LIMIT 100
问题:AAA节点的最短路径(蓝色节点),用于发现源节点和目标节点之间的最短通行路径。
MATCH (source:com_aaa{name: 'AAA'}), (target:result)
CALL gds.shortestPath.dijkstra.stream('wellResult', {
sourceNode: source,
targetNode: target
})
YIELD index, sourceNode, targetNode, totalCost, nodeIds, costs, path
RETURN
index,
gds.util.asNode(sourceNode).name AS sourceNodeName,
gds.util.asNode(targetNode).name AS targetNodeName,
totalCost,
[nodeId IN nodeIds | gds.util.asNode(nodeId).name] AS nodeNames,
costs,
nodes(path) as path
ORDER BY index
MATCH (source:com_aaa{name: 'xxx'}), (target:result)
CALL gds.allShortestPaths.delta.stream('wellResult', {
sourceNode: source,
delta: 3.0
})
YIELD index, sourceNode, targetNode, totalCost, nodeIds, costs, path
RETURN
index,
gds.util.asNode(sourceNode).name AS sourceNodeName,
gds.util.asNode(targetNode).name AS targetNodeName,
totalCost,
[nodeId IN nodeIds | gds.util.asNode(nodeId).name] AS nodeNames,
costs,
nodes(path) as path
ORDER BY index
5.深度优先算法(dfs)和广度优先算法(bfs)
快速遍历xxx节点的相关数据。(蓝色节点)
MATCH (source:com_aaa{name: 'xxx'})
CALL gds.dfs.stream('wellResult', {
sourceNode: source
})
YIELD path
RETURN path
6.随机遍历(RandomWalk)
指定节点
MATCH (page:com_aaa)
WHERE page.name IN ['aa-1-1']
WITH COLLECT(page) as sourceNodes
CALL gds.randomWalk.stream(
'wellResult',
{
sourceNodes: sourceNodes,
walkLength: 3,
walksPerNode: 1,
randomSeed: 42,
concurrency: 1
}
)
YIELD nodeIds, path
RETURN nodeIds, [node IN nodes(path) | node.name ] AS pages
MATCH (page:com_aaa)
WHERE page.name IN ['aaa']
WITH COLLECT(page) as sourceNodes
CALL gds.randomWalk.stream(
'miningResult',
{
sourceNodes: sourceNodes,
walkLength: 3,
walksPerNode: 1,
randomSeed: 42,
concurrency: 1
}
)
YIELD nodeIds, path
RETURN nodeIds, path limit 100
未指定节点
CALL gds.randomWalk.stream(
'wellResult',
{
walkLength: 3,
walksPerNode: 1,
randomSeed: 42,
concurrency: 1
}
)
YIELD nodeIds, path
RETURN nodeIds, path LIMIT 20
7.向量化(Embeding)
1).Node2Vec是一种节点嵌入算法,它基于图中的随机行走计算节点的向量表示。邻域是通过随机行走进行采样的。使用一些随机邻域样本,该算法训练了一个单隐层神经网络。该神经网络被训练为根据另一个节点的出现情况来预测一个节点在随机行走中出现的可能性。
2).FastRP快速随机投影,是随机投影算法家族中的一种节点嵌入算法。这些算法在理论上受到Johnsson-Lindenstrauss定理的支持,根据该定理,人们可以将任意维度的n个向量投影到O(log(n))维度,并且仍然近似地保留各点间的成对距离。事实上,一个以随机方式选择的线性投影满足这一特性。
3).GraphSAGE是一种用于计算节点嵌入的归纳算法。GraphSAGE是利用节点特征信息在未见过的节点或图上生成节点嵌入。该算法不是为每个节点训练单独的嵌入,而是学习一个函数,通过从节点的本地邻域采样和聚集特征来生成嵌入。
CALL gds.beta.node2vec.stream('wellResult', {embeddingDimension: 4})
YIELD nodeId, embedding
RETURN gds.util.asNode(nodeId).name, embedding
数据
MERGE (home:Page {name:"Home"})
MERGE (about:Page {name:"About"})
MERGE (product:Page {name:"Product"})
MERGE (links:Page {name:"Links"})
MERGE (a:Page {name:"Site A"})
MERGE (b:Page {name:"Site B"})
MERGE (c:Page {name:"Site C"})
MERGE (d:Page {name:"Site D"})
MERGE (home)-[:LINKS]->(about)
MERGE (about)-[:LINKS]->(home)
MERGE (product)-[:LINKS]->(home)
MERGE (home)-[:LINKS]->(product)
MERGE (links)-[:LINKS]->(home)
MERGE (home)-[:LINKS]->(links)
MERGE (links)-[:LINKS]->(a)
MERGE (a)-[:LINKS]->(home)
MERGE (links)-[:LINKS]->(b)
MERGE (b)-[:LINKS]->(home)
MERGE (links)-[:LINKS]->(c)
MERGE (c)-[:LINKS]->(home)
MERGE (links)-[:LINKS]->(d)
MERGE (d)-[:LINKS]->(home)
1.创建投影图
CALL gds.graph.project(
'myGraph',
'Page',
'LINKS',
{
relationshipProperties: 'weight'
}
)
2.计算运行该算法的成本:主要是内存占用成本等
CALL gds.pageRank.write.estimate('myGraph', {
writeProperty: 'pageRank',
maxIterations: 20,
dampingFactor: 0.85
})
YIELD nodeCount, relationshipCount, bytesMin, bytesMax, requiredMemory
指定一个特定的起点,重点关注这个指定点的中心度
MATCH (siteA:Page {name: 'Site A'})
CALL gds.pageRank.stream('myGraph')
YIELD nodeId, score
RETURN gds.util.asNode(nodeId), score
ORDER BY score DESC
4.RandomWalk
CALL gds.graph.project( 'myGraph2', 'Page', { LINKS: { orientation: 'UNDIRECTED' } } );
未指定源(sourceNodes)
CALL gds.randomWalk.stream(
'myGraph',
{
walkLength: 3,
walksPerNode: 1,
randomSeed: 42,
concurrency: 1
}
)
YIELD nodeIds, path
RETURN nodeIds, [node IN nodes(path) | node.name ] AS pages
指定源(sourceNodes)
MATCH (page:Page)
WHERE page.name IN ['Home', 'About']
WITH COLLECT(page) as sourceNodes
CALL gds.randomWalk.stream(
'myGraph',
{
sourceNodes: sourceNodes,
walkLength: 3,
walksPerNode: 1,
randomSeed: 42,
concurrency: 1
}
)
YIELD nodeIds, path
RETURN nodeIds, [node IN nodes(path) | node.name ] AS pages
统计成本
CALL gds.randomWalk.stats( 'myGraph', { walkLength: 3, walksPerNode: 1, randomSeed: 42, concurrency: 1 } )
数据
CREATE (a:Location {name: 'A'}),
(b:Location {name: 'B'}),
(c:Location {name: 'C'}),
(d:Location {name: 'D'}),
(e:Location {name: 'E'}),
(f:Location {name: 'F'}),
(a)-[:ROAD {cost: 50}]->(b),
(a)-[:ROAD {cost: 50}]->(c),
(a)-[:ROAD {cost: 100}]->(d),
(b)-[:ROAD {cost: 40}]->(d),
(c)-[:ROAD {cost: 40}]->(d),
(c)-[:ROAD {cost: 80}]->(e),
(d)-[:ROAD {cost: 30}]->(e),
(d)-[:ROAD {cost: 80}]->(f),
(e)-[:ROAD {cost: 40}]->(f);
1.创建投影图
CALL gds.graph.project(
'myGraph',
'Location',
'ROAD',
{
relationshipProperties: 'cost'
}
)
2.计算运行成本
MATCH (source:Location {name: 'A'}), (target:Location {name: 'F'})
CALL gds.shortestPath.dijkstra.write.estimate('myGraph', {
sourceNode: source,
targetNode: target,
relationshipWeightProperty: 'cost',
writeRelationshipType: 'PATH'
})
YIELD nodeCount, relationshipCount, bytesMin, bytesMax, requiredMemory
RETURN nodeCount, relationshipCount, bytesMin, bytesMax, requiredMemory
3.shortestPath算法
MATCH (source:Location {name: 'A'}), (target:Location {name: 'F'})
CALL gds.shortestPath.dijkstra.stream('myGraph', {
sourceNode: source,
targetNode: target,
relationshipWeightProperty: 'cost'
})
YIELD index, sourceNode, targetNode, totalCost, nodeIds, costs, path
RETURN
index,
gds.util.asNode(sourceNode).name AS sourceNodeName,
gds.util.asNode(targetNode).name AS targetNodeName,
totalCost,
[nodeId IN nodeIds | gds.util.asNode(nodeId).name] AS nodeNames,
costs,
nodes(path) as path
ORDER BY index
数据
CREATE (alice:People{name: 'Alice'})
CREATE (bob:People{name: 'Bob'})
CREATE (carol:People{name: 'Carol'})
CREATE (dave:People{name: 'Dave'})
CREATE (eve:People{name: 'Eve'})
CREATE (guitar:Instrument {name: 'Guitar'})
CREATE (synth:Instrument {name: 'Synthesizer'})
CREATE (bongos:Instrument {name: 'Bongos'})
CREATE (trumpet:Instrument {name: 'Trumpet'})
CREATE (alice)-[:LIKES]->(guitar)
CREATE (alice)-[:LIKES]->(synth)
CREATE (alice)-[:LIKES]->(bongos)
CREATE (bob)-[:LIKES]->(guitar)
CREATE (bob)-[:LIKES]->(synth)
CREATE (carol)-[:LIKES]->(bongos)
CREATE (dave)-[:LIKES]->(guitar)
CREATE (dave)-[:LIKES]->(synth)
CREATE (dave)-[:LIKES]->(bongos);
1.投影图
CALL gds.graph.project('myGraph2', ['People', 'Instrument'], 'LIKES');
2.Node2Vec向量化
CALL gds.beta.node2vec.stream('myGraph2', {embeddingDimension: 4})
YIELD nodeId, embedding
RETURN gds.util.asNode(nodeId).name, embedding