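LPA stands for Label Propagation Algorithm, a non-overlapping community detection algorithm based on label propagation. LPA can be used to cluster a user base, which in turn supports user profiling.
In the early stage of a recommender system, when labeled samples are far fewer than unlabeled ones, traditional supervised learning is not a good fit. Semi-supervised learning can be used instead: the limited set of labels is propagated to the unlabeled data.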
Algorithm steps
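In its commonly described form, label propagation gives every node its own label, then repeatedly lets each node adopt the label that is most frequent among its neighbours, stopping once no node changes. The block below is a minimal sketch of that idea using only the standard library. The function name `label_propagation`, the adjacency-dict input, the random tie-breaking, and the `max_iter` and `seed` parameters are all assumptions made for illustration; it is not the implementation used by networkx or igraph.

```python
import random
from collections import Counter


def label_propagation(adj, max_iter=100, seed=0):
    """Asynchronous label propagation on an undirected graph.

    adj maps each node to a list of its neighbours (hypothetical input format).
    Returns a list of sets, one set of nodes per community.
    """
    rng = random.Random(seed)
    labels = {node: node for node in adj}  # step 1: every node gets a unique label
    nodes = list(adj)
    for _ in range(max_iter):
        rng.shuffle(nodes)  # step 2: visit nodes in random order
        changed = False
        for node in nodes:
            if not adj[node]:
                continue  # isolated nodes keep their own label
            counts = Counter(labels[nb] for nb in adj[node])
            best = max(counts.values())
            candidates = [lab for lab, c in counts.items() if c == best]
            if labels[node] in candidates:
                continue  # current label is already a most frequent one
            labels[node] = rng.choice(candidates)  # step 3: adopt it, ties broken at random
            changed = True
        if not changed:
            break  # step 4: stop when a full pass changes nothing
    groups = {}
    for node, lab in labels.items():
        groups.setdefault(lab, set()).add(node)
    return list(groups.values())


# Toy graph: two triangles joined by the single edge 2-3.
toy = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
print(label_propagation(toy))
```

Because node order and tie-breaking are random, different seeds can yield different partitions on the same graph, which is why the number of communities is not known in advance.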
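Some commonly used community detection datasets
GML is a graph description language.
Taking the football.gml dataset as an example, the graph consists of two parts: node and edge.
In the node part, id is the unique identifier, label is the node's label (here, the team name), and value is the conference the team belongs to, which is a static attribute.
Each entry in the edge part holds the ids of its two endpoint nodes (source and target).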
graph
[
directed 0
node
[
id 0
label "BrighamYoung"
value 7
]
node
[
id 1
label "FloridaState"
value 0
]
edge
[
source 47
target 2
]
edge
[
source 47
target 39
]
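To make the node and edge fields concrete, the following sketch parses a tiny GML string with `networkx.parse_gml` and reads the attributes back. The two nodes are copied from the excerpt above; the single edge between ids 0 and 1 is an assumption added only so the toy graph has an edge. By default networkx uses the `label` field as the node key.

```python
import networkx as nx

# A two-node GML document; the edge is illustrative, not taken from football.gml.
gml_text = """
graph
[
  directed 0
  node
  [
    id 0
    label "BrighamYoung"
    value 7
  ]
  node
  [
    id 1
    label "FloridaState"
    value 0
  ]
  edge
  [
    source 0
    target 1
  ]
]
"""

toy_graph = nx.parse_gml(gml_text)  # nodes are keyed by their label

for name, attrs in toy_graph.nodes(data=True):
    # value is the conference id stored on each node
    print(name, attrs["value"])

print(list(toy_graph.edges()))
```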
import networkx as nx
from networkx.algorithms import community
import matplotlib.pyplot as plt
# Load the data
G = nx.read_gml('./football.gml')
# Visualize the graph
nx.draw(G,with_labels=True)
plt.show()
# Community detection
communities = list(community.label_propagation_communities(G))
print(communities)
print(len(communities))
The result has 7 communities (the number of communities is not fixed in advance and can vary)
[{'Purdue', 'Toledo', 'Kent', 'Minnesota', 'NorthernIllinois', 'Connecticut', 'Illinois', 'Northwestern', 'MiamiOhio', 'PennState', 'WesternMichigan', 'BowlingGreenState', 'Michigan', 'Marshall', 'Indiana', 'CentralMichigan', 'Akron', 'Iowa', 'Buffalo', 'MichiganState', 'Ohio', 'BallState', 'EasternMichigan', 'Wisconsin', 'OhioState'}, {'Baylor', 'KansasState', 'Kansas', 'Nebraska', 'TexasTech', 'IowaState', 'Colorado', 'Texas', 'Missouri', 'OklahomaState', 'Oklahoma', 'TexasA&M'}, {'Mississippi', 'Florida', 'Tennessee', 'Vanderbilt', 'Alabama', 'Auburn', 'LouisianaState', 'Arkansas', 'Georgia', 'MississippiState', 'Kentucky', 'SouthCarolina'}, {'BoiseState', 'ArkansasState', 'Idaho', 'UtahState', 'SanDiegoState', 'NevadaLasVegas', 'NewMexico', 'NorthTexas', 'BrighamYoung', 'NewMexicoState', 'AirForce', 'ColoradoState', 'Wyoming', 'Utah'}, {'Washington', 'OregonState', 'UCLA', 'WashingtonState', 'California', 'ArizonaState', 'Stanford', 'Oregon', 'Arizona', 'SouthernCalifornia'}, {'LouisianaTech', 'CentralFlorida', 'NorthCarolinaState', 'Temple', 'VirginiaTech', 'NotreDame', 'Houston', 'LouisianaLafayette', 'AlabamaBirmingham', 'FloridaState', 'Cincinnati', 'Clemson', 'SouthernMississippi', 'GeorgiaTech', 'Army', 'Tulane', 'BostonCollege', 'EastCarolina', 'Pittsburgh', 'Louisville', 'MiamiFlorida', 'MiddleTennesseeState', 'Syracuse', 'Virginia', 'LouisianaMonroe', 'NorthCarolina', 'Duke', 'Memphis', 'Navy', 'WestVirginia', 'WakeForest', 'Rutgers', 'Maryland'}, {'FresnoState', 'Rice', 'TexasChristian', 'SouthernMethodist', 'Tulsa', 'Nevada', 'Hawaii', 'SanJoseState', 'TexasElPaso'}]
7
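One way to see how the community count can vary is to run a randomized variant with several seeds. The sketch below uses networkx's asynchronous variant, `asyn_lpa_communities`, which accepts a `seed` argument. It runs on the built-in karate club graph instead of football.gml so that it is self-contained; that substitution and the seed values are assumptions made for illustration.

```python
import networkx as nx
from networkx.algorithms import community

# Built-in example graph, used here in place of football.gml.
karate = nx.karate_club_graph()

results = {}
for seed in (0, 1, 2):
    # Each seed fixes the random node order and tie-breaking of one run.
    parts = list(community.asyn_lpa_communities(karate, seed=seed))
    results[seed] = parts
    print(seed, len(parts))
```

Each run still produces a valid partition of the nodes; only the number and shape of the communities may change.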
igraph can handle networks with millions of nodes and is more powerful than networkx in that respect.
import igraph
g = igraph.Graph.Read_GML('./football.gml')
igraph.plot(g)
print(g.community_label_propagation())
The result has 10 communities
Clustering with 115 elements and 10 clusters
[ 0] 0, 4, 9, 16, 23, 41, 93, 104
[ 1] 1, 25, 33, 37, 45, 89, 103, 105, 109
[ 2] 2, 6, 13, 15, 32, 39, 47, 60, 64, 100, 106
[ 3] 3, 5, 10, 40, 52, 72, 74, 81, 84, 98, 102, 107
[ 4] 7, 8, 21, 22, 51, 68, 77, 78, 108, 111
[ 5] 11, 24, 28, 50, 69, 90
[ 6] 12, 14, 18, 26, 31, 34, 38, 42, 43, 54, 61, 71, 85, 99
[ 7] 17, 20, 27, 36, 44, 48, 56, 57, 58, 59, 62, 63, 65, 66, 70, 75, 76, 86,
87, 91, 92, 95, 96, 97, 112, 113
[ 8] 19, 29, 30, 35, 55, 79, 80, 82, 94, 101
[ 9] 46, 49, 53, 67, 73, 83, 88, 110, 114
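The igraph output lists vertex indices, not team names. A small helper can translate a membership list back into named groups. The helper `group_by_membership` below is hypothetical and uses only the standard library. With igraph, the inputs would typically be the `membership` attribute of the returned clustering and `g.vs["label"]`. The sample data covers only vertices 0 and 1, whose names appear in the GML excerpt and whose clusters appear in the output above.

```python
def group_by_membership(membership, names):
    """Group vertex names by cluster id.

    membership[i] is the cluster id of vertex i; names[i] is its label.
    """
    groups = {}
    for name, cluster_id in zip(names, membership):
        groups.setdefault(cluster_id, []).append(name)
    return groups


# Vertices 0 and 1 only: vertex 0 is in cluster 0, vertex 1 is in cluster 1.
sample_membership = [0, 1]
sample_names = ["BrighamYoung", "FloridaState"]
named = group_by_membership(sample_membership, sample_names)
print(named)
```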