Goal
Build a movie recommendation engine using the MovieLens movie dataset, the graph database Neo4j, and the graph traversal language Gremlin.
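The MovieLens Dataset
The GroupLens research group has published a corpus of movie ratings. The dataset comes in three sizes: 100 thousand, 1 million, and 10 million ratings. This article uses the 1 million version, which can be downloaded from GroupLens. The raw data consists of three files: users.dat, movies.dat, and ratings.dat. See README.txt for a description of each file.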
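Apache Gremlin
The Gremlin Console is available from the Apache TinkerPop download page. After downloading, unpack the archive apache-tinkerpop-gremlin-console-x.x.x-bin.zip, change into the bin directory, and run gremlin.sh to start the Gremlin Console. After a successful start the console looks like this:
[Figure: Gremlin Console after a successful start]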
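Creating the Movie Ratings Graph
Next, the raw dataset is parsed and the parsed data is inserted into Neo4j to build the movie ratings graph. In this graph, User vertices are linked to Movie vertices by rated edges and to Occupation vertices by hasOccupation edges, and Movie vertices are linked to Genera vertices by hasGenera edges. The parsing code below can be copied into the Gremlin Console and run as-is.
Creating the Neo4j database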
g = new Neo4jGraph('/tmp/movieRatings')
g.dropIndex("edges")
g.setMaxBufferSize(1000)
occupations = [0:'other', 1:'academic/educator', 2:'artist',
3:'clerical/admin', 4:'college/grad student', 5:'customer service',
6:'doctor/health care', 7:'executive/managerial', 8:'farmer',
9:'homemaker', 10:'K-12 student', 11:'lawyer', 12:'programmer',
13:'retired', 14:'sales/marketing', 15:'scientist', 16:'self-employed',
17:'technician/engineer', 18:'tradesman/craftsman', 19:'unemployed', 20:'writer']
Parsing movies.dat
The movies.dat file contains the movie records. Each line has 3 fields: movie id, title, and genres, as shown below:
marko$ more -n6 movies.dat
1::Toy Story (1995)::Animation|Children's|Comedy
2::Jumanji (1995)::Adventure|Children's|Fantasy
3::Grumpier Old Men (1995)::Comedy|Romance
4::Waiting to Exhale (1995)::Comedy|Drama
5::Father of the Bride Part II (1995)::Comedy
The parsing code is shown below and can be copied directly into the Gremlin Console:
// Each line has the form movieId::title::genre1|genre2|...
new File('movies.dat').eachLine { def line ->
  def components = line.split('::');
  def movieVertex = g.addVertex(['type':'Movie', 'movieId':components[0].toInteger(), 'title':components[1]]);
  // Reuse the genre vertex if the index already has one; otherwise create it.
  components[2].split('\\|').each { def genera ->
    def hits = g.idx(T.v)[[genera:genera]].iterator();
    def generaVertex = hits.hasNext() ? hits.next() : g.addVertex(['type':'Genera', 'genera':genera]);
    g.addEdge(movieVertex, generaVertex, 'hasGenera');
  }
}
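To make the parsing logic easier to follow outside the Gremlin Console, here is a minimal sketch in plain Python. It is not part of the original walkthrough and does not touch Neo4j. It splits each record on `::`, splits the genres on `|`, and mimics the "look up the genre vertex, create it if missing" step with a dictionary. The names `parse_movie_line` and `link_movies_to_genres` are hypothetical.

```python
SAMPLE_MOVIES = [
    "1::Toy Story (1995)::Animation|Children's|Comedy",
    "2::Jumanji (1995)::Adventure|Children's|Fantasy",
    "3::Grumpier Old Men (1995)::Comedy|Romance",
    "4::Waiting to Exhale (1995)::Comedy|Drama",
    "5::Father of the Bride Part II (1995)::Comedy",
]


def parse_movie_line(line):
    """Split one movies.dat record into (movie_id, title, genres)."""
    movie_id, title, genres = line.rstrip("\n").split("::")
    return int(movie_id), title, genres.split("|")


def link_movies_to_genres(lines):
    """Return (genre_vertices, edges) for the given movies.dat lines.

    genre_vertices maps each distinct genre to a made-up vertex id and plays
    the role of the index lookup; edges holds one (movie_id, genre_vertex)
    pair per hasGenera edge.
    """
    genre_vertices = {}
    edges = []
    for line in lines:
        movie_id, _title, genres = parse_movie_line(line)
        for genre in genres:
            # Get the existing vertex id, or assign the next free one.
            vertex = genre_vertices.setdefault(genre, len(genre_vertices))
            edges.append((movie_id, vertex))
    return genre_vertices, edges


if __name__ == "__main__":
    vertices, edges = link_movies_to_genres(SAMPLE_MOVIES)
    print(len(vertices), "genre vertices,", len(edges), "hasGenera edges")
```

Each genre gets exactly one vertex no matter how many movies share it, which is why the full graph ends up with far fewer Genera vertices than hasGenera edges.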
Parsing users.dat
The users.dat file contains the user records. Each line has 5 fields: user id, gender, age, occupation, and zip code, as shown below:
ml-1m$ more -n6 users.dat
1::F::1::10::48067
2::M::56::16::70072
3::M::25::15::55117
4::M::45::7::02460
5::M::25::20::55455
The parsing code is shown below:
// Each line has the form userId::gender::age::occupationCode::zipCode
new File('users.dat').eachLine { def line ->
  def components = line.split('::');
  // The zip code (components[4]) is not stored in the graph.
  def userVertex = g.addVertex(['type':'User', 'userId':components[0].toInteger(), 'gender':components[1], 'age':components[2].toInteger()]);
  // Translate the numeric occupation code into its name.
  def occupation = occupations[components[3].toInteger()];
  def hits = g.idx(T.v)[[occupation:occupation]].iterator();
  def occupationVertex = hits.hasNext() ? hits.next() : g.addVertex(['type':'Occupation', 'occupation':occupation]);
  g.addEdge(userVertex, occupationVertex, 'hasOccupation');
}
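The same record layout can be illustrated with a small plain-Python sketch (an illustration only, separate from the Gremlin code). It maps the numeric occupation code to its name using the same table as the `occupations` map defined earlier. The function name `parse_user_line` is hypothetical, and unlike the Gremlin code this sketch also keeps the zip code, as a string so that leading zeros survive.

```python
OCCUPATIONS = {
    0: "other", 1: "academic/educator", 2: "artist",
    3: "clerical/admin", 4: "college/grad student", 5: "customer service",
    6: "doctor/health care", 7: "executive/managerial", 8: "farmer",
    9: "homemaker", 10: "K-12 student", 11: "lawyer", 12: "programmer",
    13: "retired", 14: "sales/marketing", 15: "scientist", 16: "self-employed",
    17: "technician/engineer", 18: "tradesman/craftsman", 19: "unemployed",
    20: "writer",
}


def parse_user_line(line):
    """Split one users.dat record into a dictionary of user properties."""
    user_id, gender, age, occupation, zip_code = line.rstrip("\n").split("::")
    return {
        "userId": int(user_id),
        "gender": gender,
        "age": int(age),
        "occupation": OCCUPATIONS[int(occupation)],
        "zip": zip_code,  # kept as text: "02460" must not become 2460
    }


if __name__ == "__main__":
    print(parse_user_line("4::M::45::7::02460"))
```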
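Parsing ratings.dat
The ratings.dat file contains the users' ratings of movies. Each line has 4 fields: user id, movie id, star rating (1-5 stars), and a timestamp (not needed here), as shown below: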
ml-1m$ more -n6 ratings.dat
1::1193::5::978300760
1::661::3::978302109
1::914::3::978301968
1::3408::4::978300275
1::2355::5::978824291
The parsing code is shown below:
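// Each line has the form userId::movieId::stars::timestamp
new File('ratings.dat').eachLine { def line ->
  def components = line.split('::');
  // Look up the user and movie vertices through the index and connect them.
  def ratedEdge = g.addEdge(g.idx(T.v)[[userId:components[0].toInteger()]].next(), g.idx(T.v)[[movieId:components[1].toInteger()]].next(), 'rated');
  // The timestamp (components[3]) is ignored.
  ratedEdge.setProperty('stars', components[2].toInteger());
}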
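As a rough plain-Python illustration of the same step (not the original Gremlin code), the sketch below parses rating records, drops the timestamp, and collects each user's ratings in a nested dictionary that stands in for the `rated` edges and their `stars` property. The names `parse_rating_line` and `ratings_by_user` are hypothetical.

```python
from collections import defaultdict

SAMPLE_RATINGS = [
    "1::1193::5::978300760",
    "1::661::3::978302109",
    "1::914::3::978301968",
    "1::3408::4::978300275",
    "1::2355::5::978824291",
]


def parse_rating_line(line):
    """Split one ratings.dat record into (user_id, movie_id, stars)."""
    user_id, movie_id, stars, _timestamp = line.rstrip("\n").split("::")
    return int(user_id), int(movie_id), int(stars)


def ratings_by_user(lines):
    """Map each user id to a {movie_id: stars} dictionary."""
    rated = defaultdict(dict)
    for line in lines:
        user_id, movie_id, stars = parse_rating_line(line)
        rated[user_id][movie_id] = stars
    return rated


if __name__ == "__main__":
    print(dict(ratings_by_user(SAMPLE_RATINGS)))
```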
Note:
To make sure the buffered data is fully persisted, run the following before closing the Gremlin Console:
g.stopTransaction(TransactionalGraph.Conclusion.SUCCESS)
Before running any recommendation algorithm, check the loaded data with the following queries:
gremlin> g.V.count()
==>9962
gremlin> g.E.count()
==>1012657
gremlin> g.V[[type:'Movie']].count()
==>3883
gremlin> g.V[[type:'Genera']].count()
==>18
gremlin> g.V[[type:'User']].count()
==>6040
gremlin> g.V[[type:'Occupation']].count()
==>21
gremlin> occupations.size()
==>21
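The counts above can be cross-checked with a little arithmetic, sketched here in plain Python. The vertex total should equal the sum of the four vertex types, and the edge total should break down into rated, hasOccupation, and hasGenera edges. One figure is an assumption not stated in this article: the number of ratings in the 1M dataset (1,000,209, as given in the dataset's README.txt).

```python
# Counts reported by the Gremlin Console in the checks above.
vertex_counts = {"Movie": 3883, "Genera": 18, "User": 6040, "Occupation": 21}
reported_vertices = 9962
reported_edges = 1012657

# Assumption: number of ratings in the 1M dataset, taken from its README.txt.
rated_edges = 1000209

# Every user has exactly one hasOccupation edge.
has_occupation_edges = vertex_counts["User"]

# Whatever remains must be hasGenera edges.
has_genera_edges = reported_edges - rated_edges - has_occupation_edges

print(sum(vertex_counts.values()) == reported_vertices)
print(has_genera_edges)
```

Under that assumption the remainder is 6408 hasGenera edges, roughly 1.65 genres per movie.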
The following code shows how the users are distributed across occupations:
gremlin> m = [:]
gremlin> g.V[[type:'User']].out('hasOccupation').occupation.groupCount(m) >> -1
==>null
gremlin> m.sort{a,b -> b.value <=> a.value}
==>college/grad student=759
==>other=711
==>executive/managerial=679
==>academic/educator=528
==>technician/engineer=502
==>programmer=388
==>sales/marketing=302
==>writer=281
==>artist=267
==>self-employed=241
==>doctor/health care=236
==>K-12 student=195
==>clerical/admin=173
==>scientist=144
==>retired=142
==>lawyer=129
==>customer service=112
==>homemaker=92
==>unemployed=72
==>tradesman/craftsman=70
==>farmer=17
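The group-count-then-sort pattern used above can be sketched in plain Python with `collections.Counter`. This is an illustration of the idea, not the Gremlin implementation, and the function name `group_count` is hypothetical. The sketch also checks that the published occupation counts add up to the number of users.

```python
from collections import Counter

# Occupation counts exactly as printed by the Gremlin Console above.
PUBLISHED_COUNTS = {
    "college/grad student": 759, "other": 711, "executive/managerial": 679,
    "academic/educator": 528, "technician/engineer": 502, "programmer": 388,
    "sales/marketing": 302, "writer": 281, "artist": 267,
    "self-employed": 241, "doctor/health care": 236, "K-12 student": 195,
    "clerical/admin": 173, "scientist": 144, "retired": 142, "lawyer": 129,
    "customer service": 112, "homemaker": 92, "unemployed": 72,
    "tradesman/craftsman": 70, "farmer": 17,
}


def group_count(values):
    """Count occurrences and return (value, count) pairs, largest count first."""
    counts = Counter(values)
    return sorted(counts.items(), key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    sample = ["programmer", "artist", "programmer", "writer", "programmer", "artist"]
    print(group_count(sample))
    # Each user has exactly one occupation, so the counts must total 6040 users.
    print(sum(PUBLISHED_COUNTS.values()))
```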
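The following code computes the average age of the users. Per the dataset's README.txt, the age field in the 1M dataset is an age-bracket code rather than an exact age (for example, 25 stands for 25-34), so the result is only a rough indication: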
gremlin> g.V[[type:'User']].age.mean()
==>30.639238410596025