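Apache Gremlin Movie Recommendation Graph Data

Purpose

Build a movie recommendation engine using the MovieLens movie dataset, the graph database Neo4j, and the graph traversal language Gremlin.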

Figure: Apache Gremlin movie recommendation

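The MovieRatings dataset

The GroupLens research group has published a corpus of movie ratings. The dataset comes in three versions: 100 thousand, 1 million, and 10 million ratings. This article uses the 1 million version, which is available for download from GroupLens. The raw data consists of three files: users.dat, movies.dat, and ratings.dat. Each file is described in README.txt.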

Apache Gremlin

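Download Gremlin Console from its download page and unpack the archive apache-tinkerpop-gremlin-console-x.x.x-bin.zip. Then change into the bin directory and run gremlin.sh to start Gremlin Console. After a successful start, the console looks like this:

Figure: Gremlin Console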

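Creating the movie ratings graph

Figure: Structure of the movie ratings graph

The raw dataset is parsed below and inserted into Neo4j following the structure in the figure above, which builds the movie ratings graph. The parsing code can be pasted directly into Gremlin Console. Note that it uses the older Blueprints-era Gremlin syntax (for example g.idx(T.v) and g.V[[type:'Movie']]), so it needs a Gremlin release that still supports that syntax.

Creating the Neo4j database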


g = new Neo4jGraph('/tmp/movieRatings')
g.dropIndex("edges")
g.setMaxBufferSize(1000)

// occupation codes used in users.dat, mapped to readable names
occupations = [0:'other', 1:'academic/educator', 2:'artist',
  3:'clerical/admin', 4:'college/grad student', 5:'customer service',
  6:'doctor/health care', 7:'executive/managerial', 8:'farmer',
  9:'homemaker', 10:'K-12 student', 11:'lawyer', 12:'programmer',
  13:'retired', 14:'sales/marketing', 15:'scientist', 16:'self-employed',
  17:'technician/engineer', 18:'tradesman/craftsman', 19:'unemployed', 20:'writer']

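Parsing movies.dat

The movies.dat file holds the movie records. Each line has 3 columns separated by "::": movie id, title, and genres (multiple genres are separated by "|"), as shown here: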

marko$ more -n6 movies.dat 
1::Toy Story (1995)::Animation|Children's|Comedy
2::Jumanji (1995)::Adventure|Children's|Fantasy
3::Grumpier Old Men (1995)::Comedy|Romance
4::Waiting to Exhale (1995)::Comedy|Drama
5::Father of the Bride Part II (1995)::Comedy

The parsing code below can be pasted directly into Gremlin Console:

new File('movies.dat').eachLine {def line ->
  def components = line.split('::');
  // one Movie vertex per line
  def movieVertex = g.addVertex(['type':'Movie', 'movieId':components[0].toInteger(), 'title':components[1]]);
  components[2].split('\\|').each { def genera ->
    // reuse the Genera vertex if the index already has one, otherwise create it
    def hits = g.idx(T.v)[[genera:genera]].iterator();
    def generaVertex = hits.hasNext() ? hits.next() : g.addVertex(['type':'Genera', 'genera':genera]);
    g.addEdge(movieVertex, generaVertex, 'hasGenera');
  }
}
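
To see what the two split calls above produce without touching a graph, here is a minimal sketch in plain Groovy. The helper name parseMovie is hypothetical and not part of Gremlin or of the loader above; it only mirrors the string handling.

```groovy
// Sketch only: parse one movies.dat line with plain Groovy, no graph required.
// parseMovie is a hypothetical helper used for illustration.
def parseMovie(String line) {
    def components = line.split('::')
    return [
        movieId: components[0].toInteger(),
        title  : components[1],
        // '|' is a regex metacharacter, so it has to be escaped for split()
        genera : components[2].split('\\|').toList()
    ]
}

def movie = parseMovie("1::Toy Story (1995)::Animation|Children's|Comedy")
println movie
```

Each entry of the genera list corresponds to one hasGenera edge in the loader.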

Parsing users.dat

The users.dat file holds the user records. Each line has 5 columns: user id, gender, age, occupation, and zip code, as shown here:

ml-1m$ more -n6 users.dat 
1::F::1::10::48067
2::M::56::16::70072
3::M::25::15::55117
4::M::45::7::02460
5::M::25::20::55455

The parsing code is as follows:

new File('users.dat').eachLine {def line ->
  def components = line.split('::');
  // one User vertex per line; the zip code column is not stored
  def userVertex = g.addVertex(['type':'User', 'userId':components[0].toInteger(), 'gender':components[1], 'age':components[2].toInteger()]);
  // translate the numeric occupation code into its name
  def occupation = occupations[components[3].toInteger()];
  // reuse the Occupation vertex if the index already has one, otherwise create it
  def hits = g.idx(T.v)[[occupation:occupation]].iterator();
  def occupationVertex = hits.hasNext() ? hits.next() : g.addVertex(['type':'Occupation', 'occupation':occupation]);
  g.addEdge(userVertex, occupationVertex, 'hasOccupation');
}
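
The "look up, otherwise create" step is what keeps the graph at one vertex per occupation rather than one per user. A minimal sketch of that pattern follows, with a plain Map standing in for the vertex index. The names occupationVertices and getOrCreate are hypothetical, and the sixth list entry is an invented duplicate added to exercise the reuse path.

```groovy
// Sketch only: a Map plays the role of the graph's vertex index.
def occupationVertices = [:]   // occupation name -> stand-in "vertex" (a Map)
def created = 0

def getOrCreate = { String occupation ->
    if (!occupationVertices.containsKey(occupation)) {
        // first time this occupation is seen: create its stand-in vertex
        occupationVertices[occupation] = [type: 'Occupation', occupation: occupation]
        created++
    }
    return occupationVertices[occupation]
}

// occupations of the five sample users (codes 10, 16, 15, 7, 20), plus one repeat
['K-12 student', 'self-employed', 'scientist',
 'executive/managerial', 'writer', 'scientist'].each { getOrCreate(it) }

println "${created} occupation vertices created for 6 users"
```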

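Parsing ratings.dat

The ratings.dat file holds the ratings that users gave to movies. Each line has 4 columns: user id, movie id, star rating (1 to 5 stars), and timestamp (not needed here), as shown here: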

ml-1m$ more -n6 ratings.dat 
1::1193::5::978300760
1::661::3::978302109
1::914::3::978301968
1::3408::4::978300275
1::2355::5::978824291

The parsing code is as follows:

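new File('ratings.dat').eachLine {def line ->
  def components = line.split('::');
  // look up the existing User and Movie vertices by id and connect them with a 'rated' edge
  def ratedEdge = g.addEdge(g.idx(T.v)[[userId:components[0].toInteger()]].next(), g.idx(T.v)[[movieId:components[1].toInteger()]].next(), 'rated');
  // the star rating is stored on the edge; the timestamp column is ignored
  ratedEdge.setProperty('stars', components[2].toInteger());
}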
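
As a rough illustration of the column handling above, the sketch below parses a single ratings.dat line in plain Groovy and rejects star values outside 1 to 5. The helper parseRating is hypothetical, and the range check is an addition for illustration that the loader above does not perform.

```groovy
// Sketch only: parseRating is a hypothetical helper, not part of the loader.
def parseRating(String line) {
    def components = line.split('::')
    def stars = components[2].toInteger()
    // ratings.dat uses whole stars from 1 to 5; fail fast on anything else
    if (!(1..5).contains(stars)) {
        throw new IllegalArgumentException("unexpected star value in: " + line)
    }
    // components[3] is the timestamp, which the graph does not use
    return [userId: components[0].toInteger(),
            movieId: components[1].toInteger(),
            stars: stars]
}

def rating = parseRating('1::1193::5::978300760')
println rating
```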

Note: to make sure the buffered data is fully persisted, run the following before closing Gremlin Console:

g.stopTransaction(TransactionalGraph.Conclusion.SUCCESS)

Before running the recommendation algorithm, check the data with the following:

gremlin> g.V.count()
==>9962
gremlin> g.E.count()
==>1012657
gremlin> g.V[[type:'Movie']].count()
==>3883
gremlin> g.V[[type:'Genera']].count()
==>18
gremlin> g.V[[type:'User']].count()  
==>6040
gremlin> g.V[[type:'Occupation']].count()
==>21
gremlin> occupations.size()
==>21
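
These counts can be cross-checked with a little arithmetic. The sketch below uses only the numbers printed above, plus one outside figure: the 1,000,209 ratings stated in the dataset's README.txt. The resulting hasGenera edge count is implied by subtraction and is not printed anywhere in the console output, so treat it as a derived estimate.

```groovy
// Sketch only: sanity arithmetic on the counts shown above.
def vertexCounts = [Movie: 3883, Genera: 18, User: 6040, Occupation: 21]
def totalVertices = vertexCounts.values().sum()
println "vertices: ${totalVertices}"          // should match g.V.count()

def totalEdges = 1012657                      // g.E.count() from the console output
def ratedEdges = 1000209                      // assumption: rating count from README.txt
def hasOccupationEdges = vertexCounts.User    // one hasOccupation edge per user
def hasGeneraEdges = totalEdges - ratedEdges - hasOccupationEdges
println "implied hasGenera edges: ${hasGeneraEdges}"
```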

The following shows the distribution of occupations among users:

gremlin> m = [:]           
gremlin> g.V[[type:'User']].out('hasOccupation').occupation.groupCount(m) >> -1
==>null
gremlin> m.sort{a,b -> b.value <=> a.value}
==>college/grad student=759
==>other=711
==>executive/managerial=679
==>academic/educator=528
==>technician/engineer=502
==>programmer=388
==>sales/marketing=302
==>writer=281
==>artist=267
==>self-employed=241
==>doctor/health care=236
==>K-12 student=195
==>clerical/admin=173
==>scientist=144
==>retired=142
==>lawyer=129
==>customer service=112
==>homemaker=92
==>unemployed=72
==>tradesman/craftsman=70
==>farmer=17
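
For readers unfamiliar with the groupCount-then-sort idiom, here is a minimal plain-Groovy sketch of the same idea on a toy list. The toy input is invented for illustration. The second part sums the counts printed above; since every user has exactly one hasOccupation edge, the total should equal the number of users.

```groovy
// Toy input (hypothetical): one occupation name per user
def seen = ['writer', 'scientist', 'writer', 'farmer', 'writer', 'scientist']
def counts = seen.countBy { it }                           // name -> number of users
def ranked = counts.sort { a, b -> b.value <=> a.value }   // most common first
ranked.each { name, n -> println "${name}=${n}" }

// Counts copied from the console output above
def published = [759, 711, 679, 528, 502, 388, 302, 281, 267, 241, 236,
                 195, 173, 144, 142, 129, 112, 92, 72, 70, 17]
def publishedTotal = published.sum()
println "total users counted: ${publishedTotal}"
```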

The following computes the mean user age:

gremlin> g.V[[type:'User']].age.mean()                                         
==>30.639238410596025
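
One caveat when reading this number: according to the dataset's README.txt, the age column holds bracket codes (1 for under 18, 18 for 18-24, 25 for 25-34, and so on up to 56 for 56+), not exact ages, so the result is the mean of those codes. The sketch below does the same calculation in plain Groovy on the five sample lines of users.dat shown earlier.

```groovy
// Sketch only: mean of the age column for the five sample users.dat lines.
def ages = [1, 56, 25, 45, 25]            // bracket codes, not exact ages
def mean = ages.sum() / ages.size()       // Groovy's '/' on integers yields a decimal
println mean
```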
