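DBLP is an English-language bibliography database for computer science that indexes publicly published papers from international journals and conferences. It does not index or search Chinese-language literature; the comparable Chinese system, which integrates authoritative journals and major conference papers, is C-DBLP. DBLP is developed and maintained by Michael Ley at the University of Trier, Germany. It offers a search service for computer science literature but stores only the metadata of each publication, such as title, authors, and publication date, and it keeps that metadata in XML.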
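DBLP data is widely used in academic research, for example in author topic analysis, community detection, relationship recommendation, link prediction, author influence analysis, and research trend analysis. It is highly regarded in academia, and many papers and experiments are built on it. It is also updated frequently: a new XML file is released at the beginning of each month. As of 2016-04-12 it covered more than 3.3 million papers and more than 1.7 million researchers.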
<inproceedings mdate="2012-09-18" key="persons/Codd74">
<author>E. F. Codd</author>
<title>Seven Steps to Rendezvous with the Casual User.</title>
<year>1974</year>
<booktitle>IFIP Working Conference Data Base Management</booktitle>
<url>db/conf/ds/dbm74.html#Codd74</url>
<note>IBM Research Report RJ 1333, San Jose, California</note>
</inproceedings>
<article mdate="2002-01-03" key="persons/Codd69">
<author>E. F. Codd</author>
<title>Derivability, Redundancy and Consistency of Relations Stored in Large Data Banks.</title>
<journal>IBM Research Report, San Jose, California</journal>
<year>1969</year>
<ee>db/labs/ibm/RJ599.html</ee>
</article>
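As a quick way to see what one of these records contains, the following minimal sketch reads a single record with Python's standard xml.etree.ElementTree module. The helper name record_to_dict and the dictionary layout are assumptions made for illustration; they are not part of DBLP or of the parser described later in this post.

```python
import xml.etree.ElementTree as ET

# One record copied from the sample above.
RECORD = """<article mdate="2002-01-03" key="persons/Codd69">
<author>E. F. Codd</author>
<title>Derivability, Redundancy and Consistency of Relations Stored in Large Data Banks.</title>
<journal>IBM Research Report, San Jose, California</journal>
<year>1969</year>
<ee>db/labs/ibm/RJ599.html</ee>
</article>"""


def record_to_dict(xml_text):
    """Turn one DBLP record into a plain dict (hypothetical helper)."""
    elem = ET.fromstring(xml_text)
    return {
        "type": elem.tag,                  # record type, e.g. article
        "key": elem.get("key"),            # attributes of the record element
        "mdate": elem.get("mdate"),
        "authors": [a.text for a in elem.findall("author")],
        "title": elem.findtext("title"),
        "year": int(elem.findtext("year")),
    }


record = record_to_dict(RECORD)
print(record["type"], record["key"], record["authors"], record["year"])
```

This approach loads the whole fragment into memory, so it suits single records or small test files rather than the full DBLP dump.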
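The XML header declares the encoding as ISO-8859-1 ("Latin-1"), but the file content itself is pure ASCII: Latin characters are converted to the corresponding entities, for example é is written as &eacute;. The record types are article, inproceedings, proceedings, book, incollection, phdthesis, mastersthesis, and www.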
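To illustrate the entity convention, the sketch below decodes a name with Python's standard html.unescape. It assumes the entity name matches an HTML named character reference, which holds for &eacute; but should be checked against dblp.dtd for other entities. The sample name and the function decode_entities are made up for the example.

```python
import html


def decode_entities(text):
    """Replace named entities such as &eacute; with the characters they stand for."""
    return html.unescape(text)


# Hypothetical author name written the way an entity-encoded ASCII file would store it.
name = decode_entities("Jos&eacute; Example")
print(name)

# The decoded character fits in the ISO-8859-1 encoding declared by the XML header.
latin1_bytes = name.encode("iso-8859-1")
print(latin1_bytes)
```

When the file is parsed by an XML parser that loads the DTD, this decoding happens automatically; the manual step is only useful when handling raw text.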
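For details on the XML format, see [the official PDF documentation] and [the DBLP XML download page].
This post shows how to parse the XML and save the result to a MySQL database.
The MySQL table used to store the data: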
CREATE TABLE IF NOT EXISTS paper(
    id int(11) NOT NULL,
    ptag varchar(64) DEFAULT NULL,
    title varchar(512) DEFAULT NULL,
    author varchar(256) DEFAULT NULL,
    subtag varchar(64) DEFAULT NULL,
    sub_detail varchar(512) DEFAULT NULL,
    pyear int(11) DEFAULT NULL,
    url varchar(256) DEFAULT NULL,
    mdate varchar(32) DEFAULT NULL,
    pkey varchar(256) DEFAULT NULL,
    publtype varchar(256) DEFAULT NULL
);
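The table can be tried out without a MySQL server. The following sketch uses Python's built-in sqlite3 module as a stand-in (an assumption for illustration only, not the setup used in this post) to create a table with the same columns and batch-insert one row. Note that sqlite3 uses ? placeholders, whereas MySQLdb uses %s.

```python
import sqlite3

# In-memory SQLite database standing in for MySQL (illustration only).
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE IF NOT EXISTS paper(
        id INTEGER NOT NULL,
        ptag TEXT, title TEXT, author TEXT, subtag TEXT, sub_detail TEXT,
        pyear INTEGER, url TEXT, mdate TEXT, pkey TEXT, publtype TEXT
    )
""")

# One row built from the sample inproceedings record; publtype is absent, so it is NULL.
rows = [
    (1, "inproceedings", "Seven Steps to Rendezvous with the Casual User.",
     "E._F._Codd", "booktitle", "IFIP Working Conference Data Base Management",
     1974, "db/conf/ds/dbm74.html#Codd74", "2012-09-18", "persons/Codd74", None),
]

# executemany sends the whole batch with one parameterized statement.
conn.executemany("INSERT INTO paper VALUES (?,?,?,?,?,?,?,?,?,?,?)", rows)
conn.commit()

stored = conn.execute(
    "SELECT title, author, pyear, publtype FROM paper WHERE id = 1").fetchone()
print(stored)
```

Passing values as parameters rather than formatting them into the SQL string avoids quoting problems with titles that contain apostrophes.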
#!/usr/bin/env python3
# -*- coding: UTF-8 -*-
import os
import sys
import xml.sax
import xml.sax.handler

from mysql_util import mysqlutil  # the author's MySQL helper module

# All record types in dblp.xml:
# paper_tags = ('article', 'inproceedings', 'proceedings', 'book',
#               'incollection', 'phdthesis', 'mastersthesis', 'www')
paper_tags = ('article', 'inproceedings')  # only parse these record types
sub_tags = ('publisher', 'journal', 'booktitle')
text_tags = ('title', 'author', 'year', 'url') + sub_tags

# Table and column names follow the CREATE TABLE statement above.
INSERT_SQL = ("insert into paper(id, ptag, title, author, subtag, sub_detail, "
              "pyear, url, mdate, pkey, publtype) "
              "values(%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)")


class DBLPHandler(xml.sax.ContentHandler):
    def __init__(self):
        xml.sax.ContentHandler.__init__(self)
        self.id = 1
        self.util = mysqlutil()
        self.params = []     # rows waiting to be inserted
        self.batch_len = 10  # rows per batch insert
        self.reset()

    def reset(self):
        self.kv = {}         # fields of the record being parsed
        self.field = None    # text-carrying child element being read
        self.buf = []        # text chunks collected for that element

    # Element start event
    def startElement(self, tag, attributes):
        if tag in paper_tags:
            self.reset()
            self.kv['ptag'] = tag
            self.kv['id'] = self.id
            self.id += 1
            # Record attributes; get() returns None when one is missing
            self.kv['pkey'] = attributes.get('key')
            self.kv['mdate'] = attributes.get('mdate')
            self.kv['publtype'] = attributes.get('publtype')
        elif tag in text_tags and 'id' in self.kv:
            # Only collect text while inside a record we are parsing
            self.field = tag
            self.buf = []
            if tag in sub_tags:
                self.kv['subtag'] = tag

    # Element end event
    def endElement(self, tag):
        if tag == self.field:
            text = ''.join(self.buf).strip()
            if tag == 'author':
                # Spaces inside a name become underscores; names are comma-joined below
                self.kv.setdefault('author', []).append(text.replace(' ', '_'))
            elif tag in sub_tags:
                self.kv['sub_detail'] = text
            else:
                self.kv[tag] = text  # title, year, url
            self.field = None
        elif tag in paper_tags and 'id' in self.kv:
            url = self.kv.get('url') or ''
            # Keep conference papers only
            if 'db/conf' in url:
                year = self.kv.get('year')
                # Missing fields stay None so that the driver can store them as NULL
                self.params.append((
                    self.kv['id'], self.kv['ptag'], self.kv.get('title'),
                    ','.join(self.kv.get('author', [])) or None,
                    self.kv.get('subtag'), self.kv.get('sub_detail'),
                    int(year) if year and year.isdigit() else None,
                    url, self.kv.get('mdate'), self.kv.get('pkey'),
                    self.kv.get('publtype')))
                if len(self.params) >= self.batch_len:
                    self.flush()
            self.reset()

    # Character data event; SAX may deliver one element's text in several chunks
    def characters(self, content):
        if self.field is not None:
            self.buf.append(content)

    # End of document: write the rows left over from the last partial batch
    def endDocument(self):
        self.flush()

    def flush(self):
        if self.params:
            print(len(self.params))
            self.util.execute_sql_params(INSERT_SQL, self.params)
            self.params = []


# Usage: python parser.py dblp-2015-03-02.xml
if __name__ == "__main__":
    filename = 'test.xml'
    if len(sys.argv) == 2:
        filename = sys.argv[1]
    if not os.path.exists(filename):
        print('[%s] not exists!' % filename)
        sys.exit(1)
    # Create an XMLReader
    parser = xml.sax.make_parser()
    # Turn off namespaces
    parser.setFeature(xml.sax.handler.feature_namespaces, 0)
    # dblp.xml relies on entities declared in dblp.dtd (keep it next to the XML file).
    # Python 3.7.1+ does not process external entities unless this feature is enabled.
    parser.setFeature(xml.sax.handler.feature_external_ges, True)
    # Install the content handler
    handler = DBLPHandler()
    parser.setContentHandler(handler)
    parser.parse(filename)
    print('Parser Complete!')
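To try the SAX event flow without a MySQL server, the following sketch runs the same idea on a small in-memory document and collects conference papers in a list instead of inserting them. The class name CollectingHandler and the tuple layout are assumptions for this example. It joins text chunks before storing them because a SAX parser may deliver the text of one element in several characters() calls.

```python
import xml.sax

# Small in-memory document built from the two sample records shown earlier.
SAMPLE = b"""<dblp>
<inproceedings mdate="2012-09-18" key="persons/Codd74">
<author>E. F. Codd</author>
<title>Seven Steps to Rendezvous with the Casual User.</title>
<year>1974</year>
<booktitle>IFIP Working Conference Data Base Management</booktitle>
<url>db/conf/ds/dbm74.html#Codd74</url>
</inproceedings>
<article mdate="2002-01-03" key="persons/Codd69">
<author>E. F. Codd</author>
<title>Derivability, Redundancy and Consistency of Relations Stored in Large Data Banks.</title>
<journal>IBM Research Report, San Jose, California</journal>
<year>1969</year>
</article>
</dblp>"""

RECORD_TAGS = ("article", "inproceedings")
FIELD_TAGS = ("title", "year", "url")


class CollectingHandler(xml.sax.ContentHandler):
    """Collects (key, title, year) tuples for conference papers in memory."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self.record = None  # dict for the record being read, or None outside records
        self.field = None   # name of the field element being read
        self.chunks = []    # text chunks received for that field

    def startElement(self, name, attrs):
        if name in RECORD_TAGS:
            self.record = {"key": attrs.get("key")}
        elif self.record is not None and name in FIELD_TAGS:
            self.field, self.chunks = name, []

    def characters(self, content):
        if self.field is not None:
            self.chunks.append(content)

    def endElement(self, name):
        if name == self.field:
            self.record[name] = "".join(self.chunks).strip()
            self.field = None
        elif name in RECORD_TAGS and self.record is not None:
            # Same filter as the parser: keep records whose url points into db/conf
            if "db/conf" in self.record.get("url", ""):
                self.rows.append((self.record["key"], self.record["title"],
                                  int(self.record["year"])))
            self.record = None


handler = CollectingHandler()
xml.sax.parseString(SAMPLE, handler)
print(handler.rows)
```

Only the inproceedings record is kept here, because the article record has no url element that contains db/conf.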
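Full code: [download link, access code: ff52]
It contains the following files:
create_table.sql: creates the DBLP tables
parser.py: parses the XML into the MySQL database
mysql_util: connects to MySQL
gen_data.py: extracts a relevant subset of the data from the MySQL database
porter_stemmer.py: applies stemming to the text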
Note: this post only describes the dataset and provides the relevant links. If you repost it, please cite this link: http://blog.csdn.net/wzgang123/article/details/51131910