Processing the DBLP Experimental Dataset

      • Introduction to DBLP
      • XML Data Format
      • Parsing the XML

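Introduction to DBLP

DBLP is an English-language bibliography database for computer science that indexes papers published in international journals and conferences. It does not index or search Chinese-language literature; the comparable system in China, covering authoritative journals and major conference proceedings, is C-DBLP. DBLP is developed and maintained by Michael Ley of the University of Trier, Germany. It offers search over the computer science literature but stores only metadata about each publication, such as title, authors, and publication date, and it keeps that metadata in XML.

DBLP data is widely used in academic research, for example in author-topic analysis, community detection, relationship recommendation, link prediction, author influence analysis, and research hotspot studies. It is highly regarded in academia, and many papers and experiments are built on it. It is also updated frequently: the XML file is refreshed at the beginning of every month. As of 2016-04-12 it covered more than 3.3 million papers and more than 1.7 million researchers.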

XML Data Format

<inproceedings mdate="2012-09-18" key="persons/Codd74">
    <author>E. F. Codd</author>
    <title>Seven Steps to Rendezvous with the Casual User.</title>
    <year>1974</year>
    <booktitle>IFIP Working Conference Data Base Management</booktitle>
    <url>db/conf/ds/dbm74.html#Codd74</url>
    <note>IBM Research Report RJ 1333, San Jose, California</note>
</inproceedings>
<article mdate="2002-01-03" key="persons/Codd69">
    <author>E. F. Codd</author>
    <title>Derivability, Redundancy and Consistency of Relations Stored in Large Data Banks.</title>
    <journal>IBM Research Report, San Jose, California</journal>
    <year>1969</year>
    <ee>db/labs/ibm/RJ599.html</ee>
</article>

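The XML declaration specifies the ISO-8859-1 ("Latin-1") encoding, but the file content consists entirely of ASCII characters: Latin characters are converted to their corresponding entities, so é is written as &eacute;. The record types are article, inproceedings, proceedings, book, incollection, phdthesis, mastersthesis, and www.
For details on the XML format, see [the official PDF documentation] and [the DBLP XML download page].
This post shows how to parse the XML and store the result in a MySQL database.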
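
To make the entity convention concrete, the short sketch below decodes a few named entities with Python 3's standard `html.unescape`. It is only an illustration: an XML parser resolves these names through the entity declarations in the DTD that accompanies the XML file, not through this function, and both the helper name `decode_entities` and the sample strings are made up for the example.

```python
from html import unescape


def decode_entities(text):
    """Turn named character entities such as &eacute; back into Unicode."""
    return unescape(text)


# Hypothetical ASCII-only strings in the style used by the DBLP dump.
samples = ["Ren&eacute;", "J&uuml;rgen", "Fran&ccedil;ois"]

for s in samples:
    print(s, "->", decode_entities(s))
```

Because the dump is plain ASCII plus entities, it can be read without worrying about the declared Latin-1 encoding; only the decoded values contain non-ASCII characters.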

Table schema used to store the data in MySQL:

CREATE TABLE if not exists paper(
    id int(11) NOT NULL,
    ptag varchar(64) default NULL,
    title varchar(512) default NULL,
    author varchar(256) default NULL,
    subtag varchar(64) default NULL,
    sub_detail varchar(512) default NULL,
    pyear int(11) default NULL,
    url varchar(256) default NULL,
    mdate varchar(32) default NULL,
    pkey varchar(256) default NULL,
    publtype varchar(256) default NULL
)
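
As a rough illustration of how one parsed record maps onto these columns, the sketch below creates the same table and inserts the first sample record. It is written for Python 3 and uses the standard-library `sqlite3` with an in-memory database purely as a stand-in for MySQL, so the placeholder syntax (`?` instead of `%s`) differs from the real script; the helper name `load_rows` is hypothetical.

```python
import sqlite3

# Same column list as the schema above; SQLite accepts these type names as-is.
SCHEMA = """
CREATE TABLE IF NOT EXISTS paper(
    id int(11) NOT NULL,
    ptag varchar(64) default NULL,
    title varchar(512) default NULL,
    author varchar(256) default NULL,
    subtag varchar(64) default NULL,
    sub_detail varchar(512) default NULL,
    pyear int(11) default NULL,
    url varchar(256) default NULL,
    mdate varchar(32) default NULL,
    pkey varchar(256) default NULL,
    publtype varchar(256) default NULL
)
"""

INSERT = (
    "INSERT INTO paper(id, ptag, title, author, subtag, sub_detail, "
    "pyear, url, mdate, pkey, publtype) VALUES (?,?,?,?,?,?,?,?,?,?,?)"
)


def load_rows(rows):
    """Create the table in an in-memory database and insert rows in one batch."""
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    conn.executemany(INSERT, rows)
    conn.commit()
    return conn


# The first sample record from the XML section, flattened into one row.
rows = [(
    1, "inproceedings",
    "Seven Steps to Rendezvous with the Casual User.",
    "E._F._Codd",                      # spaces replaced by underscores
    "booktitle", "IFIP Working Conference Data Base Management",
    1974, "db/conf/ds/dbm74.html#Codd74",
    "2012-09-18", "persons/Codd74", None,
)]

conn = load_rows(rows)
print(conn.execute("SELECT id, ptag, pyear, author FROM paper").fetchall())
```

Multiple authors end up in the single `author` column as a comma-separated string, which is why spaces inside a name are replaced before joining.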

Parsing the XML

#!/usr/bin/python
# -*- coding: UTF-8 -*-

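import os
import sys
import re
import xml.sax

import MySQLdb
import mysql_util
from mysql_util import mysqlutil  # the author's helper module, see the file list at the end

# Python 2 only: make str() on non-ASCII text use UTF-8 instead of raising UnicodeEncodeError
reload(sys)
sys.setdefaultencoding('utf-8')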

#paper_tags = ('article','inproceedings','proceedings','book', 'incollection','phdthesis','mastersthesis','www')
paper_tags = ('article','inproceedings') ## only parse these tags
sub_tags = ('publisher', 'journal', 'booktitle')

class MovieHandler(xml.sax.ContentHandler):
    def __init__(self):
        self.id = 1
        self.kv = {}
        self.reset()
        self.util = mysqlutil()
        self.params = []
        self.batch_len = 10

    def reset(self):
        self.curtag = None
        self.pid = None
        self.ptag = None
        self.title = None
        self.author = None
        self.tag = None
        self.subtag = None
        self.subtext = None
        self.year = None
        self.url = None
        self.mdate = None
        self.key = None
        self.publtype = None
        self.kv = {}

    # Fields whose text content is collected
    text_tags = ('title', 'author', 'year', 'url') + sub_tags

    # Element start event
    def startElement(self, tag, attributes):
        if tag in paper_tags:
            self.reset()
            self.kv['ptag'] = str(tag)
            self.kv['id'] = self.id
            self.id += 1
            # store the record attributes under the names read back in endElement
            for attr, name in (('key', 'pkey'), ('mdate', 'mdate'), ('publtype', 'publtype')):
                if attr in attributes:
                    self.kv[name] = str(attributes[attr])
        elif tag in self.text_tags:
            if tag in sub_tags:
                self.kv['subtag'] = str(tag)
            self.curtag = tag
            self.buf = []  # text chunks of the current field

    # Element end event
    def endElement(self, tag):
        if tag == self.curtag:
            # characters() may deliver one text node in several chunks, so join them here
            text = str(''.join(self.buf).strip())
            self.curtag = None
            if tag == 'author':
                self.kv.setdefault('author', []).append(text.replace(' ', '_'))
            elif tag in sub_tags:
                self.kv['sub_detail'] = text
            else:  # title, year, url
                self.kv[tag] = text

        elif tag in paper_tags:
            kv = self.kv
            author = ','.join(kv['author']) if 'author' in kv else 'NULL'
            url = kv.get('url', 'NULL')
            param = (str(kv.get('id', 0)), kv.get('ptag', 'NULL'), kv.get('title', 'NULL'), author,
                     kv.get('subtag', 'NULL'), kv.get('sub_detail', 'NULL'), kv.get('year', 0), url,
                     kv.get('mdate', 'NULL'), kv.get('pkey', 'NULL'), kv.get('publtype', 'NULL'))

            # keep conference papers only
            if url.find('db/conf') >= 0:
                self.params.append(param)
                if len(self.params) >= self.batch_len:
                    self.flush()

    # Write the buffered rows to MySQL in one batch
    def flush(self):
        if self.params:
            print len(self.params)
            # NOTE: this targets a paper_conf table with a `year` column, while the schema
            # shown earlier defines `paper` with `pyear`; adjust one so that they match
            sql = "insert into paper_conf(id, ptag, title, author, subtag, sub_detail, year, url, mdate, pkey, publtype) values(%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)"
            self.util.execute_sql_params(sql, self.params)
            self.params = []

    # End of document: write whatever is left in the last, partially filled batch
    def endDocument(self):
        self.flush()

    # Character data event: collect text only while inside a tracked field
    def characters(self, content):
        if self.curtag is not None:
            self.buf.append(content)

## python parser.py dblp-2015-03-02.xml
if __name__ == "__main__":

    filename = 'test.xml'
    if len(sys.argv) == 2:
        filename = sys.argv[1]

    if not os.path.exists(filename):
        print '[%s] does not exist!' % filename
        sys.exit(1)

    # create an XMLReader
    parser = xml.sax.make_parser()

    # turn off namespaces
    parser.setFeature(xml.sax.handler.feature_namespaces, 0)

    # install the custom ContentHandler
    Handler = MovieHandler()
    parser.setContentHandler( Handler )

    parser.parse(filename)
    print 'Parser Complete!'
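
For readers who want to try the SAX approach without a MySQL server, here is a minimal standalone sketch in Python 3. It is not the author's parser: the names `RecordCollector` and `parse_records` are hypothetical, the records are collected into a list instead of being written to a database, and the input is the small sample record from the XML section rather than the full dump.

```python
import xml.sax

PAPER_TAGS = ("article", "inproceedings")
FIELD_TAGS = ("title", "author", "year", "url", "journal", "booktitle")


class RecordCollector(xml.sax.ContentHandler):
    """Collect article/inproceedings records as dictionaries."""

    def __init__(self):
        super().__init__()
        self.records = []
        self.record = None   # record currently being read, if any
        self.field = None    # tracked field currently being read, if any
        self.chunks = []     # text pieces of the current field

    def startElement(self, tag, attrs):
        if tag in PAPER_TAGS:
            self.record = {"ptag": tag, "pkey": attrs.get("key"),
                           "mdate": attrs.get("mdate"), "author": []}
        elif self.record is not None and tag in FIELD_TAGS:
            self.field, self.chunks = tag, []

    def characters(self, content):
        # The parser may split one text node into several calls.
        if self.field is not None:
            self.chunks.append(content)

    def endElement(self, tag):
        if tag == self.field:
            text = "".join(self.chunks).strip()
            if tag == "author":
                self.record["author"].append(text)
            else:
                self.record[tag] = text
            self.field = None
        elif tag in PAPER_TAGS:
            self.records.append(self.record)
            self.record = None


SAMPLE = b"""<dblp>
<inproceedings mdate="2012-09-18" key="persons/Codd74">
    <author>E. F. Codd</author>
    <title>Seven Steps to Rendezvous with the Casual User.</title>
    <year>1974</year>
    <booktitle>IFIP Working Conference Data Base Management</booktitle>
    <url>db/conf/ds/dbm74.html#Codd74</url>
</inproceedings>
</dblp>"""


def parse_records(xml_bytes):
    """Parse an XML byte string and return the collected records."""
    handler = RecordCollector()
    xml.sax.parseString(xml_bytes, handler)
    return handler.records


for rec in parse_records(SAMPLE):
    print(rec["pkey"], rec["year"], rec["author"], rec["title"])
```

Buffering the text and joining it when the element closes matters because `characters` can be called more than once per element, and because inline markup inside a title (such as `<i>`) splits the text into separate pieces.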

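Full code: [download link, access password: ff52]
The archive contains the following files:
create_table.sql: creates the DBLP data table
parser.py: parses the XML and loads it into the MySQL database
mysql_util: connects to MySQL
gen_data.py: extracts a subset of the relevant data from the MySQL database
porter_stemmer.py: applies stemming to text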

Note: this post only describes the dataset and provides the corresponding links. If you repost it, please cite the original link: http://blog.csdn.net/wzgang123/article/details/51131910
