The Cora dataset

Reposted from: http://blog.sina.com.cn/s/blog_4c98b96001000boc.html (blog: 苯苯的小田园)

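This one took a lot of searching, so I am writing it down. Thanks to the paper "Object Identification with Attribute-Mediated Dependences" for pointing to the source of the Cora dataset:
http://www.cs.umass.edu/~mccallum/data/ (if the link does not open when copied, type it into the address bar by hand)
   The paper "A Pitfall and Solution in Multi-Class Feature Selection for Text Classification" provided a useful hint: Cora has 6 major classes and 36 subclasses. With that, the relevance problem was finally solved.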

(a) cora-refs.tar.gz dataset
Cora Citation Matching [reference matching, object correspondence]
Text of citations hand-clustered into groups referring to the same paper.
(b) cora-ie.tar.gz dataset
Cora Information Extraction [information extraction]
Research paper headers and citations, with labeled segments for authors, title, institutions, venue, date, page numbers and several other fields.
(c) cora-classify.tar.gz dataset
Cora Research Paper Classification [relational document classification]
Research papers classified into a topic hierarchy with 73 leaves. We call this a relational data set, because the citations provide relations among papers.
(d) cora-hmm.tar.gz
   Cora HMM is the C implementation of HMMs used for information extraction in Cora. It was written by Kristie Seymore.


Cora readme

   Note that in Cora there are two types of papers: those we found on the
Web, and those that are referenced in bibliography sections.  It is
possible that a paper we found on the Web is also referenced by other
papers.


FILE SUMMARY:

* The file 'papers' contains limited information on the papers we found
on the Web.

* The file 'citations' contains the citation graph.

* The file 'classifications' contains the class labels.

* The directory `extractions' contains the extracted authors, title,
abstract, etc., plus the references (and in some cases surrounding
text), from the postscript papers we found on the Web.


PAPERS

The file `papers' has a list of all the postscript file papers.
Three fields, tab separated:

   <id> <filename> <citation string>

There are about 52000 lines in this file, but there are a bunch
of papers that have more than one postscript file.  If you eliminate
lines with duplicate ids there are about 37000 papers.  Note the
citation string is either (1) an arbitrary bibliography reference to
the paper, if one was made or (2) a constructed entry based on the
authors and title extracted from the postscript file.
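
The three-field layout above can be read with a few lines of Python. This is only a minimal sketch: the function name `parse_papers` and the sample lines are made up for illustration, and it assumes each line really has exactly two tab separators before the citation string.

```python
from typing import Dict, Iterable, List, Tuple


def parse_papers(lines: Iterable[str]) -> Dict[str, List[Tuple[str, str]]]:
    """Group (filename, citation string) pairs by paper id."""
    papers: Dict[str, List[Tuple[str, str]]] = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        # Split on the first two tabs only, so a tab inside the
        # citation string would not break the record.
        parts = line.split("\t", 2)
        if len(parts) != 3:
            continue  # skip lines that do not match the assumed layout
        paper_id, filename, citation = parts
        papers.setdefault(paper_id, []).append((filename, citation))
    return papers


# Hypothetical sample: paper 1 has two postscript files.
sample = [
    "1\thttp:##example.org#a.ps\tSmith. A paper.\n",
    "1\thttp:##mirror.example.org#a.ps\tSmith. A paper.\n",
    "2\thttp:##example.org#b.ps\tJones. Another paper.\n",
]
papers = parse_papers(sample)
print(len(sample), "lines,", len(papers), "distinct ids")
```

Grouping by id rather than dropping repeated lines keeps every postscript file name available, while `len(papers)` gives the de-duplicated paper count.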


CITATIONS

The file 'citations' has the citation graph.  Two fields, tab
separated:

   <referring_id> <cited_id>

The referring_id is the id of the paper that has the bibliography
section (always one we have postscript for).  The cited_id is the
paper referenced (we may or may not have postscript for it).  There
are about 715000 citations.
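
As a rough illustration, the two-column file can be loaded into an adjacency structure like this. The names `load_citation_graph` and `count_citations` and the sample edges are hypothetical; ids are kept as strings because the README does not say what they look like.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set


def load_citation_graph(lines: Iterable[str]) -> Dict[str, Set[str]]:
    """Map each referring_id to the set of cited_ids it references."""
    graph: Dict[str, Set[str]] = defaultdict(set)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 2:
            continue  # skip lines that are not "<referring_id>\t<cited_id>"
        referring_id, cited_id = fields
        graph[referring_id].add(cited_id)
    return graph


def count_citations(graph: Dict[str, Set[str]]) -> Dict[str, int]:
    """Count how many distinct papers cite each cited_id."""
    counts: Dict[str, int] = defaultdict(int)
    for cited_ids in graph.values():
        for cited_id in cited_ids:
            counts[cited_id] += 1
    return counts


# Hypothetical sample edges.
edges = ["1\t2\n", "1\t3\n", "4\t2\n"]
graph = load_citation_graph(edges)
cited_counts = count_citations(graph)
print(dict(cited_counts))
```

Only papers with a bibliography section appear as keys of `graph`; a cited paper without postscript shows up only as a value.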


CITATIONS.WITHAUTHORS

The file 'citations.withauthors' contains another copy of the
citation graph.  This time we have also included the authors and file
name of each paper, in addition to each paper's unique paper_id and
the paper_ids of the references it makes.  The format of this file
is:

   ***
   this_paper_id
   filename
   id_of_first_cited_paper
   id_of_second_cited_paper
   .
   .
   .
   *
   Author#1 (of this paper)
   Author#2
   .
   .
   .
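
A possible way to read this record format is sketched below. It assumes that `***` and `*` each stand alone on a line, that the indentation shown above is only presentational, and that the dots stand for "more lines of the same kind" rather than literal content. `PaperRecord` and `parse_withauthors` are names invented for this sketch.

```python
from dataclasses import dataclass, field
from typing import Iterable, Iterator, List


@dataclass
class PaperRecord:
    paper_id: str
    filename: str
    cited_ids: List[str] = field(default_factory=list)
    authors: List[str] = field(default_factory=list)


def _build(block: List[str]) -> PaperRecord:
    """Turn the lines between two '***' markers into a record."""
    paper_id, filename, rest = block[0], block[1], block[2:]
    if "*" in rest:
        split = rest.index("*")  # '*' separates cited ids from authors
        cited, authors = rest[:split], rest[split + 1:]
    else:
        cited, authors = rest, []
    return PaperRecord(paper_id, filename, cited, authors)


def parse_withauthors(lines: Iterable[str]) -> Iterator[PaperRecord]:
    block: List[str] = []
    started = False
    for raw in lines:
        line = raw.strip()
        if line == "***":
            if started and len(block) >= 2:
                yield _build(block)
            block, started = [], True
        elif started and line:
            block.append(line)
    if started and len(block) >= 2:
        yield _build(block)  # flush the last record


# Hypothetical sample with two records; the second cites nothing.
sample = """***
10
http:##example.org#a.ps
20
30
*
A. Author
B. Author
***
11
http:##example.org#b.ps
*
C. Author
""".splitlines()
records = list(parse_withauthors(sample))
for record in records:
    print(record.paper_id, record.cited_ids, record.authors)
```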

CLASSIFICATIONS

The file `classifications' contains the research topic classifications
for each of the files. The format of the file is:
"filename"+"\t"+"classification".  For example:

  http:##www.ri.cmu.edu#afs#cs#user#alex#docs#idvl#dl97.ps    /Information_Retrieval/Retrieval/


The file name is the URL the paper came from, translated to a file
name by changing / to #.  The classification is the label name in the
Cora directory hierarchy.

Note that the class labels were not perfectly assigned.
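
A small sketch of reading one classification line, using the example shown above with an explicit tab between the two fields. The helper names are invented here. Note that reversing the # substitution is only a guess at the original URL: a # that was already part of the URL cannot be told apart from a translated /.

```python
from typing import List, Tuple


def filename_to_url(filename: str) -> str:
    """Undo the '/' -> '#' translation (lossy if the URL contained '#')."""
    return filename.replace("#", "/")


def parse_classification(line: str) -> Tuple[str, List[str]]:
    """Split a line into the file name and the topic path components."""
    filename, label = line.rstrip("\n").split("\t", 1)
    # "/Information_Retrieval/Retrieval/" -> ["Information_Retrieval", "Retrieval"]
    topic_path = [part for part in label.strip().split("/") if part]
    return filename, topic_path


example = (
    "http:##www.ri.cmu.edu#afs#cs#user#alex#docs#idvl#dl97.ps"
    "\t/Information_Retrieval/Retrieval/"
)
filename, topic_path = parse_classification(example)
url = filename_to_url(filename)
print(url)
print(topic_path)
```

Keeping the label as a list of path components makes it easy to group papers by top-level topic (`topic_path[0]`) or by leaf (`topic_path[-1]`).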


EXTRACTIONS

The directory 'extractions' contains 52906 files, one for each
postscript paper that we found on the Web.  The directory contains so
many files, that you probably don't want to 'ls' it.  Commands like
`find extractions -print' will probably work more efficiently.
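
From Python, a similar effect can be had with `os.scandir`, which yields directory entries lazily instead of building one large list. The sketch below uses a temporary directory with three empty files as a stand-in for `extractions`; the function name `iter_files` is made up.

```python
import os
import tempfile
from typing import Iterator


def iter_files(directory: str) -> Iterator[str]:
    """Yield the path of each regular file in directory, one at a time."""
    with os.scandir(directory) as entries:
        for entry in entries:
            if entry.is_file():
                yield entry.path


# Stand-in for the real 'extractions' directory.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("a", "b", "c"):
        open(os.path.join(tmp, name), "w").close()
    count = sum(1 for _ in iter_files(tmp))
print(count, "files")
```

Entries come back in arbitrary order, so sort them explicitly if a stable order matters.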

Each filename in the 'papers' file should have a file here.  I believe
there are also some (perhaps many?) extra files in this tarball that
are not in paper-data that you can just ignore.

Each line of each file corresponds to some bit of data about the
postscript file.  Most of the MIME-like field tags are
straightforward and self-explanatory.  A few notes:

The fields URL, Refering-URL, and Root-URL are given by the spider.
All other fields are extracted automatically from the text, some by
hand-coded regular expressions and some by an HMM information
extractor.

The fields Abstract-found and Intro-found are binary valued indicators
of whether Abstract and/or Introduction sections were found by some
regular expression matching in the paper.

Each Reference field is one bibliography entry found at the end of the
paper.  Note they are marked up using SGML-like tags.  Each Reference
field is optionally followed by one (and possibly more?)
Reference-context fields that are snippets of the postscript file
around where the reference was cited.
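
The README does not spell out the exact line syntax, so the following is only a sketch under a loud assumption: that "MIME-like" means each line looks like `Tag: value`. The sample lines, including the SGML-like tags inside the Reference values, are invented for illustration and are not taken from the dataset.

```python
from typing import Dict, Iterable, List, Tuple


def parse_extraction(
    lines: Iterable[str],
) -> Tuple[Dict[str, List[str]], List[Tuple[str, List[str]]]]:
    """Collect all field values, and pair each Reference with its contexts."""
    fields: Dict[str, List[str]] = {}
    references: List[Tuple[str, List[str]]] = []
    for raw in lines:
        # Assumed layout: "Tag: value" (split on the first colon only).
        name, sep, value = raw.rstrip("\n").partition(":")
        name, value = name.strip(), value.strip()
        if not sep or not name or " " in name:
            continue  # not a "Tag: value" line under this assumption
        fields.setdefault(name, []).append(value)
        if name == "Reference":
            references.append((value, []))
        elif name == "Reference-context" and references:
            # Attach the context to the most recent Reference.
            references[-1][1].append(value)
    return fields, references


# Hypothetical file content.
sample = [
    "URL: http://example.org/a.ps",
    "Abstract-found: 1",
    "Reference: <author>A. Author</author> <title>Some title</title>",
    "Reference-context: as shown in [1], the method",
    "Reference: <author>B. Author</author> <title>Other title</title>",
]
fields, references = parse_extraction(sample)
print(fields["URL"])
for reference, contexts in references:
    print(len(contexts), "context(s):", reference)
```

Values are stored as lists because tags such as Reference repeat within one file.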
