The Road to NLP: A Complete Dataset Collection

The Enron corpus with an extensive annotation of organizational hierarchy.
http://www1.ccls.columbia.edu/~rambow/enron/

======================================================================================================================

CMU AI Repository (very comprehensive!)

http://www.cs.cmu.edu/afs/cs/project/ai-repository/ai/util/areas/

Names Corpus

http://www.cs.cmu.edu/afs/cs/project/ai-repository/ai/util/areas/nlp/corpora/names/
======================================================================================================================

Email Datasets

http://www.cs.cmu.edu/~einat/datasets.html


1. Personal Name Annotation

Due to privacy issues, it is very hard to get hold of large and realistic email corpora. Here you can find
a couple of email datasets, as well as a dataset of newsgroup text, annotated with personal name spans.

The full description of these datasets, including relevant statistics and references, is available in:

Einat Minkov, Richard C. Wang & William W. Cohen, "Extracting Personal Names from Emails:
Applying Named Entity Recognition to Informal Text", in HLT/EMNLP 2005 (PDF)

Some quick details:

  • The email corpora given here were extracted from the Enron corpus, made public by the Federal
    Energy Regulatory Commission (FERC). A version of this data was later purchased by the CALO project,
    and made available for research purposes.
  • The first dataset, 'Enron-Meetings', consists of all messages located in folders named "meetings"
    or "calendar" (excluding a few very large files). Most of these messages are meeting related. The second
    subset, 'Enron-Random', was formed by uniformly sampling a user name (out of 158 users) and then
    randomly sampling an email from that user.
  • As a second type of informal text, we also annotated a collection of newsgroups postings. The
    'Newsgroups' dataset was extracted from the 20Newsgroups corpus, by Vitor R. Carvalho.
  • These datasets are given here in a Minorthird format (plain text, with separate labels files), as well as
    in a 'general' format, where the personal labels are embedded in the text using XML tags.
  • The given zipped files construct a directory tree. The separation into train and test folders corresponds
    to the data splits described in the abovementioned paper. Further separation is for convenience purposes.
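
The two-stage sampling described above for 'Enron-Random' (pick a user uniformly, then pick one of that user's messages at random) can be sketched roughly as follows. This is only an illustration under assumptions: the `mailboxes` mapping, the function name `sample_enron_random`, and the toy message IDs are hypothetical and are not taken from the paper or the released files.

```python
import random

def sample_enron_random(mailboxes, n_samples, seed=0):
    """Two-stage sampling: uniform over users, then uniform over that user's messages.

    `mailboxes` maps a user name to a list of message identifiers (hypothetical layout).
    """
    rng = random.Random(seed)
    users = sorted(u for u, msgs in mailboxes.items() if msgs)  # skip empty mailboxes
    samples = []
    for _ in range(n_samples):
        user = rng.choice(users)               # stage 1: uniform over users
        message = rng.choice(mailboxes[user])  # stage 2: uniform within that user
        samples.append((user, message))
    return samples

# Toy data (made up): one heavy user and one light user.
mailboxes = {
    "user-a": [f"a-{i}" for i in range(1000)],
    "user-b": ["b-0", "b-1"],
}
for user, message in sample_enron_random(mailboxes, n_samples=5):
    print(user, message)
```

Because the user is drawn first, users with few messages are represented about as often as heavy users, unlike sampling uniformly over all messages. The sketch samples with replacement; whether duplicates were excluded in the original procedure is not stated here.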

Download:

  • Enron Meetings: Minorthird format, XML tags
  • Enron Random: Minorthird format, XML tags
  • Newsgroups: Minorthird format, XML tags
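
For the XML-tagged variant, a small sketch of how embedded name labels could be turned into character spans over the untagged text. The tag name `name`, the function name, and the sample sentence are assumptions for illustration; check the downloaded files for the tag actually used. A regular expression is used instead of an XML parser because email text is generally not well-formed XML.

```python
import re

def extract_tagged_spans(text, tag="name"):
    """Return (plain_text, spans); spans are (start, end, surface) offsets into plain_text.

    Assumes non-nested <tag>...</tag> pairs; the tag name is a guess, not confirmed.
    """
    pattern = re.compile(r"<{0}>(.*?)</{0}>".format(re.escape(tag)), re.DOTALL)
    plain_parts, spans = [], []
    cursor = 0  # position in the tagged input
    offset = 0  # position in the untagged output
    for match in pattern.finditer(text):
        before = text[cursor:match.start()]
        plain_parts.append(before)
        offset += len(before)
        surface = match.group(1)
        spans.append((offset, offset + len(surface), surface))
        plain_parts.append(surface)
        offset += len(surface)
        cursor = match.end()
    plain_parts.append(text[cursor:])
    return "".join(plain_parts), spans

# Made-up sentence, not taken from the corpus.
sample = "Hi <name>Alice</name>, please call <name>Bob Jones</name> today."
plain, spans = extract_tagged_spans(sample)
print(plain)
for start, end, surface in spans:
    print(start, end, surface)
```

Keeping the labels as offsets next to plain text is close in spirit to the separate-labels layout mentioned above, though the exact Minorthird label syntax is not reproduced here.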




2. Person name disambiguation and threading

Here you can download Enron corpora and datasets, used for the general problems of entity disambiguation
and the extraction of inter-entity relations. Email here is represented as a relational database, which includes
text. Specifically, the tasks considered in these subsets of the Enron corpus are person name disambiguation
in email and intelligent message threading.

Two variations of the data are provided:

A. Raw email messages, and the corresponding datasets (queries and correct answers), as used in

  • Einat Minkov, William W. Cohen and Andrew Y. Ng,
    "Contextual Search and Name Disambiguation in Email using Graphs", SIGIR 2006 (PDF)

Download: Person name disambiguation corpora, datasets
Threading corpora, datasets


B. graph files (net relations and entity declarations), and the corresponding datasets, as used in

  • Einat Minkov and William W. Cohen,
    "Learning to Rank Typed Graph Walks: Local and Global Approaches",
    WebKDD and SNA-KDD joint workshop 2007 (PDF)

Download: Person name disambiguation corpora, datasets
Threading corpora, datasets

Note: the corpora files of (A) and (B) are different representations of the same data (where reply lines
have been removed in the latter). The datasets are mostly identical, with the exception that some examples
were moved from the training and test sets to a development set.

======================================================================================================================

Software and Datasets


 

Software: Jangada

 

Jangada is an API for signature block extraction and reply-to extraction from email messages. It follows the ideas of the paper "Learning to Extract Signature and Reply Lines from Email" (CEAS 2004), but performance was slightly improved by using a new set of features not mentioned in the original reference.

 

Some Features:

Extracts signature blocks and reply lines in email messages with very good accuracy.

Can be easily integrated in other Java applications (for instance, the entire email message as a String can be used as input).

Can be easily integrated in other Minorthird applications (using the TextLabels format, it accepts as input email messages with other annotations, such as dates, personal names, speech acts, etc.).
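
To give a feel for the task, here is a naive rule-based baseline that labels lines of a message as reply, signature, or body. It is not Jangada's method or API (the referenced paper describes a learning-based approach); it only assumes two common conventions, quoted reply lines starting with ">" and a signature introduced by a "-- " delimiter line. The function name and the sample message are hypothetical.

```python
def label_lines(message):
    """Label each line as 'reply', 'signature', or 'body' using two simple rules."""
    labels = []
    in_signature = False
    for line in message.splitlines():
        if line.rstrip() == "--":            # conventional signature delimiter ("-- ")
            in_signature = True
        if in_signature:
            labels.append((line, "signature"))
        elif line.lstrip().startswith(">"):  # quoted text from an earlier message
            labels.append((line, "reply"))
        else:
            labels.append((line, "body"))
    return labels

# Made-up message for illustration.
message = "\n".join([
    "> Can we meet on Friday?",
    "Friday works for me.",
    "-- ",
    "A. Sender",
    "Example Corp.",
])
for line, label in label_lines(message):
    print(f"{label:9s} | {line}")
```

Real messages often omit the delimiter or quote replies in other styles, which is where a trained extractor is expected to do better than rules like these.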

 

Licensing: University of Illinois/NCSA Open Source License

Documentation: Very poor. An initial javadocs page is here. There is some documentation on how to use Jangada in the example files below.

Requires: j2sdk1.4 or later. Uses MinorThird.jar.

Recommended: When using email files as input, results will be better if the messages are in mime (.eml) format.

 

Usage example:

1.     Create a new directory (for instance, jangadaDir).

2.     Download jangada.jar, minorThird.jar, the example files, and the email files to jangadaDir.

3.     Unzip (gunzip Demos.tar.gz) and untar (tar -xvf Demos.tar) the example files, as well as the email files.

4.     Add jangadaDir, jangadaDir/minorThird.jar and jangadaDir/jangada.jar to the CLASSPATH.

5.     For a quick demo, compile the example files. For instance: “javac Demo2.java” (in case of errors, please check your CLASSPATH again).

6.     Run the examples on the email files directory: “java Demo2 emails/*”.

7.     Check the documentation in the DemoX.java files and try your own application.

 

 

Reminder 1: If you’d like to have access to the source code, please send me an email.

Reminder 2: If you used this package, please cite the following reference:

·       Learning to Extract Signature and Reply Lines from Email, Vitor R. Carvalho and William W. Cohen, CEAS-2004 (Conference on Email and Anti-Spam), Mountain View, CA, July 2004


 

 

Software: Ciranda

 

A Java application that predicts the Email-Acts (or email speech acts) of email messages. It follows the ideas of two papers (EMNLP 2004 and SIGIR 2005), but performance was significantly improved by careful feature selection and additional features.

 

Some Features:

Predicts the following acts: Request, Commit, Deliver, Propose, Meet, dData.

Provides the confidence in each prediction.

Easy way to use these acts as features in your application.

 

Licensing: No guarantees are provided. Lots of bugs for sure. Use at your own risk!

Documentation: Very poor. An initial javadocs page is here. Please check Example.java on how to use it.

Requires: j2sdk1.4 or later. Uses MinorThird.jar (see below).

Questions: I’ll be happy to help, especially if you tell me what a good Ciranda is :-)

 

Usage example:

1.     Create a new directory called ciranda, and ciranda/lib.

2.     Download ciranda.jar and minorThird.jar to ciranda/lib.

3.     Add ciranda/ and lib/ciranda.jar to the CLASSPATH.

4.     Download the example file Example.java to ciranda/.

5.     Compile it: “javac Example.java” (in case of errors, please check your CLASSPATH again).

6.     Run the example: “java Example”.

7.     Or run the main application on a directory with emails in text format (without headers):

8.     Create the test directory ciranda/testdir.

9.     Add some emails in text format (such as msg1, msg2, msg3) to ciranda/testdir.

10.    Run “java -jar lib/ciranda.jar testdir”.

11.    Or try your own application.
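
Since the main application expects emails in text format without headers, one possible way to prepare such files from .eml messages is sketched below using Python's standard `email` package. This is a helper sketch, not part of Ciranda; the function names, the directory arguments, and the sample message are assumptions for illustration.

```python
from email import message_from_string, policy
from pathlib import Path

def body_text(raw_message):
    """Return the plain-text body of an RFC 822 message, without headers."""
    msg = message_from_string(raw_message, policy=policy.default)
    part = msg.get_body(preferencelist=("plain",))
    return part.get_content() if part is not None else ""

def export_bodies(src_dir, dst_dir):
    """Write the body of every .eml file in src_dir to a same-named text file in dst_dir."""
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    for path in sorted(Path(src_dir).glob("*.eml")):
        raw = path.read_text(encoding="utf-8", errors="replace")
        (dst / path.stem).write_text(body_text(raw), encoding="utf-8")

# Made-up message for illustration.
raw = "From: a@example.com\nTo: b@example.com\nSubject: Meeting\n\nCan we meet at 3pm?\n"
print(body_text(raw))
```

A call such as `export_bodies("emails", "ciranda/testdir")` would then produce header-free files (for example msg1 from msg1.eml) that can be passed to the application as in step 10; the source directory name here is hypothetical.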

 

Reminder: Send me an email if you'd like the source code. If you use this package, please cite the following reference:

·       Learning to Classify Email into “Speech Acts”, William W. Cohen, Vitor R. Carvalho and Tom M. Mitchell, EMNLP-2004 (Conference on Empirical Methods in Natural Language Processing), Barcelona, Spain, July 2004

 

 


 

 

Dataset: Signature and Reply Dataset [Datasets in Minorthird Format]

 

These 617 email messages are annotated with signature lines and reply-to lines. The messages are a subset of the 20 Newsgroups dataset (produced by Ken Lang at CMU in the mid-90's).

 


 

 

 

 

Back to Vitor Carvalho’s Home page

======================================================================================================================

The Enron dataset seems to be popular: email often has privacy restrictions, but the Enron set has none. The Enron material dates from 2001 and earlier.

The Enron datasets at CMU:

http://www.cs.cmu.edu/~einat/dat...

List of the Enron data in other places, and variations:
http://infochimps.com/search?que...

Here is a source for chat postings, which should be similar to email.
However, it is from the Naval Postgraduate School in Monterey, CA, so
it may not be as "normal", but it is from 2006:
http://faculty.nps.edu/cmartell/...

That's the best info I could find immediately.
**** ****
There are some academic resources here:
http://www.clres.com/corparchive...

There are a number of these datasets listed here:
http://infochimps.com/tags/text?...


======================================================================================================================

Spam email datasets

http://www.csmining.org/index.php/spam-email-datasets-.html

======================================================================================================================

ACL SIGLEX Links to the CORPORA Mailing List Archive

http://www.clres.com/corparchive.html

Selected messages to the CORPORA mailing list have been categorized and links to the threads have been provided. The categorization is based on a SIGLEX ontology. The links have been generated automatically based on subject, the date, and the sender. The links include only the years 1997 to the present. Before 2000, the CORPORA archive is not threaded. This system is based on the webmaster's categorization and ontology, both of which can easily be modified, for which your suggestions are solicited. Messages can be put into multiple categories. New categories can easily be created. Existing categories can easily be renamed and reorganized. Many messages have been categorized but do not appear here; we are working to improve the automatic linking.

SIGLEX Lexical Resources

  • Corpus Linguistics
    • Definitional Issues
    • Representativeness
    • History
    • Legal Issues
    • Course Design
  • Corpora
    • Linguistically Annotated Corpora
      • English Corpora
        • British National Corpus
        • Brown Corpus
    • Multilingual Corpora
    • Written Language Corpora
      • Sublanguage Corpora
        • Learner Corpora
    • Spoken Language Corpora
    • Language Specific
  • Lexicons
    • Thesauri WordNets
    • New Sense Discovery
    • Language Phenomena
  • Text Tokenisation
    • Stop Lists
    • Text Format Conversions
    • Tokenizers
    • Markup
    • Sentence Splitting
    • Spellchecking
  • Concordancing
    • Collocations
    • Lexical Cohesion
  • Tagging
    • POS-Tagging
  • Mathematical Methods
    • Mutual Information
    • Perplexity
    • Maximum Entropy
    • Chi-Square
    • N-Gram Analysis
    • Frequency Analysis
    • Significance Tests
    • Semantic Similarity
  • Grammars
  • Software
    • Taggers

Last modified August 12, 2004

Maintained by Ken Litkowski


======================================================================================================================

http://www.cs.cmu.edu/~vitor/papers/sigFilePaper_finalversion.pdf


======================================================================================================================
