Email Datasets
http://www.cs.cmu.edu/~einat/datasets.html
Due to privacy issues, it is very hard to get hold of large and realistic email corpora. Here you can find
a couple of email datasets, as well as a dataset of newsgroup text, annotated with personal name spans.
The full description of these datasets, including relevant statistics and references, is available in:
Einat Minkov, Richard C. Wang & William W. Cohen, Extracting Personal Names from Emails:
Applying Named Entity Recognition to Informal Text, in HLT/EMNLP 2005 (PDF)
Some fast details:
Download:
Here you can download Enron corpora and datasets, used for the general problems of entity disambiguation
and the extraction of inter-entity relations. Email here is represented as a relational database, which includes
text. Specifically, the tasks considered in these subsets of the Enron corpus are person name disambiguation
in email and intelligent message threading.
Two variations of the data are provided:
A. raw email messages, and the corresponding datasets (queries and correct answers), as used in
B. graph files (net relations and entity declarations), and the corresponding datasets, as used in
Note: the corpora files of (A) and (B) are different representations of the same data (where reply lines
have been removed in the latter). The datasets are mostly identical, with the exception that some examples
were moved from the training and test sets to a development set.
======================================================================================================================
Software and Datasets
Software: Jangada
Jangada is an API for signature block extraction and reply-to extraction from email messages. It follows the ideas of the paper Learning to Extract Signature and Reply Lines from Email (CEAS 2004), but performance was slightly improved by using a new set of features not mentioned in the original reference.
Some Features:
Extracts signature blocks and reply lines in email messages with very good accuracy.
Can be easily integrated in other Java applications (for instance, the entire email message as a String can be used as input).
Can be easily integrated in other Minorthird applications (using the TextLabels format, it accepts as input email messages with other annotations, such as dates, personal names, speech acts, etc.).
Licensing: University of Illinois/NCSA Open Source License
Documentation: Very poor. An initial javadocs page is here. There is some documentation on how to use Jangada in the example files below.
Requires: j2sdk1.4 or later. Uses MinorThird.jar.
Recommended: When using email files as input, results will be better if the messages are in MIME (.eml) format.
Usage example:
1. Create a new directory (for instance, jangadaDir).
2. Download jangada.jar, minorThird.jar, the example files, and the email files to jangadaDir.
3. Unzip (gunzip Demos.tar.gz) and untar (tar -xvf Demos.tar) the example files, as well as the email files.
4. Add jangadaDir, jangadaDir/minorThird.jar and jangadaDir/jangada.jar to the CLASSPATH.
5. For a quick demo:
6. Compile the example files. For instance: “javac Demo2.java” (in case of errors, please check your CLASSPATH again).
7. Run the examples on the email files directory: “java Demo2 emails/*”
8. Check the documentation in the DemoX.java files and try your own application.
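The feature list mentions that an entire email message can be passed in as a String. As a minimal sketch of that preparation step, the class below reads each file named on the command line into a String, mirroring the “java Demo2 emails/*” invocation. EmailLoader and its methods are hypothetical names and are not part of Jangada; the call into Jangada itself is left as a comment because its API is not described here. The sketch assumes Java 7 or later for java.nio.file.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper (not part of Jangada): loads message files into Strings.
final class EmailLoader {

    // Reads one message file. ISO-8859-1 maps every byte to a character,
    // so decoding cannot fail on messages with unknown or mixed encodings.
    static String readMessage(Path file) throws IOException {
        return new String(Files.readAllBytes(file), StandardCharsets.ISO_8859_1);
    }

    // Reads every regular file in the given paths, keeping the input order.
    // Directories and missing paths are skipped.
    static Map<String, String> readAll(String[] paths) throws IOException {
        Map<String, String> messages = new LinkedHashMap<>();
        for (String p : paths) {
            Path file = Paths.get(p);
            if (Files.isRegularFile(file)) {
                messages.put(file.toString(), readMessage(file));
            }
        }
        return messages;
    }

    public static void main(String[] args) throws IOException {
        for (Map.Entry<String, String> e : readAll(args).entrySet()) {
            // A real application would hand e.getValue() to Jangada here;
            // see the DemoX.java files for the actual calls.
            System.out.println(e.getKey() + ": " + e.getValue().length() + " characters");
        }
    }
}
```

Run it the same way as the demos, for example “java EmailLoader emails/*”, to check that the message files are readable before wiring in the extraction step.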
Reminder 1: If you’d like to have access to the source code, please send me an email.
Reminder 2: If you used this package, please cite the following reference:
· Learning to Extract Signature and Reply Lines from Email, Vitor R. Carvalho and William W. Cohen, CEAS-2004 (Conference on Email and Anti-Spam), Mountain View, CA, July 2004
Software: Ciranda
A Java application that predicts the Email-Acts (or email speech acts) of email messages. It follows the ideas of the following papers (emnlp04 and sigir05), but performance was significantly improved by careful feature selection and additional features.
Some Features:
Predicts the following acts: Request, Commit, Deliver, Propose, Meet, dData.
Provides the confidence in each prediction.
Easy way to use these acts as features in your application.
Licensing: No guarantees are provided. Lots of bugs for sure. Use at your own risk!
Documentation: Very poor. An initial javadocs page is here. Please check Example.java on how to use it.
Requires: j2sdk1.4 or later. Uses MinorThird.jar (see below)
Questions: I’ll be happy to help, especially if you tell me what a good Ciranda is :-)
Usage example:
1. Create a new directory called ciranda, and ciranda/lib.
2. Download ciranda.jar and minorThird.jar to ciranda/lib.
3. Add ciranda/ and lib/ciranda.jar to the CLASSPATH.
4. Download the example file Example.java to ciranda/.
5. Compile it: “javac Example.java” (in case of errors, please check your CLASSPATH again).
6. Run the example: “java Example”
7. Or run the main application on a directory with emails in text format (without headers):
8. Create the test directory ciranda/testdir.
9. Add some emails in text format (such as msg1, msg2, msg3) to ciranda/testdir.
10. Run “java -jar lib/ciranda.jar testdir”
11. Or try your own application.
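The main application expects emails in text format without headers. If your messages still carry headers, one possible way to prepare them is sketched below. It relies only on the general convention (RFC 5322) that the header section ends at the first empty line; HeaderStripper is a hypothetical helper and is not part of Ciranda. A message whose first line does not look like a "Name: value" header field is assumed to be body text already and is returned unchanged.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// Hypothetical helper (not part of Ciranda): removes the header section of a raw message.
final class HeaderStripper {

    // A header field name is printable ASCII without spaces or ':', followed by a colon.
    private static final Pattern HEADER_FIELD = Pattern.compile("^[!-9;-~]+:.*");

    // Returns the body of rawMessage, with line endings normalized to "\n".
    static String stripHeaders(String rawMessage) {
        String[] lines = rawMessage.split("\r?\n", -1);
        if (!HEADER_FIELD.matcher(lines[0]).matches()) {
            return rawMessage; // no header section detected
        }
        for (int i = 0; i < lines.length; i++) {
            if (lines[i].isEmpty()) {
                // Everything after the first empty line is the body.
                return String.join("\n", Arrays.copyOfRange(lines, i + 1, lines.length));
            }
        }
        return ""; // headers only, no body
    }

    public static void main(String[] args) {
        String raw = "From: someone@example.org\r\nSubject: meeting\r\n\r\nCan we meet on Friday?";
        System.out.println(stripHeaders(raw));
    }
}
```

The stripped text can then be saved as msg1, msg2, and so on in ciranda/testdir. Requires Java 8 or later for String.join.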
Reminder: Send me an email if you'd like the source code. If you use this package, please use the following reference:
· Learning to Classify Email into “Speech Acts”, William W. Cohen, Vitor R. Carvalho and Tom M. Mitchell, EMNLP-2004 (Conference on Empirical Methods in Natural Language Processing), Barcelona, Spain, July 2004
Dataset: Signature and Reply Dataset [Datasets in Minorthird Format]
These 617 email messages are annotated with signature lines and reply-to lines. The messages are a subset of the 20 Newsgroups dataset (produced by Ken Lang at CMU in the mid-90's).
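To make the two annotation types concrete, here is a naive rule-based baseline that labels each line of a message body. It is not the learned approach of the CEAS-2004 paper and not the Jangada implementation; it only applies two common conventions: quoted reply lines start with ">", and a signature follows a "-- " (or "--") delimiter line. The class name and the Label values are hypothetical. A baseline like this could serve as a sanity check when evaluating against the annotated messages.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical rule-based baseline; NOT the learned method from the paper.
final class NaiveSigReplyTagger {

    enum Label { REPLY, SIGNATURE, OTHER }

    // Returns one label per line of the message body.
    static List<Label> tag(String body) {
        String[] lines = body.split("\r?\n", -1);

        // Locate the last signature delimiter line, if any.
        int sigStart = -1;
        for (int i = 0; i < lines.length; i++) {
            if (lines[i].equals("-- ") || lines[i].equals("--")) {
                sigStart = i;
            }
        }

        List<Label> labels = new ArrayList<>();
        for (int i = 0; i < lines.length; i++) {
            if (sigStart >= 0 && i >= sigStart) {
                labels.add(Label.SIGNATURE);   // delimiter and everything after it
            } else if (lines[i].trim().startsWith(">")) {
                labels.add(Label.REPLY);       // quoted text from an earlier message
            } else {
                labels.add(Label.OTHER);
            }
        }
        return labels;
    }

    public static void main(String[] args) {
        String body = "> Are you coming?\nYes, see you there.\n-- \nJane Doe\nExample Corp";
        System.out.println(tag(body));
    }
}
```

Signatures without a delimiter and reply lines without a ">" prefix are missed by these rules, which is the kind of case a learned extractor is meant to handle.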
======================================================================================================================
The Enron dataset seems to be popular: email often has privacy restrictions, and the Enron set has none. The Enron messages are from 2001 and earlier.
The Enron datasets at CMU:
http://www.cs.cmu.edu/~einat/dat...
List of the Enron data in other places, and variations:
http://infochimps.com/search?que...
Here is a source for chat postings, which should be similar to email.
However, it is from the Naval Postgraduate School in Monterey, CA, so
it may not be as "normal", but it is from 2006:
http://faculty.nps.edu/cmartell/...
That's the best info I could find immediately.
**** ****
There are some academic resources here:
http://www.clres.com/corparchive...
There are a number of these datasets listed here:
http://infochimps.com/tags/text?...
======================================================================================================================
Spam email datasets
http://www.csmining.org/index.php/spam-email-datasets-.html
======================================================================================================================
http://www.clres.com/corparchive.html
Selected messages to the CORPORA mailing list have been categorized and links to the threads have been provided. The categorization is based on a SIGLEX ontology. The links have been generated automatically based on subject, the date, and the sender. The links include only the years 1997 to the present. Before 2000, the CORPORA archive is not threaded. This system is based on the webmaster's categorization and ontology, both of which can easily be modified, for which your suggestions are solicited. Messages can be put into multiple categories. New categories can easily be created. Existing categories can easily be renamed and reorganized. Many messages have been categorized but do not appear here; we are working to improve the automatic linking.
SIGLEX Lexical Resources
Last modified August 12, 2004
Maintained by Ken Litkowski
======================================================================================================================
http://www.cs.cmu.edu/~vitor/papers/sigFilePaper_finalversion.pdf