Large Movie Review Dataset v1.0
Dataset download: http://ai.stanford.edu/~amaas/data/sentiment/
Overview
This dataset contains movie reviews along with their associated binary sentiment polarity labels. It is intended to serve as a benchmark for sentiment classification. This document outlines how the dataset was gathered, and how to use the files provided.
Dataset
The core dataset contains 50,000 reviews split evenly into 25k train and 25k test sets. The overall distribution of labels is balanced (25k pos and 25k neg). We also include an additional 50,000 unlabeled documents for unsupervised learning.
In the entire collection, no more than 30 reviews are allowed for any given movie because reviews for the same movie tend to have correlated ratings. Further, the train and test sets contain a disjoint set of movies, so no significant performance is obtained by memorizing movie-unique terms and their association with observed labels. In the labeled train/test sets, a negative review has a score <= 4 out of 10, and a positive review has a score >= 7 out of 10. Thus reviews with more neutral ratings are not included in the train/test sets. In the unsupervised set, reviews of any rating are included, and there is an even split between reviews rated > 5 and those rated <= 5.
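To make the labeling rule concrete, here is a minimal Python sketch (ours, not part of the dataset) that maps a 1-10 star rating to the binary polarity used in the labeled sets:

    def polarity_from_stars(rating):
        """Map a 1-10 star rating to the binary polarity used in the
        labeled sets: <= 4 is negative, >= 7 is positive, and 5-6 is
        excluded from the labeled train/test sets."""
        if rating <= 4:
            return "neg"
        if rating >= 7:
            return "pos"
        return None  # neutral ratings appear only in the unsupervised set

    assert polarity_from_stars(3) == "neg"
    assert polarity_from_stars(8) == "pos"
    assert polarity_from_stars(5) is None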
Files
There are two top-level directories [train/, test/] corresponding to the training and test sets. Each contains [pos/, neg/] directories for the reviews with binary labels positive and negative. Within these directories, reviews are stored in text files named following the convention [[id]_[rating].txt] where [id] is a unique id and [rating] is the star rating for that review on a 1-10 scale. For example, the file [test/pos/200_8.txt] is the text for a positive-labeled test set example with unique id 200 and star rating 8/10 from IMDb. The [train/unsup/] directory has 0 for all ratings because the ratings are omitted for this portion of the dataset.
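As an illustration of this layout and naming convention, the following minimal Python sketch (not part of the dataset; it assumes the archive has been unpacked into an aclImdb/ directory) walks one split and recovers the label, id, star rating, and text of each review:

    import os

    def load_split(root, split="train"):
        """Yield (label, review_id, star_rating, text) for one split by
        walking the [pos/, neg/] folders and parsing [id]_[rating].txt."""
        for label in ("pos", "neg"):
            folder = os.path.join(root, split, label)
            for name in sorted(os.listdir(folder)):
                stem, ext = os.path.splitext(name)
                if ext != ".txt":
                    continue
                review_id, rating = stem.split("_")
                with open(os.path.join(folder, name), encoding="utf-8") as f:
                    yield label, int(review_id), int(rating), f.read()

    # Example (assuming the archive was unpacked into ./aclImdb):
    # first_review = next(load_split("aclImdb", "train"))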
We also include the IMDb URLs for each review in a separate [urls_[pos, neg, unsup].txt] file. A review with unique id 200 will have its URL on line 200 of this file. Due to the ever-changing IMDb, we are unable to link directly to the review, but only to the movie's review page.
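A small sketch of this id-to-line correspondence, treating line numbers as 1-based as in the example above (the helper name is ours):

    def url_for_review(urls_path, review_id):
        """Return the IMDb review-page URL stored on line `review_id`
        (1-based, as in the "line 200" example) of a urls_*.txt file."""
        with open(urls_path, encoding="utf-8") as f:
            for line_no, line in enumerate(f, start=1):
                if line_no == review_id:
                    return line.strip()
        raise IndexError("file has fewer than %d lines" % review_id)

    # Example: url_for_review("aclImdb/train/urls_pos.txt", 200)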
In addition to the review text files, we include already-tokenized bag of words (BoW) features that were used in our experiments. These are stored in .feat files in the train/test directories. Each .feat file is in LIBSVM format, an ASCII sparse-vector format for labeled data. The feature indices in these files start from 0, and the text token corresponding to a feature index is found in [imdb.vocab]. So a line with 0:7 in a .feat file means the first word in [imdb.vocab] (the) appears 7 times in that review.
See the LIBSVM page for details on the .feat file format: http://www.csie.ntu.edu.tw/~cjlin/libsvm/
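To make the .feat format concrete, the sketch below reads imdb.vocab and decodes one LIBSVM-formatted line into its leading label field and a {token: count} bag of words; it assumes only the standard index:value layout described above:

    def load_vocab(path="aclImdb/imdb.vocab"):
        """imdb.vocab holds one token per line; line i is feature index i."""
        with open(path, encoding="utf-8") as f:
            return [line.strip() for line in f]

    def parse_feat_line(line, vocab):
        """Decode one LIBSVM-format line ("<label> <idx>:<count> ...")
        into its leading label field and a {token: count} bag of words."""
        fields = line.split()
        counts = {}
        for pair in fields[1:]:
            idx, count = pair.split(":")
            counts[vocab[int(idx)]] = int(count)
        return fields[0], counts

    # A line containing "... 0:7 ..." maps the first vocab entry ("the")
    # to a count of 7 for that review, as in the example above.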
We also include [imdbEr.txt] which contains the expected rating for each token in [imdb.vocab] as computed by (Potts, 2011). The expected rating is a good way to get a sense for the average polarity of a word in the dataset.
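One possible way to use this file, sketched below, is to zip it with imdb.vocab to obtain a token-to-expected-rating lookup; this assumes imdbEr.txt has one value per line aligned with imdb.vocab:

    def load_expected_ratings(vocab_path="aclImdb/imdb.vocab",
                              er_path="aclImdb/imdbEr.txt"):
        """Build a token -> expected-rating lookup, assuming imdbEr.txt
        is aligned line-by-line with imdb.vocab (one float per line)."""
        with open(vocab_path, encoding="utf-8") as v, \
             open(er_path, encoding="utf-8") as e:
            return {tok.strip(): float(score) for tok, score in zip(v, e)}

    # Example: the lowest-valued tokens are the most negative on average.
    # expected = load_expected_ratings()
    # most_negative = sorted(expected, key=expected.get)[:10]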
Citing the dataset
When using this dataset please cite our ACL 2011 paper which introduces it. This paper also contains classification results which you may want to compare against.
@InProceedings{maas-EtAl:2011:ACL-HLT2011,
author = {Maas, Andrew L. and Daly, Raymond E. and Pham, Peter T. and Huang, Dan and Ng, Andrew Y. and Potts, Christopher},
title = {Learning Word Vectors for Sentiment Analysis},
booktitle = {Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies},
month = {June},
year = {2011},
address = {Portland, Oregon, USA},
publisher = {Association for Computational Linguistics},
pages = {142--150},
url = {http://www.aclweb.org/anthology/P11-1015}
}
References
Potts, Christopher. 2011. On the negativity of negation. In Nan Li and David Lutz, eds., Proceedings of Semantics and Linguistic Theory 20, 636-659.
Contact
For questions/comments/corrections, please contact Andrew Maas: [email protected]