Sentiment Analysis: Datasets

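1. The Stanford Sentiment Treebank (SST) from Stanford University has become a standard benchmark dataset. It defines two tasks.
The binary task uses a 6920/872/1821 train/validation/test split. The five-class task (very negative, negative, neutral, positive, very positive) covers 11855 sentences and 215154 labeled phrases (5 classes), with 8544 training, 1101 validation, and 2210 test sentences.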
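
The two tasks are related: the binary split is commonly derived from the five-class data by dropping neutral sentences and merging the remaining labels, which is consistent with the smaller binary split sizes above. Below is a minimal Python sketch of that mapping. The integer encoding 0 (very negative) to 4 (very positive), the helper name `to_binary`, and the toy samples are all assumptions made for illustration; they are not taken from the dataset release.

```python
from typing import Optional

# Assumed integer encoding of the five classes (illustrative only).
FIVE_CLASS = ["very negative", "negative", "neutral", "positive", "very positive"]


def to_binary(label: int) -> Optional[int]:
    """Map a five-class label to binary: 0 = negative, 1 = positive.

    Returns None for neutral, which the binary task leaves out.
    """
    if not 0 <= label < len(FIVE_CLASS):
        raise ValueError(f"label out of range: {label}")
    if label == 2:
        return None
    return 0 if label < 2 else 1


# Hypothetical toy examples, not real SST sentences.
samples = [("awful", 0), ("weak", 1), ("okay", 2), ("good", 3), ("superb", 4)]

# Keep only the examples that survive the mapping (neutral is dropped).
binary = [(text, to_binary(y)) for text, y in samples if to_binary(y) is not None]
```

In this sketch the neutral example is filtered out and the four remaining examples keep only their polarity.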

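2. IMDB: contains 100,000 review texts, of which 25000 are labeled for training and 25000 for testing (2 classes, already class-balanced), plus 50000 unlabeled reviews.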


3. Yelp: includes business, user, review, tip, and checkin data.


4. Amazon reviews

5. SemEval

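SemEval 2014:

    Restaurant domain: 3841 sentences, with 3041 for training and 800 for testing;

    Laptop domain: 3845 sentences, with 3045 for training and 800 for testing.

    There are four classes: pos, neg, neu, and conflict. Tang's paper does not consider the conflict class.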

Dataset            Pos.   Neg.   Neu.   Total
Laptop-Train        994    870    464    2328
Laptop-Test         341    128    169     638
Restaurant-Train   2164    807    637    3608
Restaurant-Test     728    196    196    1120
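
The per-class counts listed above can be checked against their totals and turned into class proportions with a few lines of Python. This is only a sketch: the dictionary layout and the helper name `class_shares` are illustrative choices, and the numbers are copied from the counts above.

```python
# Per-class sentence counts copied from the SemEval 2014 figures above
# (conflict examples already excluded).
counts = {
    "Laptop-Train": {"pos": 994, "neg": 870, "neu": 464},
    "Laptop-Test": {"pos": 341, "neg": 128, "neu": 169},
    "Restaurant-Train": {"pos": 2164, "neg": 807, "neu": 637},
    "Restaurant-Test": {"pos": 728, "neg": 196, "neu": 196},
}


def class_shares(row):
    """Return each class's fraction of the split's total."""
    total = sum(row.values())
    return {label: n / total for label, n in row.items()}


# Total number of examples per split.
totals = {name: sum(row.values()) for name, row in counts.items()}

for name, row in counts.items():
    shares = class_shares(row)
    summary = ", ".join(f"{label} {share:.1%}" for label, share in shares.items())
    print(f"{name:<17} total={totals[name]:<5} {summary}")
```

The totals come out to 2328, 638, 3608, and 1120, and the positive class is the largest one in every split, so it may be worth reporting per-class metrics alongside accuracy.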

Example:

Although we were looking for regular lettuce and some walnuts the salads we got were great.

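SemEval 2015:

    Restaurant domain: 2000 training sentences (350 reviews), 48 validation sentences (10 reviews), and 676 test sentences (90 reviews);

    Laptop domain: 2500 training sentences (450 reviews), 55 validation sentences (10 reviews), and 808 test sentences (80 reviews).

    There are three classes: pos, neg, and neu.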


Restaurant example:

Went on a 3 day oyster binge, with Fish bringing up the closing, and I am so glad this was the place it O trip ended, because it was so great!Service was devine, oysters where a sensual as they come, and the price can't be beat!!!You can't go wrong here.

Laptop example (aspect terms are not annotated):

the laptop was really good and it goes really fast just the way i thought it would of run.i would really recommend to any person out there to get this laptop cause its really worth it.and its really cheap and you wont regret buying it.


6. Stanford Twitter Sentiment (STS): contains 1.6M tweets (2 classes). The author randomly selected 80K as the training set and 16K as the validation set, with 498 as the test set.
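
Drawing fixed-size random subsets like this can be sketched with Python's standard library. The exact sampling procedure used for STS is not described here, so the function name `draw_split`, the fixed seed, and the choice to sample indices without replacement are all assumptions for illustration.

```python
import random


def draw_split(n_total, n_train, n_val, seed=0):
    """Draw disjoint random train/validation index sets from range(n_total)."""
    if n_train + n_val > n_total:
        raise ValueError("requested more examples than available")
    rng = random.Random(seed)  # local generator so the split is reproducible
    # Sample without replacement; the result is already in random order,
    # so slicing it yields two disjoint random subsets.
    picked = rng.sample(range(n_total), n_train + n_val)
    return picked[:n_train], picked[n_train:]


# Sizes taken from the text: 1.6M tweets, 80K for training, 16K for validation.
train_idx, val_idx = draw_split(1_600_000, 80_000, 16_000)
```

Sampling indices rather than the tweets themselves keeps memory use small, and the indices can then be used to select rows from whatever storage format holds the corpus.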
