Commonly Used NLP Corpora

1. Sogou News Corpus

The Sogou news corpus contains a total of 2,909,551 news articles across various topic channels.
Reference [1] describes and uses it as follows:

There are a large number categories but most of them contain only few articles. We choose 5 categories – “sports”, “finance”, “entertainment”, “automobile” and “technology”. The number of training samples selected for each class is 90,000 and testing 12,000.
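The selection described in the quote (keep a handful of categories, then draw a fixed number of training and test samples per class) can be sketched roughly as follows. This is only an illustration: the `(category, text)` input layout, the function name `balanced_split`, and the toy data are assumptions, not the corpus's actual file format or the paper's code.

```python
import random
from collections import defaultdict

# The five categories kept in reference [1].
CATEGORIES = ["sports", "finance", "entertainment", "automobile", "technology"]


def balanced_split(articles, n_train, n_test, categories=CATEGORIES, seed=0):
    """Sample n_train + n_test articles per category and split them.

    `articles` is an iterable of (category, text) pairs. This layout is an
    assumption for the sketch, not the real format of the Sogou corpus.
    """
    by_category = defaultdict(list)
    for category, text in articles:
        if category in categories:  # drop the sparse categories
            by_category[category].append(text)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, test = [], []
    for category in categories:
        texts = by_category[category]
        if len(texts) < n_train + n_test:
            raise ValueError(f"not enough articles for {category!r}")
        picked = rng.sample(texts, n_train + n_test)  # without replacement
        train += [(category, t) for t in picked[:n_train]]
        test += [(category, t) for t in picked[n_train:]]
    return train, test


# Toy data standing in for the real corpus (hypothetical).
toy = [(c, f"{c} article {i}") for c in CATEGORIES + ["travel"] for i in range(10)]
train, test = balanced_split(toy, n_train=6, n_test=2)
print(len(train), len(test))  # 30 10
```

With the figures from the quote, the call would use `n_train=90000` and `n_test=12000`.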

2. YFCC 100M

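A multimedia dataset from Yahoo Labs; its use is not limited to NLP. The address is given in reference [3].
It contains roughly 100 million images and just under 1 million videos, each with a title, captions and tags.
Its annotation is multi-label: a photo of a small dog, for example, may be tagged animal / puppy / pet / shih tzu and so on.
The fastText paper uses it for tag prediction.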
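To make the multi-label idea concrete, here is a minimal sketch that turns one (tags, text) pair into a single training line in the style fastText's supervised mode reads, assuming its default `__label__` prefix. The helper name `to_fasttext_line` and the sample record are hypothetical; this is not the preprocessing used in the fastText paper.

```python
def to_fasttext_line(tags, text):
    """Format one multi-label example as `__label__<tag> ... <text>`.

    Assumes fastText's default label prefix `__label__`. Spaces inside a tag
    are replaced with underscores so each tag stays a single token.
    """
    labels = " ".join(f"__label__{tag.strip().replace(' ', '_')}" for tag in tags)
    one_line_text = " ".join(text.split())  # collapse newlines and extra spaces
    return f"{labels} {one_line_text}".strip()


# Hypothetical record: one image with several tags and a short title/caption.
line = to_fasttext_line(["animal", "puppy", "pet"], "my dog playing\nin the park")
print(line)
# __label__animal __label__puppy __label__pet my dog playing in the park
```

Each image contributes one line carrying all of its tags, so the model sees every tag as a valid target for the same text.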

References

  1. Character-level Convolutional Networks for Text Classification
  2. Sogou Labs
  3. YFCC 100M
