Text Classification Datasets: A Selection of Datasets for PU-Learning, Text Classification, Text Clustering, and Sentiment Analysis

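This post collects download links for some commonly used datasets in PU-Learning, text classification, text clustering, and sentiment analysis research.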

(Every dataset here is used in a large body of literature; for now only one representative paper is listed for each.)

  • Lang K. NewsWeeder: Learning to filter netnews[C]// Twelfth International Conference on Machine Learning. Morgan Kaufmann Publishers Inc. 1995.

Sources:http://qwone.com/~jason/20Newsgroups/

  • Craven M, Freitag D, McCallum A, et al. Learning to Extract Symbolic Knowledge from the World Wide Web[C]// Proc of the National Conference on Artificial Intelligence. 1998.

Sources:http://www.cs.cmu.edu/~webkb/

  • M. Ott, Y. Choi, C. Cardie, and J.T. Hancock. 2011. Finding Deceptive Opinion Spam by Any Stretch of the Imagination. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies.

Sources:https://myleott.com/op-spam.html

  • Reuters-21578

Sources:http://www.daviddlewis.com/resources/testcollections/reuters21578/

  • IMDb data

Sources:https://datasets.imdbws.com/

  • Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan. Thumbs up? Sentiment Classification using Machine Learning Techniques. Proceedings of EMNLP 2002.

Sources:http://www.cs.cornell.edu/people/pabo/movie-review-data/

  • Jindal N, Liu B. Opinion spam and analysis[C]// International Conference on Web Search & Data Mining. 2008.

Sources:https://www.cs.uic.edu/~liub/FBS/fake-reviews.html

  • Li J, Ott M, Cardie C, et al. Towards a General Rule for Identifying Deceptive Opinion Spam[C]// Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2014.

Sources:https://web.stanford.edu/~jiweil/Code.html
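For PU-Learning experiments, fully labeled corpora like the ones above are often converted into a positive/unlabeled setting: one class is treated as positive, a fraction of its documents keep their label, and everything else becomes unlabeled. The sketch below illustrates one way to do this. The function name `make_pu_labels`, the `labeled_fraction` parameter, and the toy labels are assumptions for illustration only and do not come from any of the papers listed here.

```python
import random

def make_pu_labels(labels, positive_class, labeled_fraction=0.3, seed=0):
    """Return PU labels: 1 = labeled positive, 0 = unlabeled.

    Hypothetical helper: keeps the label on a random fraction of the
    documents in `positive_class` and hides the label of all others.
    """
    rng = random.Random(seed)
    positive_idx = [i for i, y in enumerate(labels) if y == positive_class]
    n_labeled = int(round(labeled_fraction * len(positive_idx)))
    labeled = set(rng.sample(positive_idx, n_labeled))
    return [1 if i in labeled else 0 for i in range(len(labels))]

# Toy class labels standing in for the labels of a real corpus.
labels = ["earn", "acq", "earn", "crude", "earn",
          "acq", "earn", "earn", "trade", "earn"]
pu = make_pu_labels(labels, positive_class="earn", labeled_fraction=0.5)
print(pu)
```

The unlabeled set produced this way mixes the remaining positives with all negatives, which is the situation PU-Learning methods are designed for.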

——————————————————————————————————————

09.23. Addendum:

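  • All UCI (University of California, Irvine) datasets (most machine learning algorithms are tested on UCI datasets; their value lies in being "standard", which makes preprocessing considerably simpler)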

Sources:http://archive.ics.uci.edu/ml/datasets.php

——————————————————————————————————————

09.25. Addendum:

  • The NASA 2008 PHM competition dataset, used mainly for predictive maintenance

Sources:https://ti.arc.nasa.gov/tech/dash/groups/pcoe/prognostic-data-repository/

  • Reuters-21578 Text Categorization Collection; this dataset is commonly used for text classification

Sources:http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.html

——————————————————————————————————————

10.13. Addendum:

  • Preprocessed versions of the 20 Newsgroups, Reuters-21578 R8, Reuters-21578 R52, and WebKB datasets

Sources:https://www.cs.umb.edu/~smimarog/textmining/datasets/

or

Sources:http://ana.cachopo.org/datasets-for-single-label-text-categorization
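As a rough guide to working with preprocessed files of this kind, the sketch below parses a plain-text layout with one document per line: a class label, a TAB, then space-separated tokens. That layout is an assumption here and should be checked against the files actually downloaded. The helper name `read_labeled_lines` and the sample text are made up for illustration.

```python
import io
from collections import Counter

def read_labeled_lines(handle):
    """Parse lines of the assumed form '<label>\\t<token token ...>'.

    Returns a list of (label, tokens) pairs; blank lines are skipped.
    """
    docs = []
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue
        label, _, text = line.partition("\t")
        docs.append((label, text.split()))
    return docs

# In-memory sample standing in for a downloaded file.
sample = io.StringIO(
    "earn\tcompany reports higher profit\n"
    "acq\tfirm agrees to buy rival\n"
    "earn\tquarterly net rises\n"
)
docs = read_labeled_lines(sample)
counts = Counter(label for label, _ in docs)
print(counts)
```

For a real file, the same function can be given an open file object, for example `open(path, encoding="utf-8")`, in place of the `io.StringIO` sample.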
