A Summary of Text Classification Datasets

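The table below summarizes the text classification datasets I was able to download (as of 2020-07-01). Performance values are percentages, reported as either error rate or accuracy depending on the source: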

| Dataset | Classes | Type | Samples | Best Method | Performance (%) |
|---|---|---|---|---|---|
| AG News | 4 | Topic | Train: 120,000; Test: 7,600 | XLNet | Error: 4.45 |
| DBpedia | 14 | Topic | Train: 560,000; Test: 70,000 | XLNet | Error: 0.6 |
| TREC-6 | 6 | Question | Train: 5,452; Test: 500 | USE_T+CNN | Error: 1.93 |
| TREC-50 | 50 | Question | Train: 5,452; Test: 500 | Rules | Error: 2.8 |
| 20NEWS | 20 | Topic | 20,000 | SGC | Acc: 88.5 |
| IMDb | 2 | Sentiment | Train: 25,000; Test: 25,000 | XLNet | Acc: 96.8 |
| Yahoo! Answers | 10 | Question | Train: 1,400,000; Test: 60,000 | BERT-ITPT-FiT | Acc: 77.62 |
| R8 | 8 | Topic | Train: 5,485; Test: 2,189 | NABoE-full | Acc: 97.9 |
| Ohsumed | 23 | Disease classification | 50,216 | SGCN | Acc: 68.5 |
| Sogou News | 5 | Topic | Train: 450,000; Test: 60,000 | BERT-ITPT-FiT | Acc: 98.07 |
| Amazon-2 | 2 | Rating polarity (1-2: negative, 4-5: positive) | Per class: Train: 1,800,000; Test: 200,000 | XLNet | Error: 2.11 |
| Amazon-5 | 5 | User rating (1-5) | Per class: Train: 600,000; Test: 130,000 | XLNet | Error: 31.67 |
| Yelp-2 | 2 | Rating polarity (1-2: negative, 4-5: positive) | Per class: Train: 130,000; Test: 10,000 | XLNet | Acc: 98.63 |
| Yelp-5 | 5 | User rating (1-5) | Per class: Train: 130,000; Test: 10,000 | HAHNN | Acc: 73.28 |
| Reuters-21578 | 90 | Topic | Train: 7,769; Test: 3,019 | MPAD-path | Acc: 97.44 |
| Cora | 7 | Paper category (e.g. genetic algorithms) | 2,708 | ACNet | Acc: 83.5 |
| BBCSports | 5 | Topic | 737 | MPAD-path | Acc: 99.59 |
| WOS-11967 | 35 (7 parent classes) | Paper category (e.g. CS -> computer graphics) | 11,967 | RMDL | Acc: 91.59 |
| WOS-46985 | 134 (7 parent classes) | Paper category (e.g. CS -> computer graphics) | 46,985 | RMDL | Acc: 82.42 |
| WOS-5736 | 11 (3 parent classes) | Paper category (e.g. CS -> computer graphics) | 5,736 | RMDL | Acc: 93.57 |
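Because the table mixes error rates and accuracies, rows are easier to compare after converting everything to accuracy. The snippet below is a minimal sketch of that conversion, assuming every value is a percentage and that error and accuracy sum to 100. The helper name `to_accuracy` and the `rows` list are illustrative only, and the sample rows are copied from the table.

```python
def to_accuracy(metric: str, value: float) -> float:
    """Return accuracy in percent for a value reported as 'Acc' or 'Error'.

    Assumption: both metrics are percentages and accuracy = 100 - error.
    """
    kind = metric.strip().rstrip(":").lower()
    if kind == "acc":
        return value
    if kind == "error":
        return round(100.0 - value, 2)
    raise ValueError(f"unknown metric: {metric!r}")


# A few rows copied from the table: (dataset, metric, value).
rows = [
    ("AG News", "Error", 4.45),
    ("IMDb", "Acc", 96.8),
    ("Amazon-5", "Error", 31.67),
    ("Yelp-5", "Acc", 73.28),
]

accuracies = {name: to_accuracy(metric, value) for name, metric, value in rows}

# Print the datasets from highest to lowest accuracy.
for name, acc in sorted(accuracies.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:10s} {acc:6.2f}")
```

Note that these numbers come from different models and test sets, so the ranking only indicates how hard each benchmark is for its best reported method, not how the methods compare with each other.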

Datasets I could not download: DODF Data, MVICTOR(type), RCV1, TRAC2-Benghali. Task 2., TRAC2-English. Task2., AffCon 2020 Emotion Detection, IMDb-M, AAPD, Yelp-14, Reuters En-De, Reuters De-En, MPQA, HoC

Reference links:
Text Classification
Document Classification

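Since some readers have asked for the data, the download link is open and free. If you grab the files, a like would be much appreciated!
Link: https://pan.baidu.com/s/10jFP1CfE-HyCVVCYY9XZWw
Extraction code: 0617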
