本博客系博主个人理解和整理所得,包含内容无法详尽,如有补充,欢迎讨论。
这里只提供数据集相关介绍和来源出处,或者下载地址等,因版权原因不提供数据集所含的元数据。如有需要,请自行下载。
“Complete dataset cannot be distributed because of Twitter privacy policies and news publisher copy rights. Social engagements and user information are not disclosed because of Twitter Policy.”
Fake News Detection | Papers With Code
GitHub - ICTMCG/fake-news-detection: This repo is a collection of AWESOME things about fake news detection, including papers, code, etc.
论文:Combating Fake News: A Survey on Identification and Mitigation Techniques
论文:TACL2022 A Survey on Automated Fact-Checking
论文:Fake news detection: A hybrid CNN-RNN based deep learning approach
论文:The Surprising Performance of Simple Baselines for Misinformation Detection,该论文的github项目地址GitHub - ComplexData-MILA/misinfo-baselines有本文实验所用的数据集。
https://github.com/KaiDMML/FakeNewsNet
FakeNewsNet | Kaggle
GitHub - SaschaStenger/FakeNewsNet_modified
GitHub - hwang219/AttackFakeNews
有images;
数据集的下载地址:
GitHub - entitize/Fakeddit: r/Fakeddit New Multimodal Benchmark Dataset for Fine-grained Fake News Detection
GitHub - faiazrahman/Multimodal-Fake-News-Detection: Multi-Modal Fine-Grained Fake News Detection with Dialogue Summarization
https://github.com/cyang03/CHECKED
数据来源是从politifact网站上爬的;
数据集下载地址:
https://www.cs.ucsb.edu/~william/data/liar_dataset.zip
GitHub - thunlp/CED: source code for TKDE paper “CED: Credible Early Detection of Social Media Rumors”
GitHub - thunlp/Chinese_Rumor_Dataset: 中文谣言数据
http://alt.qcri.org/⇠wgao/data/rumdect.zip
https://www.dropbox.com/s/46r50ctrfa0ur1o/rumdect.zip?dl=0
dataset seeks to retrieve supporting evidence for single-sentence claims and classify the
claims as Supported, Refuted or NotEnoughInfo.
使用该数据集的论文:
1. 又一个PHEME [33],
2.PolitiFact5 is a website that manually assigns factcheck label to claims, along with the background information.
PolitiFact
使用该数据集的论文:[7]
3.Zlatkova et al. (2019) propose a dataset for fact-checking claims about images [3].
4.TabFact (Chen et al., 2019) presents semi-structural tables for fact verification [4].
5.The SemEval-2020 shared task (Da San Martino et al., 2020) centers on the detection of propaganda techniques in news articles, which is more linguistically oriented [5].
6.infosurgeon [6]通过修改news的KG或者ARM graph生成的虚假新闻:
7. Constraint 数据集[34];
跟covid-19有关的数据;
9.All-in-one: Multi-task Learning for Rumour Verifification.
10. 文章[20]提出了两个数据集Snopes和Politifact,有images。
we split the full dataset into two sub datasets called Snopes and Politifact datasets. The former one contains pairs where FC-articles are from snopes.com and the later one contains pairs where FC-articles are from politifact.com.
这是该文章的code和dataset地址(还没下载,因为在google云上):
https://github.com/nguyenvo09/EMNLP2020
11.ANTiVax 数据集[35];
跟covid-19有关,来自twitter;
12. FakeNews AMT [12]提出了该数据集
使用该数据集的论文:[10] [11]
13. Celeb [12]提出了该数据集
使用该数据集的论文:[10] [11]
14 PHEME 9 出自[13]
使用该数据集的论文:[10]
16.Sentimental-LIAR [14]提出的,对LIAR添加了情感信息和emotions。
21.又一个twitter数据集来自[17]:
https://www.dropbox.com/s/7ewzdrbelpmrnxu/rumdetect2017.zip?dl=0
由于twitter的限制,该数据集只提供了source tweet ID 及 source tweet content,还有传播树结构。
其他post相关内容需要自己根据id去爬。
这个好像就是text研究中用的很多的twitter15和twitter16,没有image内容。
[1] James Thorne, Andreas Vlachos, Christos Christodoulopoulos, and Arpit Mittal. 2018.
Fever: a large-scale dataset for fact extraction and verification. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 809–819.
[2] F. Yu, Q. Liu, S. Wu, L. Wang, and T. Tan, “A convolutional approach for misinformation identification,” in Proceedings of IJCAI, 2017.
[3] Dimitrina Zlatkova, Preslav Nakov, and Ivan Koychev. 2019. Fact-checking meets fauxtography: Verifying claims about images. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2099–2108, Hong Kong, China. Association for Computational Linguistics.
[4] Wenhu Chen, Hongmin Wang, Jianshu Chen, Yunkai Zhang, Hong Wang, Shiyang Li, Xiyou Zhou, and William Yang Wang. 2019. Tabfact: A largescale dataset for table-based fact verification. arXiv preprint arXiv:1909.02164.
[5] Giovanni Da San Martino, Alberto Barron-Cedeno, ´Henning Wachsmuth, Rostislav Petrov, and Preslav Nakov. 2020. Semeval-2020 task 11: Detection of ropaganda techniques in news articles. In Proceedings of the Fourteenth Workshop on Semantic Evaluation, pages 1377–1414.
[6] InfoSurgeon: Cross-Media Fine-grained Information Consistency Checking for Fake News Detection
[7] Embracing Domain Differences in Fake News: Cross-domain Fake News Detection
using Multimodal Data,AAAI 2021.
[8] Limeng Cui and Dongwon Lee. 2020. CoAID: Covid-19 healthcare misinformation dataset. arXiv preprint arXiv:2006.00885
[9] Kai Shu, Deepak Mahudeswaran, Suhang Wang, Dongwon Lee, and Huan Liu. 2020a. Fakenewsnet: A data repository with news content, social context, and spatiotemporal information for studying fake news on social media. Big data, 8(3):171–188.
[10] An Emotion-Based Multi-Task Approach to Fake News Detection (Student Abstract)
[11] Saikh, T.; De, A.; Ekbal, A.; and Bhattacharyya, P. 2020. A Deep Learning Approach for Automatic Detection of Fake News. arXiv:2005.04938.
[12] Perez-Rosas, V.; Kleinberg, B.; Lefevre, A.; and Mihalcea, R. ´ 2018. Automatic Detection of Fake News. In Proceedings of the 27th International Conference on Computational Linguistics,
3391–3401. Santa Fe, New Mexico, USA: ACL.
[13] Zubiaga, A.; Liakata, M.; and Procter, R. 2016. Learning Reporting Dynamics during Breaking News for Rumour Detection in Social Media. arXiv:1610.07363.
[14] Upadhayay, B.; and Behzadan, V. 2020. Sentimental LIAR: Extended Corpus and Deep Learning Models for Fake Claim Classification. In 2020 IEEE International Conference on
Intelligence and Security Informatics (ISI), 1–6. IEEE.
[15] Arkaitz Zubiaga, Maria Liakata, and Rob Procter. 2017. Exploiting context for rumour detection in social media. In International Conference on Social Informatics. Springer, 109–123.
[16] Jing Ma, Wei Gao, Prasenjit Mitra, Sejeong Kwon, Bernard J Jansen, Kam-Fai Wong, and Meeyoung Cha. Detecting rumors from microblogs with recurrent neural networks. In Proceedings of IJCAI 2016.
[17] Jing Ma, Wei Gao, Kam-Fai Wong. Detect Rumors in Microblog Posts Using Propagation Structure via Kernel Learning. ACL 2017.
[18] IJCAI 2022, MFAN: Multi-modal Feature-enhanced Attention Networks for Rumor Detection;
[19] Changhe Song, Cheng Yang, Huimin Chen, Cunchao Tu, Zhiyuan Liu, and Maosong Sun.
Ced: Credible early detection of social media rumors. IEEE Transactions on Knowledge and Data Engineering, 33(8):3035–3047, 2019
[20] Nguyen Vo and Kyumin Lee. 2020. Where Are the Facts? Searching for Fact-checked Information to Alleviate the Spread of Fake News. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 7717–7731, Online. Association for Computational Linguistics.
[21] Kai Nakamura, Sharon Levy, and William Yang Wang. 2020. Fakeddit: A New Multimodal Benchmark Dataset for Fine-grained Fake News Detection. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 6149–6157, Marseille, France. European Language Resources Association.
[22] ACL 2017,“Liar, Liar Pants on Fire”: A New Benchmark Dataset for Fake News Detection
[23] Ziyi Kou, Daniel Yue Zhang, Lanyu Shang, and Dong Wang. 2020. ExFaux: A Weakly Supervised Approach to Explainable Fauxtography Detection. In 2020 IEEE International Conference on Big Data (IEEE BigData 2020), Atlanta, GA, USA, December 10-13, 2020, Xintao Wu, Chris Jermaine, Li Xiong, Xiaohua Hu, Olivera Kotevska, Siyuan Lu, Weija Xu, Srinivas Aluru, Chengxiang Zhai, Eyhab AlMasri, Zhiyuan Chen, and Jeff Saltz (Eds.). IEEE, 631–636. https://doi.org/10. 1109/BigData50022.2020.9378019
[24] Daniel Yue Zhang, Lanyu Shang, Biao Geng, Shuyue Lai, Ke Li, Hongmin Zhu, Md. Tanvir Al Amin, and Dong Wang. 2018. FauxBuster: A Content-free Fauxtography Detector Using Social Media Comments. In IEEE International Conference on Big Data, Big Data 2018, Seattle, WA, USA, December 10-13, 2018, Naoki Abe, Huan Liu, Calton Pu, Xiaohua Hu, Nesreen K. Ahmed, Mu Qiao, Yang Song, Donald Kossmann, Bing Liu, Kisung Lee, Jiliang Tang, Jingrui He, and Jeffrey S. Saltz (Eds.). IEEE, 891–900. https://doi.org/10.1109/BigData.2018.8622344
[25] SIGIR 2023, MR2: A Benchmark for Multimodal Retrieval-Augmented Rumor Detection in Social Media
[26] Dan Saattrup Nielsen and Ryan McConville. 2022. MuMiN: A Large-Scale Multilingual Multimodal Fact-Checked Misinformation Social Network Dataset. In SIGIR ’22: The 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, Madrid, Spain, July 11 - 15, 2022, Enrique Amigó, Pablo Castells, Julio Gonzalo, Ben Carterette, J. Shane Culpepper, and Gabriella Kazai (Eds.). ACM, 3141–3153. https://doi.org/10.1145/3477495.3531744
[27] Zhiwei Jin, Juan Cao, Han Guo, Yongdong Zhang, and Jiebo Luo. 2017. Multimodal fusion with recurrent neural networks for rumor detection on microblogs. In Proceedings of the 25th ACM international conference on Multimedia. 795–816
[28] Rami Aly, Zhijiang Guo, Michael Sejr Schlichtkrull, James Thorne, Andreas Vlachos, Christos Christodoulopoulos, Oana Cocarascu, and Arpit Mittal. 2021. FEVEROUS: Fact Extraction and VERification Over Unstructured and Structured information. In Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks 1, NeurIPS Datasets and Benchmarks 2021, December 2021, virtual, Joaquin Vanschoren and Sai-Kit Yeung (Eds.). https://datasets-benchmarks-proceedings.neurips.cc/paper/2021/hash/
68d30a9594728bc39aa24be94b319d21-Abstract-round1.html
[29] Xuming Hu, Zhijiang Guo, Guanyu Wu, Aiwei Liu, Lijie Wen, and Philip S. Yu. 2022. CHEF: A Pilot Chinese Dataset for Evidence-Based Fact-Checking. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL 2022, Seattle, WA, United States, July 10-15, 2022, Marine Carpuat, Marie-Catherine de Marneffe, and Iván Vladimir Meza Ruíz (Eds.). Association for Computational Linguistics, 3362–3376. https://doi.org/10.18653/v1/2022.naacl-main.246
[30] Identifying Checkworthy CURE Claims on Twiter, www 2023;
[31] Amelie Wührl and Roman Klinger. 2021. Claim Detection in Biomedical Twitter Posts. In Proceedings of the 20th Workshop on Biomedical Language Processing. 131–142.
[32] MultiFC: A Real-World Multi-Domain Dataset for Evidence-Based Fact Checking of Claims, EMNLP 2019,
[33] Cody Buntain and Jennifer Golbeck. 2017. Automatically identifying fake news in popular twitter threads.In 2017 IEEE International Conference on Smart Cloud (SmartCloud), pages 208–215. IEEE.
[34] Parth Patwa, Shivam Sharma, Srinivas Pykl, Vineeth Guptha, Gitanjali Kumari, Md Shad Akhtar, Asif Ekbal, Amitava Das, and Tanmoy Chakraborty. 2021. Fighting an infodemic: Covid-19 fake news dataset. In International Workshop on Combating Online Hostile Posts in Regional Languages during Emergency Situation, pages 21–29. Springer.
[35] Kadhim Hayawi, Sakib Shahriar, Mohamed Adel Serhani, Ikbal Taleb, and Sujith Samuel Mathew. 2022. Anti-vax: a novel twitter dataset for covid-19 vaccine misinformation detection. Public health, 203:23–30.
[36] COSMOS: Catching Out-of-Context Misinformation with Self-Supervised Learning,
[37] Yichuan Li, Bohan Jiang, Kai Shu, and Huan Liu. 2020. MM-COVID: A Multilingual and Multidimensional Data Repository for CombatingCOVID-19 Fake New. arXiv e-prints (2020), arXiv–2011.
[38] X. Zhou, A. Mulay, E. Ferrara, and R. Zafarani, “Recovery: A multimodal repository for covid-19 news credibility research,” 2020.
[39] Social Network Analysis and Mining (2021), CHECKED: Chinese COVID‑19 fake news dataset;
[40] Erxue Min, Yu Rong, Yatao Bian, Tingyang Xu, Peilin Zhao, Junzhou Huang, and Sophia Ananiadou. 2022. Divide-and-Conquer: Post-User Interaction Network for Fake News Detection on Social Media. In Proceedings of the ACM Web Conference 2022 (Virtual Event, Lyon, France) (WWW ’22). Association for Computing Machinery, New York, NY, USA, 1148–1158. https://doi.org/10.1145/3485447.3512163
博主目前暂时收集和整理了这些数据集,“其他”类别中的数据集还没有进行详细分析,同时还有许多不完善的地方。
如有补充或者错漏敬请大家在评论区指正。
会持续更新,对数据集进行详细介绍,提供相应的数据集下载地址。