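Fact checking verifies a claim against a large body of evidence. The claim is the input, and the relevant document passages that verify the claim are the output.
FEVER (Fact Extraction and VERification) is a public dataset for fact extraction and verification against textual sources. It consists of 185,445 claims that were generated by altering sentences extracted from Wikipedia and then verified without knowledge of the sentences they were derived from. Claims are classified as Supported, Refuted, or NotEnoughInfo, and the data is in JSON format.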
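Climate-FEVER is a dataset that adopts the FEVER methodology and contains 1,535 real-world claims about climate change. Each claim is accompanied by five manually annotated evidence sentences retrieved from Wikipedia that support, refute, or do not give enough information to verify the claim. The whole dataset contains 7,675 claim-evidence pairs. It also includes challenging claims that touch on multiple aspects, as well as disputed claims for which both supporting and refuting evidence exist.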
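The pair count follows directly from the two figures quoted above. A minimal check, using only the numbers from the text (the variable names are illustrative):

```python
# Figures quoted in the Climate-FEVER description above.
num_claims = 1535
evidence_per_claim = 5

# Every claim is paired with each of its evidence sentences.
num_pairs = num_claims * evidence_per_claim
print(num_pairs)
```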
In SciFact, the claims are split into claims_train.jsonl, claims_dev.jsonl, and claims_test.jsonl, with one claim per line. The evidence document collection is corpus.jsonl, with one evidence document per line.
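Because each file stores one JSON object per line, it can be read line by line. The sketch below assumes nothing about the real schema: the field names (`id`, `claim`, `doc_id`, `title`, `abstract`) and the sample records are hypothetical, and in-memory strings stand in for the actual files.

```python
import io
import json

def read_jsonl(handle):
    """Parse one JSON object per non-empty line of a file-like object."""
    return [json.loads(line) for line in handle if line.strip()]

# In-memory stand-ins for claims_train.jsonl and corpus.jsonl.
# The field names and values are hypothetical, for illustration only.
claims_file = io.StringIO(
    '{"id": 1, "claim": "Example claim A."}\n'
    '{"id": 2, "claim": "Example claim B."}\n'
)
corpus_file = io.StringIO(
    '{"doc_id": 10, "title": "Example document", "abstract": ["Sentence one."]}\n'
)

claims = read_jsonl(claims_file)
corpus = read_jsonl(corpus_file)
print(len(claims), "claims,", len(corpus), "documents")
```

With the real data, the same function can be given a handle from `open("claims_train.jsonl", encoding="utf-8")`.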
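Citations are a key signal of relatedness between scientific articles. In this task, the model tries to retrieve the cited papers (output) for a given query paper title (input).
SciDocs is a benchmark of seven document-level tasks, ranging from citation prediction to document classification and recommendation.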
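Duplicate question retrieval is the task of identifying duplicate questions in community question answering forums. A given query is the input, and the duplicate questions are the output.
CQADupStack contains threads from 12 StackExchange subforums, annotated with duplicate question information. Predefined training and test splits are provided for both retrieval and classification experiments, to ensure maximum comparability between different studies that use the collection. It also comes with a script for manipulating the data in various ways.
The Quora dataset release covers a variety of Quora-related problems and is made available to researchers in many fields. It contains 400,000 lines of potential duplicate question pairs.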
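Argument retrieval is the task of ranking argumentative texts (output) in a collection of focused arguments by their relevance to a textual query (input) on different topics, that is, topic-oriented relevance ranking.
Personal take: given a controversial (debatable) topic, find related evidence, topics, and arguments.
The ArguAna Counterargs Corpus is an English dataset for learning to retrieve the best counterargument to an argument. It contains 6,753 pairs of arguments and best counterarguments from the debate website idebate.org, together with various experiment files that contain up to millions of candidate pairs.
Touché-2020: given a question on a controversial topic, retrieve relevant arguments from a crawl of online debate portals.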
1
Is climate change real?
You read an opinion piece on how climate change is a hoax and disagree. Now you are looking for arguments supporting the claim that climate change is in fact real.
Relevant arguments will support the given stance that climate change is real or attack a hoax side's argument.
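Given a news headline, we retrieve relevant news articles that provide important context or background information.
The TREC News Track features modern search tasks in the news domain. The TREC Washington Post Corpus contains 728,626 news articles and blog posts from January 2012 through December 2020.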
- title
- byline
- date of publication
- kicker (a section header)
- article text broken into paragraphs
- links to embedded images and multimedia (for 2012-2017 documents)
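Open-domain question answering is the task of retrieving the correct answer to a question without a predefined location for the answer. In the open-domain setting, the model must retrieve over an entire knowledge source (e.g., Wikipedia). The question is the input, and passages containing the answer are the output.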
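Natural Questions (NQ) is a QA dataset with 307,373 training examples, 7,830 development examples, and 7,842 test examples. Each example consists of a google.com query and a corresponding Wikipedia page. Each Wikipedia page has a passage (the long answer) annotated as answering the question, and one or more short spans from that passage annotated as containing the actual answer. Both the long and short answer annotations can be empty. If both are empty, there is no answer on the page. If the long answer annotation is non-empty but the short answer annotation is empty, the annotated passage answers the question but no explicit short answer could be found. Finally, 1% of the documents have a passage annotated with a short answer of "yes" or "no" instead of a list of short spans.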
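The annotation cases described above can be summarised as a small decision function. This is only a sketch of that logic: the argument names and the returned labels are hypothetical and do not reflect the dataset's actual schema.

```python
def answer_type(long_answer, short_answers, yes_no=None):
    """Classify one annotation following the cases described in the text.

    long_answer:   annotated passage text, or "" if empty (hypothetical name)
    short_answers: list of short answer spans, possibly empty (hypothetical name)
    yes_no:        "yes", "no", or None (hypothetical name)
    """
    if yes_no in ("yes", "no"):
        return "yes_no"            # short answer is yes/no instead of spans
    if not long_answer and not short_answers:
        return "no_answer"         # nothing on the page answers the question
    if long_answer and not short_answers:
        return "long_only"         # passage answers it, no explicit short answer
    if long_answer and short_answers:
        return "long_and_short"
    raise ValueError("short answers without a long answer are not covered here")

print(answer_type("", []))
print(answer_type("Some passage.", []))
print(answer_type("Some passage.", ["1962"]))
print(answer_type("Some passage.", [], yes_no="yes"))
```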
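HotpotQA is a QA dataset collected from English Wikipedia. It contains about 113K crowd-sourced questions that require the introductory paragraphs of two Wikipedia articles to answer. Each question comes with the two gold paragraphs, plus a list of sentences in those paragraphs that crowdworkers identified as supporting facts necessary to answer the question.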
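FiQA-2018 covers question answering over financial data. It has two tasks:
- Aspect-based financial sentiment analysis: given an English text instance from the financial domain (a microblog message, a news statement, or a headline), detect the target aspects mentioned in the text (from a predefined list of aspect classes) and predict a sentiment score for each mentioned target.
- Opinion-based QA over financial data: given a corpus of structured and unstructured text documents from different English financial data sources (microblogs, reports, news), build a question answering system that answers natural language questions.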
"question": "Why are big companies like Apple or Google not included in the Dow Jones Industrial Average (DJIA) index?",
"answers": {
"290156": { "text": "That is a pretty exclusive club and for the most part they are not interested in highly volatile companies like Apple and Google. Sure, IBM is part of the DJIA, but that is about as stalwart as you can get these days. The typical profile for a DJIA stock would be one that pays fairly predictable dividends, has been around since money was invented, and are not going anywhere unless the apocalypse really happens this year. In summary, DJIA is the boring reliable company index.", "timestamp": "Sep 11 '12 at 0:53" }
}
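A record of this shape can be flattened into (question, answer id, answer text) triples for retrieval experiments. The sketch below assumes the record is wrapped in one JSON object with straight quotes; the question and answer strings are shortened here for readability.

```python
import json

# A shortened copy of the example record above, wrapped in one JSON object.
raw = """
{
  "question": "Why are big companies like Apple or Google not included in the DJIA index?",
  "answers": {
    "290156": {
      "text": "That is a pretty exclusive club ...",
      "timestamp": "Sep 11 '12 at 0:53"
    }
  }
}
"""

record = json.loads(raw)

# One (question, answer_id, answer_text) triple per candidate answer.
pairs = [
    (record["question"], answer_id, answer["text"])
    for answer_id, answer in record["answers"].items()
]
print(pairs[0][1], pairs[0][2])
```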
Link: FiQA - 2018 (google.com)
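Twitter is a microblogging site where people post real-time messages about topics of their choice and discuss current issues. The news headline is the input, and relevant tweets are retrieved as the output.
Signal-1M was released by Signal AI to facilitate research on news articles. It contains 1 million articles, mainly in English, but also includes non-English and multilingual articles. Sources include major outlets such as Reuters, in addition to local news sources and blogs.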
- id: a unique identifier for the article
- title: the title of the article
- content: the textual content of the article (may occasionally contain HTML and JavaScript content)
- source: the name of the article source (e.g. Reuters)
- published: the publication date of the article
- media-type: either “News” or “Blog”
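The fields listed above are enough to filter articles by type. The following sketch assumes one JSON object per article; the sample values, including the date format, are made up for illustration.

```python
import json

# Two made-up articles using the field names listed above.
lines = [
    '{"id": "a1", "title": "Example headline", "content": "Body text.", '
    '"source": "Reuters", "published": "2015-09-01", "media-type": "News"}',
    '{"id": "b2", "title": "Example post", "content": "Post text.", '
    '"source": "Some Blog", "published": "2015-09-02", "media-type": "Blog"}',
]

articles = [json.loads(line) for line in lines]

# Keep only news articles; "media-type" is either "News" or "Blog".
news = [article for article in articles if article["media-type"] == "News"]
print([article["title"] for article in news])
```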
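Biomedical information retrieval is the task of searching for relevant scientific documents, such as research papers or blogs, for a given scientific query in the biomedical domain. The scientific query is the input, and biomedical documents are retrieved as the output.
NFCorpus is a full-text English retrieval dataset for medical information retrieval. It contains 3,244 natural language queries, with 169,756 automatically extracted relevance judgments for 9,964 medical documents (written in complex, terminology-heavy language).
BioASQ is a question answering dataset. Instances in the BioASQ dataset consist of a question (Q), human-annotated answers (A), and relevant contexts (C), also called snippets.
TREC-COVID ran in 5 rounds. Each round had a version of the CORD-19 dataset and a set of information need statements (topics). After a round's submission deadline, NIST used the submitted runs to produce a set of documents for each topic, which human annotators then judged for relevance to the topic.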
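Entity retrieval involves retrieving the unique Wikipedia page for an entity mentioned in the query (that is, retrieving the page that describes the entity). This is important for tasks that involve entity linking. The entity-bearing query is the input, and the entity abstract and title are retrieved as the output.
DBpedia is a project that extracts structured content from the information created in the Wikipedia project. DBpedia allows users to semantically query the relationships and properties of Wikipedia resources, including links to other related datasets.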
Dataset | Website (Link) |
---|---|
MSMARCO | https://microsoft.github.io/msmarco/ |
TREC-COVID | https://ir.nist.gov/covidSubmit/index.html |
NFCorpus | https://www.cl.uni-heidelberg.de/statnlpgroup/nfcorpus/ |
BioASQ | http://bioasq.org |
NQ | https://ai.google.com/research/NaturalQuestions |
HotpotQA | https://hotpotqa.github.io |
FiQA-2018 | https://sites.google.com/view/fiqa/ |
Signal-1M (RT) | https://research.signal-ai.com/datasets/signal1m-tweetir.html |
TREC-NEWS | https://trec.nist.gov/data/news2019.html |
ArguAna | http://argumentation.bplaced.net/arguana/data |
Touché-2020 | https://webis.de/events/touche-20/shared-task-1.html |
CQADupStack | http://nlp.cis.unimelb.edu.au/resources/cqadupstack/ |
Quora | https://www.quora.com/q/quoradata/First-Quora-Dataset-Release-Question-Pairs |
DBPedia-Entity | https://github.com/iai-group/DBpedia-Entity/ |
SCIDOCS | https://allenai.org/data/scidocs |
FEVER | http://fever.ai |
Climate-FEVER | http://climatefever.ai |
SciFact | https://github.com/allenai/scifact |
Table 6: Original dataset website (link) for all datasets present in BEIR.
Dataset | Query | Relevant-Document |
---|---|---|
MSMARCO | what fruit is native to australia | Passiflora herbertiana. A rare passion fruit native to Australia. Fruits are green-skinned, white fleshed, with an unknown edible rating. Some sources list the fruit as edible, sweet and tasty, while others list the fruits as being bitter and inedible. assiflora herbertiana. A rare passion fruit native to Australia… |
TREC-COVID | what is the origin of COVID-19 | Origin of Novel Coronavirus (COVID-19): A Computational Biology Study using Artificial Intelligence Origin of the COVID-19 virus has been intensely debated in the community… |
BioASQ | What is the effect of HMGB2 loss on CTCF clustering | HMGB2 Loss upon Senescence Entry Disrupts Genomic Organization and Induces CTCF Clustering across Cell Types. Processes like cellular senescence are characterized by complex events giving rise to heterogeneous cell populations. However, the early molecular events driving this cascade remain elusive…. |
NFCorpus | Titanium Dioxide & Inflammatory Bowel Disease | Titanium Dioxide Nanoparticles in Food and Personal Care Products Titanium dioxide is a common additive in many food, personal care, and other consumer products used by people, which after use can enter the sewage system, and subsequently enter the environment as treated effluent discharged to surface waters or biosolids applied to agricultural land, or incinerated wastes… |
NQ | when did they stop cigarette advertising on television? | Tobacco advertising The first calls to restrict advertising came in 1962 from the Royal College of Physicians, who highlighted the health problems and recommended stricter laws… |
HotpotQA | Stockely Webster has paintings hanging in what home (that serves as the residence for the Mayor of New York)? | Stokely Webster Stokely Webster (1912 – 2001) was best known as an American impressionist painter who studied in Paris. His paintings can be found in the permanent collections of many museums, including the Metropolitan Museum of Art in New York, the National Museum… |
FiQA-2018 | What is the PEG ratio? How is the PEG ratio calculated? How is the PEG ratio useful for stock investing? | PEG is Price/Earnings to Growth. It is calculated as Price/Earnings/Annual EPS Growth. It represents how good a stock is to buy, factoring in growth of earnings, which P/E does not. Obviously when PEG is lower, a stock is more undervalued, which means that it is a better buy, and more likely… |
Signal-1M (RT) | Genvoya, a Gentler Anti-HIV Cocktail, Okayed by EU Regulators | All people with #HIV should get anti-retroviral drugs: @WHO, by @kkelland via @Reuters_Health #AIDS #TasP |
TREC-NEWS | Websites where children are prostituted are immune from prosecution. But why? | Senate launches bill to remove immunity for websites hosting illegal content, spurred by Backpage.com The legislation, along with a similar bill in the House, sets the stage for a battle between Congress and some of the Internet’s most powerful players, including Google and various free-speech advocates, who believe that Congress shouldn’t regulate Web content or try to force websites to police themselves more rigorously… |
ArguAna | Sexist advertising is subjective so would be too difficult to codify. Effective advertising appeals to the social, cultural, and personal values of consumers. Through the connection of values to products, services and ideas, advertising is able to accomplish its goal of adoption… | media modern culture television gender house would ban sexist advertising Although there is a claim that sexist advertising is to difficult to codify, such codes have and are being developed to guide the advertising industry. These standards speak to advertising which demeans the status of women, objectifies them, and plays upon stereotypes about women which harm women and society in general. Earlier the Council of Europe was mentioned, Denmark, Norway and Australia as specific examples of codes or standards for evaluating sexist advertising which have been developed. |
Touché-2020 | Should the government allow illegal immigrants to become citizens? | America should support blanket amnesty for illegal immigrants. Undocumented workers do not receive full Social Security benefits because they are not United States citizens " nor should they be until they seek citizenship legally. Illegal immigrants are legally obligated to pay taxes… |
CQADupStack | Command to display first few and last few lines of a file | Combing head and tail in a single call via pipe On a regular basis, I am piping the output of some program to either ‘head‘ or ‘tail‘. Now, suppose that I want to see the first AND last 10 lines of piped output, such that I could do something like ./lotsofoutput | headtail… |
Quora | How long does it take to methamphetamine out of your blood? | How long does it take the body to get rid of methamphetamine? |
DBPedia | Paul Auster novels | The New York Trilogy The New York Trilogy is a series of novels by Paul Auster. Originally published sequentially as City of Glass (1985), Ghosts (1986) and The Locked Room (1986), it has since been collected into a single volume. |
SCIDOCS | CFD Analysis of Convective Heat Transfer Coefficient on External Surfaces of Buildings | Application of CFD in building performance simulation for the outdoor environment: an overview This paper provides an overview of the application of CFD in building performance simulation for the outdoor environment, focused on four topics… |
FEVER | DodgeBall: A True Underdog Story is an American movie from 2004 | DodgeBall: A True Underdog Story DodgeBall: A True Underdog Story is a 2004 American sports comedy film written and directed by Rawson Marshall Thurber and starring Vince Vaughn and Ben Stiller. The film follows friends who enter a dodgeball tournament… |
Climate-FEVER | Sea level rise is now increasing faster than predicted due to unexpectedly rapid ice melting. | Sea level rise A sea level rise is an increase in the volume of water in the world ’s oceans, resulting in an increase in global mean sea level. The rise is usually attributed to global climate change by thermal expansion of the water in the oceans and by melting of Ice sheets and glaciers… |
Table 7: Examples of queries and relevant documents for all datasets included in BEIR. () and () are used to distinguish the title separately from the paragraph within a document in the table above. These tokens were not passed to the respective models.
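Ad-hoc information retrieval refers specifically to text-based retrieval in which the documents in the collection remain relatively static while new queries are continually submitted to the system.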
Dataset | Genre | #Query | #Collections |
---|---|---|---|
Robust04 | news | 250 | 0.5M |
ClueWeb09-Cat-B | web | 150 | 50M |
Gov2 | .gov pages | 150 | 25M |
MS MARCO (Document Ranking) | web pages | 367,013 | 3.2M |
MQ2007 | .gov pages | 1692 | 25M |
MQ2008 | .gov pages | 784 | 25M |
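- Robust04: contains 0.5 million documents and 250 queries in total, from the TREC Robust Track 2004.
- ClueWeb09: a large web collection containing 34 million documents, with 150 queries in total from the TREC Web Tracks 2009, 2010, and 2011.
- Gov2: a large web collection crawled from .gov pages, containing 25 million documents, with 150 queries in total from the TREC Terabyte Tracks 2004, 2005, and 2006.
- MS MARCO: provides a large number of informational, question-style queries from Bing's search logs. The passages are annotated by humans with relevant/non-relevant labels. There are 8,841,822 documents in total, with 808,731 queries for training, 6,980 for validation, and 48,598 for testing.
- Million Query TREC 2007: a LETOR benchmark dataset that uses the Gov2 web collection. It has 1,692 queries and 65,323 labeled documents.
- Million Query TREC 2008: another LETOR benchmark dataset that also uses the Gov2 web collection. It has 784 queries and 14,383 labeled documents.

Community question answering is the task of automatically searching for relevant answers among the many answers provided for a given question (answer selection), and of searching for related questions in order to reuse their existing answers (question retrieval).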
Dataset | Domain | #Question | #Answer |
---|---|---|---|
TRECQA | Open-domain | 1,229 | 53,417 |
WikiQA | Open-domain | 3,047 | 29,258 |
InsuranceQA | Insurance | 12,889 | 21,325 |
FiQA | Financial | 6,648 | 57,641 |
Yahoo! Answers | Open-domain | 50,112 | 253,440 |
SemEval-2015 Task 3 | Open-domain | 2,600 | 16,541 |
SemEval-2016 Task 3 | Open-domain | 4,879 | 36,198 |
SemEval-2017 Task 3 | Open-domain | 4,879 | 36,198 |
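- TRECQA: created by Wang et al. from TREC QA track 8-13 data. Candidate answers were automatically selected from each question's document pool using a combination of overlapping non-stop word counts and pattern matching.
- WikiQA: a publicly available set of question and sentence pairs, collected and annotated by Microsoft Research for research on open-domain question answering.
- InsuranceQA: a non-factoid QA dataset from the insurance domain. A question may have multiple correct answers, and questions are usually much shorter than answers. For each question in the development and test sets, there are 500 candidate answers.
- FiQA: a non-factoid QA dataset from the financial domain, released for the WWW 2018 challenge (described earlier).
- Yahoo! Answers: a website where people post questions and answers, all of which are public to any web user willing to browse or download them. Answers in this dataset are relatively longer than those in TRECQA and WikiQA.
- SemEval-2015 Task 3: contains two subtasks.
- SemEval-2016 Task 3: contains two subtasks, question-comment similarity and question-question similarity.
- SemEval-2017 Task 3: contains two subtasks, question similarity and relevance classification. Given a new question and a set of related questions from the collection, the question similarity task ranks the similar questions by their similarity to the original question. Relevance classification ranks answer posts within a question-answer thread by their relevance to the question.

Natural language inference is the task of deciding, given a premise, whether a hypothesis is true (entailment), false (contradiction), or undetermined (neutral).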
Dataset | # sentence pair |
---|---|
SNLI | 570K |
MultiNLI | 433K |
SciTail | 27K |
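- SNLI: short for Stanford Natural Language Inference. It has 570k human-annotated sentence pairs. The premises come from captions in the Flickr30k corpus, and the hypotheses were written manually.
- MultiNLI: short for Multi-Genre NLI. It has 433k sentence pairs, and its collection process and task details closely follow SNLI. The premises are collected from a broad range of American English genres, such as non-fiction, spoken language, less formal written genres (fiction, letters), and a specialized genre for 9/11.
- SciTail: an entailment dataset of 27k examples. Unlike SNLI and MultiNLI, it is not crowd-sourced but created from sentences that already exist "in the wild". Hypotheses are created from science questions and the corresponding answer candidates, while relevant web sentences from a large corpus are used as premises.

Paraphrase identification is the task of deciding whether two sentences have the same meaning.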
Dataset | # sentence pairs |
---|---|
MRPC | 5800 |
STS | 1750 |
SICK-R | 9840 |
SICK-E | 9840 |
Quora Question Pair | 404290 |
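- MRPC: short for Microsoft Research Paraphrase Corpus. It contains 5,800 sentence pairs extracted from online news sources, with annotations indicating whether each pair captures a paraphrase/semantic equivalence relationship.
- SentEval: contains semantic relatedness datasets, including the SICK and STS benchmark datasets. The SICK dataset has two subtasks, SICK-R and SICK-E. For STS and SICK-R, the task is to predict a relatedness score between two sentences. SICK-E has the same sentence pairs as SICK-R but is treated as a three-class classification problem (the classes are "entailment", "contradiction", and "neutral").
- Quora Question Pairs: a task released by Quora that aims to identify duplicate questions. It consists of more than 400,000 question pairs from Quora, each with a binary annotation indicating whether the two questions are paraphrases of each other (mentioned earlier).

Response retrieval/selection aims to rank or select a suitable response from a dialogue repository. Automatic conversation (AC) aims to create an automatic human-computer dialogue process for question answering, task completion, and social chat. In general, AC can be formulated either as an IR problem, where the goal is to rank or select an appropriate response from a dialogue repository, or as a generation problem, where the goal is to generate an appropriate response to an input utterance. Here, response retrieval refers to the IR-based approach to conversation.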
Dataset | Partition | #Context Response pair | #Candidate per Context | Positive:Negative | Avg #turns per context |
---|---|---|---|---|---|
UDC | train/validation/test | 1M/500k/500k | 2/10/10 | 1:1/1:9/1:9 | 10.13/10.11/10.11 |
Douban | train/validation/test | 1M/50k/10k | 2/2/10 | 1:1/1:1/1.18:8.82 | 6.69/6.75/6.45 |
MSDialog | train/validation/test | 173k/37k/35k | 10/10/10 | 1:9/1:9/1:9 | 5.0/4.9/4.4 |
EDC | train/validation/test | 1M/10k/10k | 2/2/10 | 1:1/1:1/1:9 | 5.51/5.48/5.64 |
Persona-Chat dataset | train/validation/test | 8939/1000/968 | 20/20/20 | 1:19/1:19/1:19 | 7.35/7.80/7.76 |
CMUDoG dataset | train/validation/test | 2881/196/537 | 20/20/20 | 1:19/1:19/1:19 | 12.55/12.37/12.36 |
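- Ubuntu Dialog Corpus (UDC): contains multi-turn conversations collected from the chat logs of the Ubuntu forum. The dataset has 1 million context-response pairs for training, 500k for validation, and 500k for testing. Positive responses are real human responses, while negative responses are randomly sampled. The ratio of positive to negative examples is 1:1 in training and 1:9 in validation and testing.
- Douban Conversation Corpus: an open-domain dataset built from Douban groups. It consists of 1 million context-response pairs for training, 50k for validation, and 10k for testing, with 2, 2, and 10 candidate responses per context respectively. The candidate responses in the test set were retrieved from Sina Weibo and labeled by human judges.
- MSDialog: a labeled dialogue dataset of question answering (QA) interactions between information seekers and answer providers from the online forum for Microsoft products (Microsoft Community). It contains more than 2,000 multi-turn information-seeking conversations with 10,000 utterances, annotated with user intent at the utterance level.
- E-commerce Dialogue Corpus (EDC): contains more than five types of conversations (such as commodity consultation, logistics and express delivery, and recommendation) based on more than 20 commodities. The ratio of positive to negative examples is 1:1 in training and validation and 1:9 in testing.
- Persona-Chat dataset
- CMUDoG dataset: a "document-grounded conversation" is defined as a conversation about the contents of a specified document. In this dataset, the specified documents are Wikipedia articles about popular movies. The dataset contains 4,112 conversations with an average of 21.43 turns per conversation. As a result, the dataset provides not only the relevant chat history when generating a response, but also a source of information that models can use.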
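The 1:9 positive-to-negative ratio described for the UDC validation and test sets can be reproduced with random negative sampling. This is a minimal sketch, not the procedure used by any of the dataset authors: the function name, its parameters, and the toy response pool are all hypothetical.

```python
import random

def build_candidates(context, positive, response_pool, num_negatives=9, seed=0):
    """Pair a context with its true response and randomly sampled negatives.

    Returns a shuffled list of (context, response, label) triples,
    where label is 1 for the true response and 0 for a sampled negative.
    """
    rng = random.Random(seed)
    # Exclude the true response so it cannot be drawn as a negative.
    negatives_pool = [r for r in response_pool if r != positive]
    negatives = rng.sample(negatives_pool, num_negatives)
    candidates = [(context, positive, 1)]
    candidates += [(context, response, 0) for response in negatives]
    rng.shuffle(candidates)
    return candidates

# Toy response pool standing in for a dialogue repository.
pool = [f"response {i}" for i in range(50)]
candidates = build_candidates("how do I install a package?", "response 3", pool)
print(len(candidates), sum(label for _, _, label in candidates))
```

Setting `num_negatives=1` gives the 1:1 ratio used for training, and `num_negatives=19` gives the 1:19 ratio listed for Persona-Chat and CMUDoG in the table.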