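This experiment uses the OHSUMED test collection, which was also used in the filtering track of the 9th Text REtrieval Conference (TREC-9). OHSUMED was built by William Hersh and his colleagues from the medical bibliographic database MEDLINE. It covers the titles and/or abstracts of 270 medical journals over the five years from 1987 to 1991, for a total of 348,566 documents. Each OHSUMED document consists of eight fields, with the following meanings: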
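.I OHSUMED sequence number of the article, from 1 to 348566
.U MEDLINE identifier
.S Source of the article
.M MeSH index terms
.T Title of the article
.P Publication type of the article
.W Abstract of the article
.A Authors of the article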
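The OHSUMED authors also constructed 106 queries for the collection. The queries are derived from search requests that physicians submitted in the course of treating patients, and each has two parts: a brief description of the patient's condition and a description of the information needed. Each OHSUMED query consists of the following three fields: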
.I Sequence number of the query, from 1 to 106
.B Patient information
.W Information request
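Based on these document and query collections, OHSUMED provides relevance labels for 16,140 query-document pairs. Each pair is labeled as definitely relevant, partially relevant, or not relevant. The final annotations contain 2,557 definitely relevant, 2,932 partially relevant, and 12,498 not relevant judgments. These counts sum to more than 16,140 because a document may be labeled at more than one level for the same query; in the experiments in this section, the highest level assigned is taken as the final label.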
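The rule of keeping the highest label can be sketched in a few lines of Python. This is a minimal illustration, not the code used in the experiments: the function name `merge_labels`, the `(query_id, doc_id, label)` tuple layout, and the use of the single-letter codes d, p, and n for the three levels are assumptions made for the example.

```python
from typing import Dict, Iterable, Tuple

# Assumed ordering of the three relevance levels, lowest to highest.
LEVEL = {"n": 0, "p": 1, "d": 2}


def merge_labels(
    judgments: Iterable[Tuple[int, int, str]]
) -> Dict[Tuple[int, int], str]:
    """Keep the highest label seen for each (query_id, doc_id) pair."""
    best: Dict[Tuple[int, int], str] = {}
    for query_id, doc_id, label in judgments:
        key = (query_id, doc_id)
        # Replace the stored label only if the new one ranks higher.
        if key not in best or LEVEL[label] > LEVEL[best[key]]:
            best[key] = label
    return best


if __name__ == "__main__":
    # Hypothetical judgments: pair (1, 10) was labeled twice.
    example = [(1, 10, "n"), (1, 10, "d"), (1, 11, "p")]
    print(merge_labels(example))
```

After merging, each query-document pair carries exactly one label, so the number of entries equals the number of distinct pairs rather than the number of raw judgments.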
Here are the files, their uncompressed size, and a description of their content:
1) ohsumed.87 (60,303,307) — Contains the MEDLINE documents for the year 1987. The format for each of the MEDLINE document files follows the conventions of the SMART system, with each field defined as below (NLM designator in parentheses):
.I sequential identifier
.U MEDLINE identifier (UI)
.M Human-assigned MeSH terms (MH)
.T Title (TI)
.P Publication type (PT)
.W Abstract (AB)
.A Author (AU)
.S Source (SO)
(Note: Some references have their abstracts truncated at 250 words, while some have no abstracts at all.)
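The field layout above lends itself to a small line-oriented reader. The following Python sketch is one possible way to split a document file into records. It assumes that each field marker starts a line, that `.I` carries its value on the same line, and that the other fields may continue over the following lines; the function name `parse_records` and the sample values are hypothetical, and the exact on-disk layout should be checked against the real files.

```python
from typing import Dict, Iterable, Iterator

# The eight document field markers listed in the readme.
FIELD_MARKERS = {".I", ".U", ".S", ".M", ".T", ".P", ".W", ".A"}


def parse_records(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Yield one dict per document, keyed by field marker."""
    record: Dict[str, str] = {}
    current = None
    for raw in lines:
        line = raw.rstrip("\n")
        marker = line[:2]
        is_marker = marker in FIELD_MARKERS and (len(line) == 2 or line[2] == " ")
        if is_marker:
            # A new .I marker closes the previous record (assumption).
            if marker == ".I" and record:
                yield record
                record = {}
            current = marker
            record[current] = line[2:].strip()
        elif current is not None and line.strip():
            # Continuation line: append to the field being read.
            record[current] = (record[current] + " " + line.strip()).strip()
    if record:
        yield record


if __name__ == "__main__":
    # Made-up sample in the assumed layout, not real OHSUMED data.
    sample = [".I 1\n", ".U\n", "00000001\n", ".T\n", "An example title\n"]
    for rec in parse_records(sample):
        print(rec)
```

Because documents with truncated or missing abstracts exist, callers should treat every field other than `.I` as optional, for example with `rec.get(".W", "")`.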
2) ohsumed.88 (78,585,929) — Contains the MEDLINE documents for the year 1988, formatted as above.
3) ohsumed.89 (84,719,077) — Contains the MEDLINE documents for the year 1989, formatted as above.
4) ohsumed.90 (86,754,890) — Contains the MEDLINE documents for the year 1990, formatted as above.
5) ohsumed.91 (89,761,122) — Contains the MEDLINE documents for the year 1991, formatted as above.
6) queries (11,591) — Contains the 106 queries in test set, with patient and topic information, in the format:
.I Sequential identifier
.B Patient information
.W Information request
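Since each query has exactly two text parts, it can be convenient to load the queries file into small typed objects. The sketch below assumes that `.I` is followed by the query number on the same line and that `.B` and `.W` each stand alone on a line with their text on the lines that follow. The `Query` class, its attribute names, and `parse_queries` are hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Query:
    query_id: int
    patient_info: str = ""         # text of the .B field
    information_request: str = ""  # text of the .W field


# Maps a field marker to the Query attribute it fills.
_FIELD_FOR_MARKER = {".B": "patient_info", ".W": "information_request"}


def parse_queries(lines: Iterable[str]) -> List[Query]:
    """Read queries in the assumed .I/.B/.W layout."""
    queries: List[Query] = []
    field = None
    for raw in lines:
        line = raw.strip()
        if line.startswith(".I "):
            queries.append(Query(query_id=int(line[3:])))
            field = None
        elif line in _FIELD_FOR_MARKER:
            field = _FIELD_FOR_MARKER[line]
        elif field is not None and queries and line:
            # Append continuation text to the current field.
            previous = getattr(queries[-1], field)
            setattr(queries[-1], field, (previous + " " + line).strip())
    return queries


if __name__ == "__main__":
    # Invented sample text, not an actual OHSUMED query.
    sample = [".I 1\n", ".B\n", "Example patient description\n",
              ".W\n", "Example information request\n"]
    print(parse_queries(sample))
```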
7) drel.ui (26,919) — Contains the query-document pairs rated as definitely relevant, with documents listed by MEDLINE UI, in the format:
8) drel.i (21,709) — Contains the query-document pairs rated as definitely relevant, with documents listed by sequential number (from the .I field), in the format:
9) pdrel.ui (57,831) — Contains the query-doc pairs rated as definitely or possibly relevant, with documents listed by MEDLINE UI, in the format:
10) pdrel.i (46,664) — Contains the query-doc pairs rated as definitely or possibly relevant, with documents listed by sequential number (from the .I field), in the format:
11) judged (368,366) — Contains a list of all documents retrieved by any of the five original searchers or by SMART, sorted first by query number and then by document number, along with their relevance judgments. Each relevance judgment is d (definitely relevant), p (possibly relevant), or n (not relevant). The relevance1 judgment is the original judgment made on the documents retrieved by the original searchers. The relevance2 judgment is a second judgment made to assess the interobserver reliability of the relevance1 judgments. The relevance3 judgment is either the judgment made on documents retrieved by SMART but not by the original searchers, or another judgment on an originally retrieved document made to assess interobserver reliability.
12) ui (3,137,094) — Contains the MEDLINE UIs for all 348,566 documents in the test database, listed one per line.
13) readme — This file.
http://ir.ohsu.edu/ohsumed/ohsumed.html