ACE 2005 Multilingual Training Corpus
Item Name: | ACE 2005 Multilingual Training Corpus |
Author(s): | Christopher Walker, Stephanie Strassel, Julie Medero, Kazuaki Maeda |
LDC Catalog No.: | LDC2006T06 |
ISBN: | 1-58563-376-3 |
ISLRN: | 458-031-085-383-4 |
Release Date: | February 15, 2006 |
Member Year(s): | 2006 |
DCMI Type(s): | Text |
Data Source(s): | weblogs, broadcast news, newsgroups, broadcast conversation |
Project(s): | ACE |
Application(s): | automatic content extraction |
Language(s): | Mandarin Chinese, Standard Arabic, English |
Language ID(s): | cmn, arb, eng |
License(s): | LDC User Agreement for Non-Members |
Online Documentation: | LDC2006T06 Documents |
Licensing Instructions: | Subscription & Standard Members, and Non-Members |
Citation: | Walker, Christopher, et al. ACE 2005 Multilingual Training Corpus LDC2006T06. Web Download. Philadelphia: Linguistic Data Consortium, 2006. |
Introduction
ACE 2005 Multilingual Training Corpus contains the complete set of English, Arabic and Chinese training data for the 2005 Automatic Content Extraction (ACE) technology evaluation. The corpus consists of data of various types annotated for entities, relations and events by the Linguistic Data Consortium (LDC) with support from the ACE Program and additional assistance from LDC.
The objective of the ACE program was to develop automatic content extraction technology to support automatic processing of human language in text form.
In November 2005, sites were evaluated on system performance in five primary areas: the recognition of entities, values, temporal expressions, relations, and events. Entity, relation and event mention detection were also offered as diagnostic tasks. All tasks except the event tasks were performed for three languages: English, Chinese and Arabic. Event tasks were evaluated in English and Chinese only. This release comprises the official training data for these evaluation tasks.
For more information about linguistic resources for the ACE Program, including annotation guidelines, task definitions and other documentation, see LDC's ACE website.
Data
Below is information about the amount of data in this release and its annotation status.
---------------------
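Explanation of 1P, DUAL, ADJ and NORM (source: https://blog.csdn.net/taolusi/article/details/80812597, author: taolusi)
The adj, fp1, fp2 and timex2norm folders correspond to different stages of the annotation process. For every task, the ACE data were annotated by two annotators working independently. The first annotation pass is called 1P, and the independent dual first pass is called DUAL. In both 1P and DUAL, a single annotator completes all tasks for a file. Files are assigned through the automated Annotation Workflow System (AWS), and assignment is double-blind. Note: 1P and DUAL are stored in the folders 'fp1' and 'fp2', that is, 1P corresponds to fp1 and DUAL to fp2. Discrepancies between the 1P and DUAL versions of each file are adjudicated by a senior annotator or team leader, which yields a high-quality gold-standard file. These adjudicated gold-standard files are called ADJ (the adj folder mentioned above). After adjudication, TIMEX2 values are normalized to produce NORM. All datasets in this corpus have received NORM annotation.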
---------------------
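The stage-to-folder naming in the note above can be captured in a small lookup. The following is a minimal Python sketch, not part of the corpus tooling: the folder names fp1, fp2, adj and timex2norm come from the note, while the helper names `stage_dir` and `list_stage_files`, and the assumption that the stage folders sit directly under one per-source directory, are hypothetical and should be checked against the actual release layout.

```python
from pathlib import Path

# Annotation stages as named in the tables, mapped to the folder names
# given in the note above (1P -> fp1, DUAL -> fp2, ADJ -> adj, NORM -> timex2norm).
STAGE_TO_FOLDER = {
    "1P": "fp1",
    "DUAL": "fp2",
    "ADJ": "adj",
    "NORM": "timex2norm",
}


def stage_dir(source_dir, stage):
    """Return the folder for one annotation stage.

    ASSUMPTION: stage folders are direct children of `source_dir`
    (a hypothetical layout; verify against your copy of the corpus).
    """
    return Path(source_dir) / STAGE_TO_FOLDER[stage.upper()]


def list_stage_files(source_dir, stage, pattern="*"):
    """List the files of one stage, or an empty list if the folder is absent."""
    folder = stage_dir(source_dir, stage)
    if not folder.is_dir():
        return []
    return sorted(p for p in folder.glob(pattern) if p.is_file())
```

For example, `list_stage_files("path/to/English/nw", "ADJ")` would list the adjudicated files under a hypothetical `path/to/English/nw/adj` folder.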
English
Source | 1P words | DUAL words | ADJ words | NORM words | 1P files | DUAL files | ADJ files | NORM files
NW | 60658 | 57807 | 33459 | 48399 | 128 | 124 | 81 | 106
BN | 59239 | 58144 | 52444 | 55967 | 239 | 234 | 217 | 226
BC | 46612 | 46110 | 33874 | 40415 | 68 | 67 | 52 | 60
WL | 45210 | 43648 | 35529 | 37897 | 127 | 122 | 114 | 119
UN | 45161 | 44473 | 26371 | 37366 | 58 | 57 | 37 | 49
CTS | 47003 | 47003 | 34868 | 39845 | 46 | 46 | 34 | 39
Total | 303833 | 297185 | 216545 | 259889 | 666 | 650 | 535 | 599
Chinese
Note: Chinese data are expressed in terms of characters. We assume a correspondence of roughly 1.5 characters/word.
Source | 1P chars | DUAL chars | ADJ chars | 1P files | DUAL files | ADJ files
NW (newswire) | 127319 | 124175 | 121797 | 248 | 242 | 238
BN (broadcast news) | 134963 | 133696 | 120513 | 332 | 328 | 298
WL (weblog) | 71839 | 68063 | 65681 | 107 | 101 | 97
Total | 334121 | 325834 | 307991 | 687 | 671 | 633
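To compare the Chinese figures with the word counts given for English and Arabic, the stated ratio of roughly 1.5 characters per word can be applied directly. The sketch below does only that arithmetic; the function names are illustrative, and the results are rough estimates under the stated ratio, not counts published with the corpus.

```python
# Ratio stated in the note on the Chinese table (approximate).
CHARS_PER_WORD = 1.5

# Character counts copied from the Chinese table.
CHINESE_CHARS = {
    "NW": {"1P": 127319, "DUAL": 124175, "ADJ": 121797},
    "BN": {"1P": 134963, "DUAL": 133696, "ADJ": 120513},
    "WL": {"1P": 71839, "DUAL": 68063, "ADJ": 65681},
}


def approx_words(chars, chars_per_word=CHARS_PER_WORD):
    """Estimate a word count from a character count, rounded to an integer."""
    return round(chars / chars_per_word)


def approx_word_table(char_table):
    """Apply the estimate to every cell of a {source: {stage: chars}} table."""
    return {
        source: {stage: approx_words(n) for stage, n in stages.items()}
        for source, stages in char_table.items()
    }


if __name__ == "__main__":
    for source, stages in approx_word_table(CHINESE_CHARS).items():
        print(source, stages)
```

Because the ratio is itself an approximation, these estimates are best used for order-of-magnitude comparisons across languages.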
Arabic
Source | 1P words | DUAL words | ADJ words | 1P files | DUAL files | ADJ files
NW | 61287 | 56158 | 53026 | 239 | 226 | 221
BN | 29259 | 27165 | 26907 | 134 | 128 | 127
WL | 21687 | 20181 | 20181 | 60 | 55 | 55
Total | 112233 | 103504 | 100114 | 433 | 409 | 403
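When copying these figures into other work, it can help to check that each Total row equals the sum of its column. The following Python sketch transcribes the three tables and reports any column where the two differ; the function name `find_total_mismatches` is illustrative. As transcribed here, the English 1P word column and the Chinese DUAL character column do not sum to their stated totals, and this page alone does not show whether a row value or the total is the one in error.

```python
# Per-source rows and stated totals, transcribed from the tables above.
TABLES = {
    "English words": {
        "columns": ["1P", "DUAL", "ADJ", "NORM"],
        "rows": {
            "NW": [60658, 57807, 33459, 48399],
            "BN": [59239, 58144, 52444, 55967],
            "BC": [46612, 46110, 33874, 40415],
            "WL": [45210, 43648, 35529, 37897],
            "UN": [45161, 44473, 26371, 37366],
            "CTS": [47003, 47003, 34868, 39845],
        },
        "total": [303833, 297185, 216545, 259889],
    },
    "English files": {
        "columns": ["1P", "DUAL", "ADJ", "NORM"],
        "rows": {
            "NW": [128, 124, 81, 106],
            "BN": [239, 234, 217, 226],
            "BC": [68, 67, 52, 60],
            "WL": [127, 122, 114, 119],
            "UN": [58, 57, 37, 49],
            "CTS": [46, 46, 34, 39],
        },
        "total": [666, 650, 535, 599],
    },
    "Chinese chars": {
        "columns": ["1P", "DUAL", "ADJ"],
        "rows": {
            "NW": [127319, 124175, 121797],
            "BN": [134963, 133696, 120513],
            "WL": [71839, 68063, 65681],
        },
        "total": [334121, 325834, 307991],
    },
    "Chinese files": {
        "columns": ["1P", "DUAL", "ADJ"],
        "rows": {"NW": [248, 242, 238], "BN": [332, 328, 298], "WL": [107, 101, 97]},
        "total": [687, 671, 633],
    },
    "Arabic words": {
        "columns": ["1P", "DUAL", "ADJ"],
        "rows": {
            "NW": [61287, 56158, 53026],
            "BN": [29259, 27165, 26907],
            "WL": [21687, 20181, 20181],
        },
        "total": [112233, 103504, 100114],
    },
    "Arabic files": {
        "columns": ["1P", "DUAL", "ADJ"],
        "rows": {"NW": [239, 226, 221], "BN": [134, 128, 127], "WL": [60, 55, 55]},
        "total": [433, 409, 403],
    },
}


def find_total_mismatches(tables):
    """Return (table, column, computed_sum, stated_total) for each inconsistent column."""
    mismatches = []
    for name, table in tables.items():
        # zip(*rows) turns the per-source rows into per-stage columns.
        sums = [sum(col) for col in zip(*table["rows"].values())]
        for col, computed, stated in zip(table["columns"], sums, table["total"]):
            if computed != stated:
                mismatches.append((name, col, computed, stated))
    return mismatches


if __name__ == "__main__":
    for name, col, computed, stated in find_total_mismatches(TABLES):
        print(f"{name} / {col}: rows sum to {computed}, table states {stated}")
```

All file-count columns and the remaining word and character columns are internally consistent under this check.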
For examples of the data in this publication, please review the following samples: