ACE 2005 Dataset (Introduction, Part 2)

The following content is from https://catalog.ldc.upenn.edu/LDC2006T06

ACE 2005 Multilingual Training Corpus

Item Name: ACE 2005 Multilingual Training Corpus
Author(s): Christopher Walker, Stephanie Strassel, Julie Medero, Kazuaki Maeda
LDC Catalog No.: LDC2006T06
ISBN: 1-58563-376-3
ISLRN: 458-031-085-383-4
Release Date: February 15, 2006
Member Year(s): 2006
DCMI Type(s): Text

Data Source(s): weblogs, broadcast news, newsgroups, broadcast conversation

Project(s): ACE

Application(s): automatic content extraction

Language(s): Mandarin Chinese, Standard Arabic, English

Language ID(s): cmn, arb, eng
License(s): LDC User Agreement for Non-Members
Online Documentation: LDC2006T06 Documents
Licensing Instructions: Subscription & Standard Members, and Non-Members
Citation: Walker, Christopher, et al. ACE 2005 Multilingual Training Corpus LDC2006T06. Web Download. Philadelphia: Linguistic Data Consortium, 2006.

Introduction

ACE 2005 Multilingual Training Corpus contains the complete set of English, Arabic and Chinese training data for the 2005 Automatic Content Extraction (ACE) technology evaluation. The corpus consists of data of various types annotated for entities, relations and events by the Linguistic Data Consortium (LDC) with support from the ACE Program and additional assistance from LDC.

The objective of the ACE program was to develop automatic content extraction technology to support automatic processing of human language in text form.

In November 2005, sites were evaluated on system performance in five primary areas: the recognition of entities, values, temporal expressions, relations, and events. Entity, relation and event mention detection were also offered as diagnostic tasks. All tasks except the event tasks were performed for three languages: English, Chinese and Arabic. Event tasks were evaluated in English and Chinese only. This release comprises the official training data for these evaluation tasks.

For more information about linguistic resources for the ACE Program, including annotation guidelines, task definitions and other documentation, see LDC's ACE website.

Data

Below is information about the amount of data in this release and its annotation status.

  • 1P: data subject to first pass (complete) annotation
  • DUAL: data also subject to dual first pass (complete) annotation
  • ADJ: data also subject to discrepancy resolution/adjudication
  • NORM: data also subject to TIMEX2 normalization

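---------------------
Explanation of 1P, DUAL, ADJ, and NORM (source: https://blog.csdn.net/taolusi/article/details/80812597, author: taolusi)

The adj, fp1, fp2, and timex2norm folders correspond to different stages of the annotation process. For every task, the ACE corpus was annotated by two annotators working independently. The first annotation pass is called 1P, and the independent dual first pass is called DUAL. For both 1P and DUAL, a single annotator completes all tasks for a file. Files are assigned through the automated Annotation Work-flow System (AWS), and assignment is double-blind. Note: 1P and DUAL are stored in the folders 'fp1' and 'fp2' respectively, so 1P corresponds to fp1 and DUAL corresponds to fp2. Discrepancies between the 1P and DUAL versions of each file are adjudicated by a senior annotator or team leader, which yields a high-quality gold-standard file. These adjudicated gold-standard files are called ADJ (the adj folder mentioned above). After adjudication, TIMEX2 values are normalized to produce NORM. All datasets in this corpus have been NORM-annotated.
---------------------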
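
The folder-to-stage correspondence described in the quoted explanation can be written down as a small lookup table. This is only a minimal sketch: the folder names fp1, fp2, adj, and timex2norm come from the explanation itself, while the names STAGE_BY_FOLDER and stage_for_path are hypothetical, and the example path is made up for illustration and may not match the real directory layout of the corpus.

```python
from pathlib import PurePosixPath

# Folder name -> annotation stage, as described in the quoted explanation.
STAGE_BY_FOLDER = {
    "fp1": "1P",
    "fp2": "DUAL",
    "adj": "ADJ",
    "timex2norm": "NORM",
}


def stage_for_path(path):
    """Return the annotation stage for a file path, or None if no
    path component is one of the known stage folders."""
    for part in PurePosixPath(path).parts:
        stage = STAGE_BY_FOLDER.get(part.lower())
        if stage is not None:
            return stage
    return None


# Hypothetical path, used only to exercise the lookup.
print(stage_for_path("English/bn/adj/example.apf.xml"))  # prints "ADJ"
```

A lookup like this makes it easy to restrict processing to one stage, for example keeping only files whose stage is ADJ when gold-standard annotations are wanted.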

 

English

        words                              files
        1P      DUAL    ADJ     NORM       1P    DUAL  ADJ   NORM
NW      60658   57807   33459   48399      128   124   81    106
BN      59239   58144   52444   55967      239   234   217   226
BC      46612   46110   33874   40415      68    67    52    60
WL      45210   43648   35529   37897      127   122   114   119
UN      45161   44473   26371   37366      58    57    37    49
CTS     47003   47003   34868   39845      46    46    34    39
Total   303833  297185  216545  259889     666   650   535   599

Chinese

Note: Chinese data is expressed in terms of characters. We assume a correspondence of roughly 1.5 characters/word.

                       chars                      files
                       1P      DUAL    ADJ        1P    DUAL  ADJ
NW (newswire)          127319  124175  121797     248   242   238
BN (broadcast news)    134963  133696  120513     332   328   298
WL (weblogs)           71839   68063   65681      107   101   97
Total                  334121  325834  307991     687   671   633
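
To compare the Chinese figures with the word counts reported for English and Arabic, the character totals can be converted using the rough 1.5 characters/word ratio from the note. The snippet below is a minimal sketch of that arithmetic; the names CHARS_PER_WORD, chinese_char_totals, and estimate_words are hypothetical, and the results are estimates only, since the ratio itself is stated as approximate.

```python
CHARS_PER_WORD = 1.5  # rough ratio stated in the note on the Chinese data

# Character totals copied from the "Total" row of the Chinese table.
chinese_char_totals = {"1P": 334121, "DUAL": 325834, "ADJ": 307991}


def estimate_words(chars, chars_per_word=CHARS_PER_WORD):
    """Convert a character count to an approximate word count."""
    return round(chars / chars_per_word)


for stage, chars in chinese_char_totals.items():
    print(f"{stage}: {chars} chars ~ {estimate_words(chars)} words")
```

Under this assumption the adjudicated Chinese data comes to roughly 205,000 words, which is close to the 216,545 adjudicated words reported for English.
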
Arabic

        words                      files
        1P      DUAL    ADJ        1P    DUAL  ADJ
NW      61287   56158   53026      239   226   221
BN      29259   27165   26907      134   128   127
WL      21687   20181   20181      60    55    55
Total   112233  103504  100114     433   409   403

Samples

For examples of the data in this publication, please review the following samples:

  • English
  • Arabic
  • Chinese