ACE 2005 Data Prep

The following content is from https://github.com/mgormley/ace-data-prep

Description

This project ties together numerous tools. It converts from the ACE 2005 file format (.sgm and .apf.xml files) to Concrete. It also annotates the ACE 2005 data using Stanford CoreNLP and the chunklink.pl script from CoNLL-2000. The data is the same as that used in (Yu, Gormley, & Dredze, NAACL 2015) and (Gormley, Yu, & Dredze, EMNLP 2015). See below for appropriate citations.

The output of the pipeline is available in two formats: Concrete and JSON. Concrete is a data serialization format for NLP. See the primer on Concrete for additional details. As a convenience, the output is also converted to an easy-to-parse Concatenated JSON format. This conversion is done by Pacaya NLP (https://github.com/mgormley/pacaya-nlp). An example sentence is shown below.

{"words":["i","'m","wolf","blitzer","in","washington","."]
,"lemmas":["i","be","wolf","blitzer","in","washington","."]
,"posTags":["LS","VBP","JJ","NN","IN","NN","."]
,"chunks":["O","B-VP","B-NP","I-NP","B-PP","B-NP","O"]
,"parents":[3,3,3,-1,3,4,-2]
,"deprels":["nsubj","cop","amod","root","prep","pobj",null]
,"naryTree":"((ROOT (S (NP (LS i)) (VP (VBP 'm) (NP (NP (JJ wolf) (NN blitzer)) (PP (IN in) (NP (NN washington))))) (. .))))"
,"nePairs":"[{\"m1\":{\"start\":2,\"end\":4,\"head\":3,\"type\":null,\"subtype\":null,\"phraseType\":\"NAM\",\"id\":\"db1b9d9c-15cb-f7bb-7ded-00007733280a\"},\"m2\":{\"start\":5,\
,"relLabels":["PHYS(Arg-1,Arg-1)"]}

The words, named-entity pairs (nePairs), and relation labels (relLabels) are given by the original ACE 2005 data. The lemmas, part-of-speech tags (posTags), labeled syntactic dependency parse (parents, deprels), and constituency parse (naryTree) are automatically annotated by Stanford CoreNLP. The chunks are derived from the constituency parse using a python wrapper of the chunklink.pl script from CoNLL-2000.
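To make these field semantics concrete, here is a small Python sketch (a reading aid, not part of the pipeline) that loads one sentence in the JSON format above and reconstructs the labeled dependency edges from parents and deprels. Treating -2 as an unattached token is an assumption based on the punctuation token in the example, which carries parent -2 and a null deprel.

```python
import json

# One sentence in the pipeline's JSON format (abbreviated from the example above).
sent = json.loads("""{
  "words":   ["i","'m","wolf","blitzer","in","washington","."],
  "posTags": ["LS","VBP","JJ","NN","IN","NN","."],
  "parents": [3,3,3,-1,3,4,-2],
  "deprels": ["nsubj","cop","amod","root","prep","pobj",null]
}""")

def dependency_edges(sent):
    """Yield (head_word, dependent_word, relation) triples.

    parents[i] is the 0-based index of token i's head; -1 marks the
    root, and -2 (assumption) marks a token left out of the parse.
    """
    for i, head in enumerate(sent["parents"]):
        if head < 0:
            continue  # skip the root and unattached tokens
        yield sent["words"][head], sent["words"][i], sent["deprels"][i]

for head, dep, rel in dependency_edges(sent):
    print("%s -%s-> %s" % (head, rel, dep))
```

For the example sentence this recovers five edges, all headed directly or transitively by the root token "blitzer".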

After executing make LDC_DIR=./LDC OUT_DIR=./output ace05splits (see details below), the output will consist of the following directories:

  • LDC2006T06_temp_copy/: A copy of the LDC input directory with DTD files placed appropriately.
  • ace-05-comms/: The ACE 2005 data converted to Concrete.
  • ace-05-comms-ptb-anno/: The ACE 2005 data converted to Concrete and annotated with Stanford CoreNLP.
  • ace-05-comms-ptb-anno-chunks/: The ACE 2005 data converted to Concrete and annotated with Stanford CoreNLP and chunklink.pl.
  • ace-05-comms-ptb-anno-chunks-json{-ng14,-pm13,-ygd15-r11,-ygd15-r32}/: The fully annotated data converted to Concatenated JSON.
  • ace-05-splits/: The same data as above, but each subdirectory contains the data split into separate domains (i.e. Newswire (nw), Broadcast Conversation (bc), Broadcast News (bn), Telephone Speech (cts), Usenet Newsgroups (un), and Weblogs (wl)). It also includes the training set bn+nw from (Gormley, Yu, & Dredze, EMNLP 2015), as well as the dev and test splits of the bc domain: bc_dev/ and bc_test/ respectively.

We recommend all users of this pipeline use the files in ace-05-splits for replicating the settings of prior work.
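The JSON directories above hold Concatenated JSON, i.e. JSON objects written back-to-back in a file rather than wrapped in a top-level array (the exact file layout is an assumption based on the format's name). A minimal Python sketch for reading such a stream with json.JSONDecoder.raw_decode:

```python
import json

def read_concatenated_json(text):
    """Yield each top-level JSON object from a concatenated-JSON string."""
    decoder = json.JSONDecoder()
    idx = 0
    n = len(text)
    while idx < n:
        # Skip any whitespace (e.g. newlines) between objects.
        while idx < n and text[idx].isspace():
            idx += 1
        if idx >= n:
            break
        obj, end = decoder.raw_decode(text, idx)
        yield obj
        idx = end

# Two toy sentence objects back-to-back, as concatenated JSON.
stream = '{"words":["hello"]}\n{"words":["world","."]}'
for sent in read_concatenated_json(stream):
    print(sent["words"])
```

raw_decode parses one object starting at a given index and returns where it ended, which is exactly what a delimiter-free stream of objects requires.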

A key difference between the Concrete and JSON formats: for each sentence, the Concrete data includes all of the relation and named entity labels. By contrast, the JSON data includes multiple copies of each sentence with one relation / named entity pair per copy. Further, the JSON data includes explicit NO_RELATION labels, whereas the Concrete data only includes the positive labels. The literature includes several ways of defining the positive relation labels (e.g. with or without direction) and the negative relations (i.e. NO_RELATION for all pairs vs. only those pairs with some number of intervening entity mentions). The JSON format for the directories ending in {-ng14,-pm13,-ygd15-r11,-ygd15-r32} corresponds to several such idiosyncratic formats. See below for more details.
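The sentence-copying scheme described above can be sketched as follows (hypothetical data structures, not the pipeline's actual code). Which pairs are ordered and which receive negatives varies across the idiosyncratic formats, so this sketch simply uses ordered mention pairs and assigns NO_RELATION to every pair the gold data leaves unlabeled:

```python
import itertools

NO_RELATION = "NO_RELATION"

def expand_sentence(words, mentions, positive):
    """Emit one (sentence, pair, label) record per ordered mention pair.

    mentions: list of mention ids in this sentence.
    positive: {(m1, m2): label} drawn from the gold annotations.
    Pairs absent from `positive` get an explicit NO_RELATION label.
    """
    records = []
    for m1, m2 in itertools.permutations(mentions, 2):
        label = positive.get((m1, m2), NO_RELATION)
        records.append({"words": words, "m1": m1, "m2": m2, "relLabel": label})
    return records

# Toy example mirroring the sentence shown earlier (mention ids are made up).
records = expand_sentence(
    words=["i", "'m", "wolf", "blitzer", "in", "washington", "."],
    mentions=["wolf_blitzer", "washington"],
    positive={("wolf_blitzer", "washington"): "PHYS(Arg-1,Arg-1)"},
)
```

With two mentions this yields two copies of the sentence: one carrying the positive PHYS label and one carrying NO_RELATION for the reverse-ordered pair.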


Citations

The data in the directories ending with -pm13 is the same data from (Gormley, Yu, & Dredze, EMNLP 2015) that replicates the settings of (Plank & Moschitti, 2013).

@inproceedings{gormley_improved_2015,
    author = {Matthew R. Gormley and Mo Yu and Mark Dredze},
    title = {Improved Relation Extraction with Feature-rich Compositional Embedding Model},
    booktitle = {Proceedings of {EMNLP}},
    year = {2015},
}

The data in the directories ending with -ng14 replicates the settings of (Nguyen & Grishman, 2014).

The data in the directories ending with -ygd15-r11 and -ygd15-r32 is the 11-output and 32-output labeled data from (Yu, Gormley, & Dredze, NAACL 2015).

@inproceedings{yu_combining_2015,
    author = {Yu, Mo and Gormley, Matthew R. and Dredze, Mark},
    title = {Combining Word Embeddings and Feature Embeddings for Fine-grained Relation Extraction},
    booktitle = {Proceedings of {NAACL}},
    year = {2015}
}

Requirements


  • Java 1.8+
  • Maven 3.4+
  • Python 2.7.x
  • GNU Make

Convert ACE 2005 Data to Concrete

First find the correct ACE 2005 data from the LDC release LDC2006T06. We use the adjudicated files located in 'adj' subdirectories.

Convert and Annotate Full Dataset

A Makefile is included to convert the full ACE 2005 dataset to Concrete. The same Makefile will also add Stanford CoreNLP annotations and convert the constituency trees to chunks with chunklink.pl. It will also install the latest version of concrete-python and clone the concrete-chunklink repository.

The command below will convert the data to Concrete (with AceApf2Concrete), annotate (with Stanford and chunklink.pl), and split the data back into domains (with split_ace_dir.sh).

make LDC_DIR=<path to LDC directory> \
     OUT_DIR=<path to output directory> \
     ace05splits

Convert a Single File to Concrete

To convert a single ACE file to Concrete, use AceApf2Concrete. Note that the apf.v5.1.1.dtd file must be in the same directory as the .apf.xml and .sgm files, or the DOM reader will throw a FileNotFound exception.

source setupenv.sh
cp LDC2006T06/dtd/*.dtd ./
java -ea edu.jhu.re.AceApf2Concrete APW_ENG_20030322.0119.apf.xml APW_ENG_20030322.0119.comm
