10 Corpus tools

10.1 Toolset: Pepper, Atomic and ANNIS together form a complete corpus workflow toolchain, which is built on Salt.


Note that although every tool in the set can be used independently, their interoperability means that users benefit most from corpus-tools.org when they use all tools together.

10.2 A common generic data model: Salt

Salt is a generic, graph-based meta model for linguistic data, implemented as an open source Java API for storing, manipulating and representing data. Salt is text-based. 

Figure 2 shows a syntactically annotated sentence modeled in Salt.


Nodes and edges are placeholders. All nodes and edges belonging to a particular kind of annotation, such as morphological annotation, syntactic annotation or information structure annotation, can be bundled in separate layers.
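The idea of a layered annotation graph can be illustrated with a small sketch. The Python below is a toy model and not the Salt Java API: the names `Node`, `Edge`, `Graph` and `build_example`, as well as the layer names and the example sentence, are assumptions chosen only to show how nodes and edges might be grouped into named layers.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A placeholder element; the annotations carry the linguistic content."""
    node_id: str
    annotations: dict = field(default_factory=dict)


@dataclass
class Edge:
    """A directed relation between two nodes, identified by their ids."""
    source: str
    target: str
    annotations: dict = field(default_factory=dict)


class Graph:
    """Toy annotation graph with named layers (hypothetical, not Salt)."""

    def __init__(self):
        self.nodes = {}
        self.edges = []
        self.layers = {}  # layer name -> {"nodes": set of ids, "edges": set of indices}

    def _layer(self, name):
        return self.layers.setdefault(name, {"nodes": set(), "edges": set()})

    def add_node(self, node, layer=None):
        self.nodes[node.node_id] = node
        if layer is not None:
            self._layer(layer)["nodes"].add(node.node_id)

    def add_edge(self, edge, layer=None):
        self.edges.append(edge)
        if layer is not None:
            self._layer(layer)["edges"].add(len(self.edges) - 1)

    def nodes_in_layer(self, layer):
        return sorted(self.layers.get(layer, {"nodes": set()})["nodes"])


def build_example():
    """Model the two-token sentence 'Salt works' with two layers."""
    g = Graph()
    # Tokens with part-of-speech annotations go into a morphology layer.
    g.add_node(Node("t1", {"text": "Salt", "pos": "NN"}), layer="morphology")
    g.add_node(Node("t2", {"text": "works", "pos": "VBZ"}), layer="morphology")
    # Syntactic nodes and their dominance edges go into a syntax layer.
    for cat in ("S", "NP", "VP"):
        g.add_node(Node(cat, {"cat": cat}), layer="syntax")
    for source, target in (("S", "NP"), ("S", "VP"), ("NP", "t1"), ("VP", "t2")):
        g.add_edge(Edge(source, target, {"type": "dominance"}), layer="syntax")
    return g
```

Because the nodes themselves carry no linguistic meaning, a new kind of annotation only needs new annotation keys and a new layer name, not a new node class.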

10.3 Creating/migrating corpus resources for annotation: Pepper

Corpora and annotations exist in a multitude of different formats. To prepare them for further annotation, they have to be converted into a format that the annotation software can process. This can be done with Pepper, a platform-independent, modular framework for converting and processing linguistic data.

Pepper supplies three types of modules: importers, manipulators and exporters. An unrestricted number of these can be combined into a single conversion workflow. Pepper uses multithreading to reduce conversion times.
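The importer, manipulator and exporter chain can be sketched roughly as follows. This is an illustration in Python and not Pepper's own API or workflow format: all function names are hypothetical, the sketch is simplified to a single importer and a single exporter, and the thread pool only stands in for the idea of converting several documents concurrently.

```python
from concurrent.futures import ThreadPoolExecutor


def import_plain_text(raw):
    """Hypothetical importer: turn raw text into a simple document dict."""
    return {"tokens": raw.split(), "annotations": {}}


def lowercase_tokens(doc):
    """Hypothetical manipulator: stands in for any processing step."""
    return {**doc, "tokens": [t.lower() for t in doc["tokens"]]}


def export_one_token_per_line(doc):
    """Hypothetical exporter: serialize the tokens, one per line."""
    return "\n".join(doc["tokens"])


def run_workflow(raw_documents, importer, manipulators, exporter, max_workers=4):
    """Run import -> manipulate* -> export for each document."""

    def convert(raw):
        doc = importer(raw)
        for manipulate in manipulators:
            doc = manipulate(doc)
        return exporter(doc)

    # Each document is converted independently; map() keeps the input order.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(convert, raw_documents))
```

A call such as `run_workflow(["Hello World"], import_plain_text, [lowercase_tokens], export_one_token_per_line)` passes each document through every step in turn. Any number of manipulators can be listed, including none.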

To build multi-layer corpora, different kinds of annotation need to be combined. Salt and Pepper allow the output of any set of importers to be combined in a merging step. Pepper can also extract metadata, structural and annotation-related information from existing corpora. Due to its plugin-based architecture, newly developed modules can be added to Pepper at any time.
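What a merging step has to do can be shown with a deliberately simplified sketch. It assumes that two importers have produced documents over exactly the same tokens, each with its own annotation layers. The function `merge_documents` and the document layout are hypothetical; Pepper's actual merging is not described in this section and may well handle more cases, such as differing tokenizations.

```python
def merge_documents(base, other):
    """Combine the annotation layers of two documents over the same tokens."""
    if base["tokens"] != other["tokens"]:
        raise ValueError("documents do not share the same tokenization")
    clashing = set(base["layers"]) & set(other["layers"])
    if clashing:
        raise ValueError(f"layer names clash: {sorted(clashing)}")
    # Build a new document so that neither input is modified.
    return {
        "tokens": list(base["tokens"]),
        "layers": {**base["layers"], **other["layers"]},
    }


# Two hypothetical importer results for the same text.
pos_doc = {"tokens": ["Salt", "works"], "layers": {"pos": ["NN", "VBZ"]}}
lemma_doc = {"tokens": ["Salt", "works"], "layers": {"lemma": ["salt", "work"]}}
merged = merge_documents(pos_doc, lemma_doc)
```

Here the merged document carries both the `pos` and the `lemma` layer over the shared tokens, which is the multi-layer result the paragraph above describes.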

Pepper comes in two flavours: as an interactive standalone command line tool and as an API library, which can be integrated in other software. 

10.4 Annotation: Atomic

To facilitate the creation of corpora, Atomic provides some basic preprocessing tools (a tokenizer and a partitioning tool) and, more importantly, extension points for further, custom preprocessing steps. Any corpus processing step can thus be implemented as an Eclipse plugin and added to Atomic dynamically via the respective extension point. Atomic is therefore in principle an annotation platform rather than simply an annotation tool.
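The extension-point idea, where processing steps are registered under a name and looked up at run time, can be approximated in a few lines. The following Python is only an analogy and does not use the Eclipse plugin mechanism that Atomic relies on. The registry `PREPROCESSORS`, the decorator `register_preprocessor` and both example steps are hypothetical.

```python
import re

# Registry that plays the role of an extension point: name -> callable.
PREPROCESSORS = {}


def register_preprocessor(name):
    """Decorator that adds a processing step to the registry."""
    def decorator(func):
        PREPROCESSORS[name] = func
        return func
    return decorator


@register_preprocessor("whitespace_tokenizer")
def whitespace_tokenizer(text):
    """Split the text into tokens at whitespace."""
    return text.split()


@register_preprocessor("sentence_partitioner")
def sentence_partitioner(text):
    """Split the text after sentence-final punctuation followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def run_preprocessor(name, text):
    """Look up a registered step by name and apply it."""
    return PREPROCESSORS[name](text)
```

A new step is added by decorating another function. The code that calls `run_preprocessor` does not need to change, which is the property that makes the host application a platform rather than a fixed tool.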


10.5 Query and analysis: ANNIS

ANNIS provides a browser-based search and visualization architecture for complex multi-layer corpora.

ANNIS also uses Salt as its data model. It provides the native query language AQL for complex search queries as well as different visualizations for corpus data, such as KWIC views, dependency trees and coreference views. It can be extended with new plugins.
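A KWIC (keyword in context) view, one of the visualizations mentioned above, lists each match together with a few tokens to its left and right. The sketch below shows the principle in Python. The function `kwic` and its `window` parameter are hypothetical and unrelated to how ANNIS implements its views or to AQL.

```python
def kwic(tokens, keyword, window=2):
    """Return (left context, match, right context) for each keyword match."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append((left, tok, right))
    return lines


# Example: every occurrence of "the" with one token of context on each side.
hits = kwic("the cat sat on the mat".split(), "the", window=1)
```

Matching is case-insensitive here, and the context is cut off at the start and end of the token list.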


Conclusion:

This toolset supports a complete workflow for multi-layer corpora, from creation and annotation to analysis and release.


Source: corpus-tools.org: An Interoperable Generic Software Tool Set for Multi-layer Linguistic Corpora.
