Penn Treebank 4/n

4. Bracketing
4.1 Basic Methodology
The methodology for bracketing the corpus is completely parallel to that for tagging:
hand correction of the output of an errorful automatic process. Fidditch, a deterministic
parser developed by Donald Hindle, first at the University of Pennsylvania and
subsequently at AT&T Bell Labs (Hindle 1983, 1989), is used to provide an initial parse of
the material. Annotators then hand correct the parser's output using a mouse-based
interface implemented in GNU Emacs Lisp. Fidditch has three properties that make it
ideally suited to serve as a preprocessor to hand correction:
    • Fidditch always provides exactly one analysis for any given sentence, so
      that annotators need not search through multiple analyses.
    • Fidditch never attaches any constituent whose role in the larger structure
      it cannot determine with certainty. In cases of uncertainty, Fidditch
      chunks the input into a string of trees, providing only a partial structure
      for each sentence.
    • Fidditch has rather good grammatical coverage, so that the grammatical
      chunks that it does build are usually quite accurate.
    Because of these properties, annotators do not need to rebracket much of the
parser's output (a relatively time-consuming task). Rather, the annotators' main task
is to "glue" together the syntactic chunks produced by the parser. Using a mouse-based
interface, annotators move each unattached chunk of structure under the node to which
it should be attached. Notational devices allow annotators to indicate uncertainty
concerning constituent labels, and to indicate multiple attachment sites for ambiguous
modifiers. The bracketing process is described in more detail in Section 4.3.
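The "gluing" step described above can be pictured as a small tree operation. The following is a minimal sketch, not the Emacs Lisp interface itself: the `Node` class, the `attach` function, and the choice of attachment site are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Node:
    # Category label, e.g. "S" or "NP"; "?" marks an unattached chunk.
    label: str
    children: List[Union["Node", str]] = field(default_factory=list)


def attach(forest, chunk_index, target):
    """Move forest[chunk_index] under `target`, dropping any "?" wrapper."""
    chunk = forest.pop(chunk_index)
    if chunk.label == "?":
        # The "?" node only marks the chunk as unattached; glue its contents.
        target.children.extend(chunk.children)
    else:
        target.children.append(chunk)
    return forest


def show(node):
    """Render a node as a bracketed string."""
    if isinstance(node, str):
        return node
    return "(" + " ".join([node.label] + [show(c) for c in node.children]) + ")"


# Hypothetical parser output: one sentence tree plus one unattached PP chunk.
vp = Node("VP", ["buck"])
s = Node("S", [Node("NP", ["managers"]), vp])
forest = [s, Node("?", [Node("PP", ["with", Node("NP", ["the", "tale"])])])]

attach(forest, 1, vp)  # the annotator drags the PP under the VP
print(show(forest[0]))
# (S (NP managers) (VP buck (PP with (NP the tale))))
```

The point of the sketch is that the annotator's action changes only where a chunk hangs; the internal structure that the parser built for the chunk is left untouched.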
4.2 The Syntactic Tagset
Table 3 shows the set of syntactic tags and null elements that we use in our skeletal
bracketing. More detailed information on the syntactic tagset and guidelines
concerning its use are to be found in Santorini and Marcinkiewicz (1991).
    Although different in detail, our tagset is similar in delicacy to that used by the
Lancaster Treebank Project, except that we allow null elements in the syntactic
annotation. Because of the need to achieve a fairly high output per hour, it was decided
not to require annotators to create distinctions beyond those provided by the parser.
Our approach to developing the syntactic tagset was highly pragmatic and strongly
influenced by the need to create a large body of annotated material given limited
human resources. Despite the skeletal nature of the bracketing, however, it is possible to
make quite delicate distinctions when using the corpus by searching for combinations
of structures. For example, an SBAR containing the word to immediately before the
VP will necessarily be infinitival, while an SBAR containing a verb or auxiliary with a
tense feature will necessarily be tensed. To take another example, so-called that-clauses
can be identified easily by searching for SBARs containing the word that or the null
element 0 in initial position.
Mitchell P. Marcus et al.
Building a Large Annotated Corpus of English
Table 3
The Penn Treebank syntactic tagset.
ADJP      Adjective phrase
ADVP      Adverb phrase
NP        Noun phrase
PP        Prepositional phrase
S         Simple declarative clause
SBAR      Clause introduced by subordinating conjunction or 0 (see below)
SBARQ     Direct question introduced by wh-word or wh-phrase
SINV      Declarative sentence with subject-aux inversion
SQ        Subconstituent of SBARQ excluding wh-word or wh-phrase
VP        Verb phrase
WHADVP    wh-adverb phrase
WHNP      wh-noun phrase
WHPP      wh-prepositional phrase
X         Constituent of unknown or uncertain category
Null elements
"Understood" subject of infinitive or imperative
Zero variant of that in subordinate clauses
Trace: marks position where moved wh-constituent is interpreted
Marks position where preposition is interpreted in pied-piping contexts
    As can be seen from Table 3, the syntactic tagset used by the Penn Treebank
includes a variety of null elements, a subset of the null elements introduced by Fidditch.
While it would be expensive to insert null elements entirely by hand, it has not proved
overly onerous to maintain and correct those that are automatically provided. We have
chosen to retain these null elements because we believe that they can be exploited in
many cases to establish a sentence's predicate-argument structure; at least one recipient
of the parsed corpus has used it to bootstrap the development of lexicons for
particular NLP projects and has found the presence of null elements to be a considerable
aid in determining verb transitivity (Robert Ingria, personal communication). While
these null elements correspond more directly to entities in some grammatical theories
than in others, it is not our intention to lean toward one or another theoretical view in
producing our corpus. Rather, since the representational framework for grammatical
structure in the Treebank is a relatively impoverished flat context-free notation, the
easiest mechanism to include information about predicate-argument structure, although
indirectly, is by allowing the parse tree to contain explicit null items.
4.3 Sample Bracketing Output
Below, we illustrate the bracketing process for the first sentence of our sample text.
Figure 3 shows the output of Fidditch (modified slightly to include our POS tags).
    As Figure 3 shows, Fidditch leaves very many constituents unattached, labeling
them as "?", and its output is perhaps better thought of as a string of tree fragments
than as a single tree structure. Fidditch only builds structure when this is possible for
a purely syntactic parser without access to semantic or pragmatic information, and it
Computational Linguistics
Volume 19, Number 2
((S
   (NP (NBAR (ADJP (ADJ "Battle-tested/JJ")
                   (ADJ "industrial/JJ"))
             (NPL "managers/NNS")))
   (?(ADV "here/RB"))
   (?(ADV "always/RB"))
   (AUX (TNS*))
   (VP (VPRES "buck/VBP")))
 (?(PP (PREP "up/RP")
       (NP (NBAR (ADJ "nervous/JJ")
                 (NPL "newcomers/NNS")))))
 (?(PP (PREP "with/IN")
       (NP (DART "the/DT")
           (NBAR (N "tale/NN"))
           (PP of/PREP
               (NP (DART "the/DT")
                   (NBAR (ADJP
                           (ADJ "first/JJ"))))))))
 (?(PP of/PREP
       (NP (PROS "their/PP$")
           (NBAR (NPL "countrymen/NNS")))))
 (?(S (NP (PRO*))
      (AUX to/TNS)
      (VP (V "visit/VB")
          (NP (PNP "Mexico/NNP")))))
 (?(MID ",/,"))
 (?(NP (IART "a/DT")
       (NBAR (N "boatload/NN"))
       (PP of/PREP
           (NP (NBAR
                 (NPL "warriors/NNS"))))
       (VP (VPPRT "blown/VBN")
           (?(ADV "ashore/RB"))
           (NP (NBAR (CARD "375/CD")
                     (NPL "years/NNS"))))))
 (?(ADV "ago/RB"))
 (?(FIN "./.")))
Figure 3
Sample bracketed text: full structure provided by Fidditch.
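To make the "string of tree fragments" reading of Figure 3 concrete, the sketch below parses an abridged excerpt of the figure and lists the chunks labeled "?". The tokenizer, the function names, and the nested-list representation are assumptions of this example, not part of Fidditch or of the annotation interface.

```python
import re

# Tokens: parentheses, double-quoted strings, or runs of other characters.
TOKEN = re.compile(r'\(|\)|"[^"]*"|[^\s()"]+')


def parse(text):
    """Parse one bracketed expression into nested lists of strings."""
    tokens = TOKEN.findall(text)
    pos = 0

    def read():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        if tok != "(":
            return tok.strip('"')
        items = []
        while tokens[pos] != ")":
            items.append(read())
        pos += 1  # consume the closing ")"
        return items

    return read()


def unattached(node):
    """Yield every chunk labeled "?", in left-to-right order."""
    if isinstance(node, list):
        if node and node[0] == "?":
            yield node
        for child in node:
            if isinstance(child, list):
                yield from unattached(child)


def words(node):
    """Collect the word/TAG leaves below a labelled node."""
    if isinstance(node, str):
        return [node]
    return [w for child in node[1:] for w in words(child)]


# Abridged excerpt of Figure 3 (several fragments omitted).
excerpt = '''
((S
   (NP (NBAR (ADJP (ADJ "Battle-tested/JJ")
                   (ADJ "industrial/JJ"))
             (NPL "managers/NNS")))
   (?(ADV "here/RB"))
   (?(ADV "always/RB"))
   (AUX (TNS*))
   (VP (VPRES "buck/VBP")))
 (?(S (NP (PRO*))
      (AUX to/TNS)
      (VP (V "visit/VB")
          (NP (PNP "Mexico/NNP")))))
 (?(ADV "ago/RB"))
 (?(FIN "./.")))
'''

forest = parse(excerpt)
for chunk in unattached(forest):
    print(words(chunk))
# ['here/RB']
# ['always/RB']
# ['to/TNS', 'visit/VB', 'Mexico/NNP']
# ['ago/RB']
# ['./.']
```

Each printed list corresponds to one chunk that an annotator would still have to attach by hand.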
