always errs on the side of caution. Since determining the correct attachment point of prepositional phrases, relative clauses, and adverbial modifiers almost always requires extrasyntactic information, Fidditch pursues the very conservative strategy of always leaving such constituents unattached, even if only one attachment point is syntactically possible. However, Fidditch does indicate its best guess concerning a fragment's attachment site by the fragment's depth of embedding. Moreover, it attaches prepositional phrases beginning with of if the preposition immediately follows a noun; thus, tale of… and boatload of… are parsed as single constituents, while first of… is not.
Since Fidditch lacks a large verb lexicon, it cannot decide whether some constituents serve as adjuncts or arguments and hence leaves subordinate clauses such as infinitives as separate fragments.
322 Mitchell P. Marcus et al. / Building a Large Annotated Corpus of English
Note further that Fidditch creates adjective phrases only when it determines that more than one lexical item belongs in the ADJP. Finally, as is well known, the scope of conjunctions and other coordinate structures can only be determined given the richest forms of contextual information; here again, Fidditch simply turns out a string of tree fragments around any conjunction. Because all decisions within Fidditch are made locally, all commas (which often signal conjunction) must disrupt the input into separate chunks.
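The of-attachment heuristic just described amounts to a one-line test on the preceding word's part of speech. The sketch below is our own illustration, not Fidditch's internals; the noun-tag set and function name are assumptions:

```python
# Sketch of the of-attachment heuristic described above: a prepositional
# phrase headed by "of" is attached only when the preposition
# immediately follows a noun. The noun-tag set is an illustrative
# assumption, not Fidditch's actual inventory.
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}

def attach_of_pp(preceding_tag, preposition):
    """True if the of-PP should be attached to the preceding word."""
    return preposition == "of" and preceding_tag in NOUN_TAGS

print(attach_of_pp("NN", "of"))   # True  -> "tale of ..." is one constituent
print(attach_of_pp("JJ", "of"))   # False -> "first of ..." is left unattached
```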
The original design of the Treebank called for a level of syntactic analysis comparable to the skeletal analysis used by the Lancaster Treebank, but a limited experiment was performed early in the project to investigate the feasibility of providing greater levels of structural detail. While the results were somewhat unclear, there was evidence that annotators could maintain a much faster rate of hand correction if the parser output was simplified in various ways, reducing the visual complexity of the tree representations and eliminating a range of minor decisions. The key results of this experiment were as follows:
• Annotators take substantially longer to learn the bracketing task than the POS tagging task, with substantial increases in speed occurring even after two months of training.
• Annotators can correct the full structure provided by Fidditch at an average speed of approximately 375 words per hour after three weeks and 475 words per hour after six weeks.
• Reducing the output from the full structure shown in Figure 3 to a more skeletal representation similar to that used by the Lancaster UCREL Treebank Project increases annotator productivity by approximately 100-200 words per hour.
• It proved to be very difficult for annotators to distinguish between a verb's arguments and adjuncts in all cases. Allowing annotators to ignore this distinction when it is unclear (attaching constituents high) increases productivity by approximately 150-200 words per hour. Informal examination of later annotation showed that forced distinctions cannot be made consistently.
As a result of this experiment, the originally proposed skeletal representation was
adopted, without a forced distinction between arguments and adjuncts. Even after
extended training, performance varies markedly by annotator, with speeds on the task
of correcting skeletal structure without requiring a distinction between arguments and
adjuncts ranging from approximately 750 words per hour to well over 1,000 words
per hour after three or four months' experience. The fastest annotators work in bursts
of well over 1,500 words per hour alternating with brief rests. At an average rate
of 750 words per hour, a team of five part-time annotators annotating three hours a
day should maintain an output of about 2.5 million words a year of "treebanked"
sentences, with each sentence corrected once.
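The projected annual output follows from simple arithmetic; the quick check below is a sketch, and the count of working days per year is our assumption, not a figure given by the project:

```python
# Back-of-the-envelope check of the projected annual output.
annotators = 5
hours_per_day = 3        # part-time schedule
words_per_hour = 750     # average correction speed per annotator

daily_output = annotators * hours_per_day * words_per_hour
working_days = 222       # assumed working days per year (not from the text)

print(daily_output)                  # 11250 words per day
print(daily_output * working_days)   # 2497500, i.e. about 2.5 million words
```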
It is worth noting that experienced annotators can proofread previously corrected
material at very high speeds. A parsed subcorpus of over 1 million words was recently
proofread at an average speed of approximately 4,000 words per annotator per hour.
At this rate of productivity, annotators are able to find and correct gross errors in
parsing, but do not have time to check, for example, whether they agree with all
prepositional phrase attachments.
323 Computational Linguistics, Volume 19, Number 2
((S
(NP (ADJP Battle-tested industrial)
managers)
(?here)
(?always)
(VP buck))
(?(PP up
(NP nervous newcomers)))
(?(PP with
(NP the tale
(PP of
(NP the
(ADJP first))))))
(?(PP of
(NP their countrymen)))
(?(S (NP*)
to
(VP visit
(NP Mexico))))
(?,)
(?(NP a boatload
(PP of
(NP warriors))
(VP blown
(?ashore)
(NP 375 years))))
(?ago)
(?.))
Figure 4
Sample bracketed text, after simplification and before correction.
The process that creates the skeletal representations to be corrected by the annotators simplifies and flattens the structures shown in Figure 3 by removing POS tags, nonbranching lexical nodes, and certain phrasal nodes, notably NBAR. The output of the first automated stage of the bracketing task is shown in Figure 4.
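The simplification just described can be sketched as a recursive rewrite over nested-list trees. This is a hypothetical illustration under our own assumptions; the real tool's label inventory and rules were more elaborate:

```python
# Sketch of the automatic simplification: strip POS tags by splicing
# preterminal nodes into their parents, and drop designated phrasal
# nodes such as NBAR the same way. Trees are nested lists of the form
# [label, child, child, ...]; the tag sets below are illustrative only.
POS_TAGS = {"DT", "JJ", "NN", "NNS", "VB", "IN"}
DROP = {"NBAR"}

def simplify(tree):
    """Return a list of simplified nodes replacing `tree`."""
    if isinstance(tree, str):               # a bare word
        return [tree]
    label, children = tree[0], tree[1:]
    flat = []
    for child in children:
        flat.extend(simplify(child))
    if label in POS_TAGS or label in DROP:  # splice children into parent
        return flat
    return [[label] + flat]                 # keep the phrasal bracket

tree = ["NP", ["DT", "the"], ["NBAR", ["JJ", "nervous"], ["NNS", "newcomers"]]]
print(simplify(tree))   # [['NP', 'the', 'nervous', 'newcomers']]
```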
Annotators correct this simplified structure using a mouse-based interface. Their
primary job is to "glue" fragments together, but they must also correct incorrect parses
and delete some structure. Single mouse clicks perform the following tasks, among
others. The interface correctly reindents the structure whenever necessary.
• Attach constituents labeled ?. This is done by pressing down the appropriate mouse button on or immediately after the ?, moving the mouse onto or immediately after the label of the intended parent, and releasing the mouse. Attaching constituents automatically deletes their ? label.
• Promote a constituent up one level of structure, making it a sibling of its current parent.
• Delete a pair of constituent brackets.
((S
(NP Battle-tested industrial managers
here)
always
(VP buck
up
(NP nervous newcomers)
(PP with
(NP the tale
(PP of
(NP (NP the
(ADJP first
(PP of
(NP their countrymen)))
(S (NP*)
to
(VP visit
(NP Mexico))))
(NP (NP a boatload
(PP of
(NP (NP warriors)
(VP-1 blown
ashore
(ADVP (NP 375 years)
ago)))))
(VP-1 *pseudo-attach*))))))))
.)
Figure 5
Sample bracketed text, after correction.
• Create a pair of brackets around a constituent. This is done by typing a constituent tag and then sweeping out the intended constituent with the mouse. The tag is checked to assure that it is a legal label.
• Change the label of a constituent. The new tag is checked to assure that it is legal.
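Two of these operations, promotion and bracket deletion, can be sketched as list manipulations on nested-list trees. These are hypothetical helpers for illustration; the actual tool was a mouse-based interface, not a programmatic API:

```python
# Sketch of two editing operations on trees of the form [label, child, ...].

def promote(grandparent, parent, node):
    """Promote `node` up one level: detach it from `parent` and make it
    a sibling of (here, immediately preceding) `parent`."""
    parent.remove(node)
    grandparent.insert(grandparent.index(parent), node)

def delete_brackets(parent, node):
    """Delete one pair of constituent brackets: splice `node`'s
    children into `parent` in place of `node` itself."""
    i = parent.index(node)
    parent[i:i + 1] = node[1:]

np = ["NP", ["ADJP", "nervous"], "newcomers"]
delete_brackets(np, np[1])
print(np)   # ['NP', 'nervous', 'newcomers']
```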
The bracketed text after correction is shown in Figure 5. The fragments are now connected into a single rooted tree structure. The result is a skeletal analysis in
that much syntactic detail is left unannotated. Most prominently, all internal structure
of the NP up through the head and including any single-word post-head modifiers is
left unannotated.
As noted above in connection with POS tagging, a major goal of the Treebank project is to allow annotators to indicate only structure of which they are certain. The Treebank provides two notational devices in support of this goal: the X constituent label
and so-called "pseudo-attachment." The X constituent label is used if an annotator
is sure that a sequence of words is a major constituent but is unsure of its syntactic
category; in such cases, the annotator simply brackets the sequence and labels it X. The
second notational device, pseudo-attachment, has two primary uses. On the one hand,
it is used to annotate what Kay has called permanent predictable ambiguities, allowing an
annotator to indicate that a structure is globally ambiguous even given the surrounding
context (annotators always assign structure to a sentence on the basis of its context). An
example of this use of pseudo-attachment is shown in Figure 5, where the participial
phrase blown ashore 375 years ago modifies either warriors or boatload, but there is no way of settling the question: both attachments mean exactly the same thing. In the case at hand, the pseudo-attachment notation indicates that the annotator of the sentence thought that VP-1 is most likely a modifier of warriors, but that it is also possible that it is a modifier of boatload. A second use of pseudo-attachment is to allow annotators
to represent the "underlying" position of extraposed elements; in addition to being attached in its superficial position in the tree, the extraposed constituent is pseudo-attached within the constituent to which it is semantically related. Note that except for the device of pseudo-attachment, the skeletal analysis of the Treebank is entirely restricted to simple context-free trees.
The reader may have noticed that the ADJP brackets in Figure 4 have vanished in
Figure 5. For the sake of the overall efficiency of the annotation task, we leave all ADJP
brackets in the simplified structure, with the annotators expected to remove many
of them during annotation. The reason for this is somewhat complex, but provides
a good example of the considerations that come into play in designing the details
of annotation methods. The first relevant fact is that Fidditch only outputs ADJP
brackets within NPs for adjective phrases containing more than one lexical item. To
be consistent, the final structure must contain ADJP nodes for all adjective phrases
within NPs or for none; we have chosen to delete all such nodes within NPs under
normal circumstances. (This does not affect the use of the ADJP tag for predicative
adjective phrases outside of NPs.) In a seemingly unrelated guideline, all coordinate
structures are annotated in the Treebank; such coordinate structures are represented
by Chomsky-adjunction when the two conjoined constituents bear the same label.
This means that if an NP contains coordinated adjective phrases, then an ADJP tag
will be used to tag that coordination, even though simple ADJPs within NPs will not
bear an ADJP tag. Experience has shown that annotators can delete pairs of brackets
extremely quickly using the mouse-based tools, whereas creating brackets is a much
slower operation. Because the coordination of adjectives is quite common, it is more
efficient to leave in ADJP labels, and delete them if they are not part of a coordinate
structure, than to reintroduce them if necessary.
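A constructed illustration of the resulting convention (these phrases are not drawn from the corpus): a simple prenominal adjective phrase surfaces with no ADJP brackets in the corrected skeletal structure, while coordinated adjective phrases are Chomsky-adjoined under a single ADJP node:

(NP nervous newcomers)

(NP (ADJP (ADJP battle-tested)
          and
          (ADJP confident))
    managers)

Only in the second, coordinated case do the ADJP brackets survive correction.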
5. Progress to Date
5.1 Composition and Size of Corpus
Table 4 shows the output of the Penn Treebank project at the end of its first phase. All the materials listed in Table 4 are available on CD-ROM to members of the Linguistic Data Consortium. About 3 million words of POS-tagged material and a small sampling of skeletally parsed text are available as part of the first Association for Computational Linguistics/Data Collection Initiative CD-ROM, and a somewhat larger subset of materials is available on cartridge tape directly from the Penn Treebank Project. For information, contact the first author of this paper or send e-mail to [email protected].