Penn Treebank 3/n

2.3.1 Automated Stage. During the early stages of the Penn Treebank project, the
initial automatic POS assignment was provided by PARTS (Church 1988), a stochastic
algorithm developed at AT&T Bell Labs. PARTS uses a modified version of the Brown
Corpus tagset close to our own and assigns POS tags with an error rate of 3-5%. The
output of PARTS was automatically tokenized8 and the tags assigned by PARTS were
automatically mapped onto the Penn Treebank tagset. This mapping introduces about
4% error, since the Penn Treebank tagset makes certain distinctions that the PARTS
tagset does not.9 A sample of the resulting tagged text, which has an error rate of
7-9%, is shown in Figure 1.
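To make the mapping step concrete, here is a minimal sketch in Python of mapping one tagset onto another with a lookup table. The tag names and mapping entries are illustrative assumptions, not the actual PARTS-to-Penn mapping; the point is that wherever the source tagset collapses a distinction the Penn Treebank tagset makes (see footnote 9), the mapper must pick a default, and that default is sometimes wrong.

```python
# Hypothetical source-to-Penn tag mapping; not the project's actual mapping table.
# Where the source tagset collapses a Penn distinction (e.g. preposition IN vs.
# particle RP, or infinitive VB vs. present-tense VBP), a default must be chosen.
MAPPING = {
    "VB": "VB",     # source VB also covers Penn VBP; default to VB
    "IN": "IN",     # source IN also covers Penn RP; default to IN
    "AT": "DT",     # a Brown-style article tag mapped onto Penn DT
    "NN": "NN",
    "NNS": "NNS",
}

def map_tags(tagged_tokens):
    """Map (word, source_tag) pairs onto the Penn tagset, keeping unknown tags as-is."""
    return [(word, MAPPING.get(tag, tag)) for word, tag in tagged_tokens]

# "run up a bill": the defaulted IN is wrong here, since the Penn tagset wants RP.
print(map_tags([("run", "VB"), ("up", "IN"), ("a", "AT"), ("bill", "NN")]))
```

Deterministic defaults of this kind are one way a tagset mapping can add a roughly constant error on top of the tagger's own error rate.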
    More recently, the automatic POS assignment is provided by a cascade of stochastic
and rule-driven taggers developed on the basis of our early experience. Since these
taggers are based on the Penn Treebank tagset, the 4% error rate introduced as an
artefact of mapping from the PARTS tagset to ours is eliminated, and we obtain error
rates of 2-6%.
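The sketch below, in Python with an invented lexicon and a single invented rule, only illustrates the general shape of such a cascade: a stochastic first pass proposes a tag for every token, and rule-driven patches then correct tags in context. It is not the project's tagger.

```python
# Toy cascade: a stochastic first pass followed by rule-driven corrections.
UNIGRAM_TAGS = {  # hypothetical most-frequent tag per word
    "buck": "VBP", "up": "IN", "the": "DT", "nervous": "JJ", "newcomers": "NNS",
}

def stochastic_pass(words):
    """First pass: assign each word its most frequent tag (default NN)."""
    return [(w, UNIGRAM_TAGS.get(w, "NN")) for w in words]

def rule_pass(tagged):
    """Second pass: contextual rules patch the stochastic output."""
    out = list(tagged)
    for i, (word, tag) in enumerate(out):
        # Example rule: "up" directly after a verb is a particle (RP), not a preposition.
        if word == "up" and i > 0 and out[i - 1][1].startswith("VB"):
            out[i] = (word, "RP")
    return out

print(rule_pass(stochastic_pass(["buck", "up", "the", "nervous", "newcomers"])))
# [('buck', 'VBP'), ('up', 'RP'), ('the', 'DT'), ('nervous', 'JJ'), ('newcomers', 'NNS')]
```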
2.3.2 Manual Correction Stage. The result of the first, automated stage of POS tagging
is given to annotators to correct. The annotators use a mouse-based package written
in GNU Emacs Lisp, which is embedded within the GNU Emacs editor (Lewis et al. 1990).
8 In contrast to the Brown Corpus, we do not allow compound tags of the sort illustrated above for I'm.
Rather, contractions and the Anglo-Saxon genitive of nouns are automatically split into their
component morphemes, and each morpheme is tagged separately. Thus, children's is tagged
  "children/NNS 's/POS," and won't is tagged "wo/MD n't/RB."
9 The two largest sources of mapping error are that the PARTS tagset distinguishes neither infinitives
from non-third person singular present tense forms of verbs, nor prepositions from particles in cases
like run up a hill and run up a bill.
    Battle-tested/NNP industrial/JJ managers/NNS here/RB
always/RB buck/VB up/IN nervous/JJ newcomers/NNS with/IN the/DT tale/NN
of/IN the/DT first/JJ of/IN their/PP$ countrymen/NNS to/TO visit/VB
Mexico/NNP ,/, a/DT boatload/NN of/IN samurai/NNS warriors/NNS
blown/VBN ashore/RB 375/CD years/NNS ago/RB ./.
    “/“ From/IN the/DT beginning/NN ,/, it/PRP took/VBD a/DT man/NN
with/IN extraordinary/JJ qualities/NNS to/TO succeed/VB in/IN Mexico/NNP ,/,
”/” says/VBZ Kimihide/NNP Takimura/NNP ,/, president/NN of/IN Mitsui/NNS
group/NN 's/POS Kensetsu/NNP Engineering/NNP Inc./NNP unit/NN ./.
Figure 1
Sample tagged text (before correction)
    Battle-tested/NNP*/JJ industrial/JJ managers/NNS here/RB
always/RB buck/VB*/VBP up/IN*/RP nervous/JJ newcomers/NNS with/IN
the/DT tale/NN of/IN the/DT first/JJ of/IN their/PP$ countrymen/NNS to/TO
visit/VB Mexico/NNP ,/, a/DT boatload/NN of/IN samurai/NNS*/FW
warriors/NNS blown/VBN ashore/RB 375/CD years/NNS ago/RB ./.
    “/“ From/IN the/DT beginning/NN ,/, it/PRP took/VBD a/DT man/NN
with/IN extraordinary/JJ qualities/NNS to/TO succeed/VB in/IN Mexico/NNP ,/,
”/” says/VBZ Kimihide/NNP Takimura/NNP ,/, president/NN of/IN
Mitsui/NNS*/NNP group/NN 's/POS Kensetsu/NNP Engineering/NNP Inc./NNP
unit/NN ./.
Figure 2
Sample tagged text (after correction)
    The package allows annotators to correct POS assignment errors by positioning
the cursor on an incorrectly tagged word and then entering the desired correct tag
(or sequence of multiple tags). The annotators' input is automatically checked against
the list of legal tags in Table 2 and, if valid, appended to the original word-tag pair
separated by an asterisk. Appending the new tag rather than replacing the old tag
allows us to easily identify recurring errors at the automatic POS assignment stage.
We believe that the confusion matrices that can be extracted from this information
should also prove useful in designing better automatic taggers in the future. The result
of this second stage of POS tagging is shown in Figure 2. Finally, in the distribution
version of the tagged corpus, any incorrect tags assigned at the first, automatic stage
are removed.
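The word/OLDTAG*/NEWTAG convention also makes the automatic stage's errors easy to count. The sketch below, in Python rather than the Emacs Lisp package described above, parses that format as it appears in Figure 2 and tallies the (automatic tag, corrected tag) pairs that make up a confusion matrix; the parsing helper is ours, not part of the project's tools.

```python
from collections import Counter

# One corrected line, taken from Figure 2 above.
corrected = "Battle-tested/NNP*/JJ industrial/JJ managers/NNS here/RB"

def parse_token(token):
    """Return (word, automatic_tag, final_tag) for a word/TAG or word/TAG*/TAG token."""
    word, _, tags = token.partition("/")
    if "*/" in tags:                      # the annotator appended a correction
        old, new = tags.split("*/")
        return word, old, new
    return word, tags, tags               # tag left unchanged

confusion = Counter()
for tok in corrected.split():
    _, old, new = parse_token(tok)
    confusion[(old, new)] += 1

# Off-diagonal cells (old != new) are errors made by the automatic stage.
print(confusion)
# Counter({('NNP', 'JJ'): 1, ('JJ', 'JJ'): 1, ('NNS', 'NNS'): 1, ('RB', 'RB'): 1})
```

Tallying these pairs over many corrected files yields the confusion matrices mentioned above.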
    The learning curve for the POS tagging task takes under a month (at 15 hours a
week), and annotation speeds after a month exceed 3,000 words per hour.
3. Two Modes of Annotation: An Experiment
To determine how to maximize the speed, inter-annotator consistency, and accuracy of
POS tagging, we performed an experiment at the very beginning of the project to com-
pare two alternative modes of annotation. In the first annotation mode ("tagging"),
annotators tagged unannotated text entirely by hand; in the second mode ("correct-
ing"), they verified and corrected the output of PARTS, modified as described above.
This experiment showed that manual tagging took about twice as long as correcting,
with about twice the inter-annotator disagreement rate and an error rate that was
about 50% higher.
    Four annotators, all with graduate training in linguistics, participated in the exper-
iment. All completed a training sequence consisting of 15 hours of correcting followed
by 6 hours of tagging. The training material was selected from a variety of nonfiction
genres in the Brown Corpus. All the annotators were familiar with GNU Emacs at the
outset of the experiment. Eight 2,000-word samples were selected from the Brown Cor-
pus, two each from four different genres (two fiction, two nonfiction), none of which
any of the annotators had encountered in training. The texts for the correction task
were automatically tagged as described in Section 2.3. Each annotator first manually
tagged four texts and then corrected four automatically tagged texts. Each annotator
completed the four genres in a different permutation.
    A repeated measures analysis of annotation speed with annotator identity, genre,
and annotation mode (tagging vs. correcting) as classification variables showed a sig-
nificant annotation mode effect (p=.05). No other effects or interactions were signif-
icant. The average speed for correcting was more than twice as fast as the average
speed for tagging: 20 minutes vs. 44 minutes per 1,000 words. (Median speeds per
1,000 words were 22 vs. 42 minutes.)
    A simple measure of tagging consistency is the inter-annotator disagreement rate, the
rate at which annotators disagree with one another over the tagging of lexical tokens,
expressed as the number of such disagreements divided by the number of words in a
given text sample, as a percentage. For a given text and n annotators, there are n(n-1)/2
such disagreement ratios (one for each possible pair of annotators). Mean inter-annotator
disagreement was 7.2% for the tagging task and 4.1% for the correcting task (with me-
dians 7.2% and 3.6%, respectively). Upon examination, a disproportionate amount of
disagreement in the correcting case was found to be caused by one text that contained
many instances of a cover symbol for chemical and other formulas. In the absence of
an explicit guideline for tagging this case, the annotators had made different decisions
on what part of speech this cover symbol represented. When this text is excluded
from consideration, mean inter-annotator disagreement for the correcting task drops
to 3.5%, with the median unchanged at 3.6%.
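For concreteness, here is a small Python sketch of the disagreement measure just described: for each of the n(n-1)/2 annotator pairs, the percentage of tokens on which their tags differ. The three tag sequences are invented; running the same function against the benchmark tags described in the next paragraph gives the accuracy measure used there.

```python
from itertools import combinations

def disagreement_rate(tags_a, tags_b):
    """Percentage of token positions on which two annotators' tags differ."""
    assert len(tags_a) == len(tags_b)
    differing = sum(a != b for a, b in zip(tags_a, tags_b))
    return 100.0 * differing / len(tags_a)

annotations = {                      # hypothetical tags for a five-token text
    "annotator1": ["NNP", "JJ", "NNS", "RB", "VBP"],
    "annotator2": ["NNP", "JJ", "NNS", "RB", "VB"],
    "annotator3": ["JJ",  "JJ", "NNS", "RB", "VBP"],
}

# One ratio per pair of annotators: n(n-1)/2 pairs for n annotators.
for (name_a, a), (name_b, b) in combinations(annotations.items(), 2):
    print(f"{name_a} vs {name_b}: {disagreement_rate(a, b):.1f}%")   # 20.0%, 20.0%, 40.0%
```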
    Consistency, while desirable, tells us nothing about the validity of the annotators'
corrections. We therefore compared each annotator's output not only with the output
of each of the others, but also with a benchmark version of the eight texts. This
benchmark version was derived from the tagged Brown Corpus by (1) mapping the
original Brown Corpus tags onto the Penn Treebank tagset and (2) carefully hand-
correcting the revised version in accordance with the tagging conventions in force at
the time of the experiment. Accuracy was then computed as the rate of disagreement
between each annotator's results and the benchmark version. The mean accuracy was
5.4% for the tagging task (median 5.7%) and 4.0% for the correcting task (median 3.4%).
Excluding the same text as above gives a revised mean accuracy for the correcting task
of 3.4%, with the median unchanged.
    We obtained a further measure of the annotators' accuracy by comparing their
error rates to the rates at which the raw output of Church's PARTS program (appropriately
modified to conform to the Penn Treebank tagset) disagreed with the benchmark
version. The mean disagreement rate between PARTS and the benchmark version was
9.6%, while the corrected version had a mean disagreement rate of 5.4%, as noted
above. The annotators thus reduced the error rate by about 4.2 percentage points.
