原文摘自:Learning Subjective Nouns using Extraction Pattern Bootstrapping
3.1 Meta-Bootstrapping
The Meta-Bootstrapping (“MetaBoot”) process (Riloff and Jones, 1999) begins with a small set of seed words that represent a targeted semantic category (e.g., 10 words that represent LOCATIONS) and an unannotated corpus. First, MetaBoot automatically creates a set of extraction patterns for the corpus by applying and instantiating syntactic templates. This process literally produces thousands of extraction patterns that, collectively, will extract every noun phrase in the corpus. Next, MetaBoot computes a score for each pattern based upon the number of seed words among its extractions. The best pattern is saved and all of its extracted noun phrases are automatically labeled as the targeted semantic category. MetaBoot then re-scores the extraction patterns, using the original seed words as well as the newly labeled words, and the process repeats. This procedure is called mutual bootstrapping.
A second level of bootstrapping (the “meta-” bootstrapping part) makes the algorithm more robust. When the mutual bootstrapping process is finished, all nouns that were put into the semantic dictionary are reevaluated. Each noun is assigned a score based on how many different patterns extracted it. Only the five best nouns are allowed to remain in the dictionary. The other entries are discarded, and the mutual bootstrapping process starts over again using the revised semantic dictionary.
3.2 Basilisk
Basilisk (Thelen and Riloff, 2002) is a more recent bootstrapping algorithm that also utilizes extraction patterns
to create a semantic dictionary. Similarly, Basilisk begins with an unannotated text corpus and a small set of seed words for a semantic category. The bootstrapping process involves three steps. (1) Basilisk automatically generates a set of extraction patterns for the corpus and scores each pattern based upon the number of seed words among its extractions. This step is identical to the first step of Meta-Bootstrapping. Basilisk then puts the best patterns into a Pattern Pool. (2) All nouns extracted by a pattern in the Pattern Pool are put into a Candidate Word Pool. Basilisk scores each noun based upon the set of patterns that extracted it and their collective association with the seed words. (3) The top 10 nouns are labeled as the targeted semantic class and are added to the dictionary. The bootstrapping process then repeats, using the original seeds and the newly labeled words.
The main difference between Basilisk and MetaBootstrapping is that Basilisk scores each noun based on collective information gathered from all patterns that extracted it. In contrast, Meta-Bootstrapping identifies a single best pattern and assumes that everything it extracted belongs to the same semantic class. The second level of bootstrapping smoothes over some of the problems caused by this assumption. In comparative experiments (Thelen and Riloff, 2002), Basilisk outperformed Meta-Bootstrapping. But since our goal of learning subjective nouns is different from the original intent of the algorithms, we tried them both. We also suspected they might learn different words, in which case using both algorithms could be worthwhile.