3.6 Normalizing Text
In earlier program examples we have often converted text to lowercase before doing anything with its words, e.g., set(w.lower() for w in text). By using lower(), we have normalized the text to lowercase so that the distinction between The and the is ignored. Often we want to go further than this and strip off any affixes, a task known as stemming. A further step is to make sure that the resulting form is a known word in a dictionary, a task known as lemmatization. We discuss each of these in turn. First, we need to define the data we will use in this section:
>>> raw = """DENNIS: Listen, strange women lying in ponds distributing swords
... is no basis for a system of government. Supreme executive power derives from
... a mandate from the masses, not from some farcical aquatic ceremony."""
>>> tokens = nltk.word_tokenize(raw)
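As a small illustration of the lowercase normalization described above, the following sketch compares the size of a vocabulary before and after calling lower(). The sentence is made up for this example, and it is split with str.split() rather than nltk.word_tokenize() only to keep the snippet self-contained.

```python
# Toy sentence (an assumption for illustration; not taken from the data above).
sample = "The cat saw the dog and THE dog saw the cat"
words = sample.split()

vocab_raw = set(words)                       # case-sensitive vocabulary
vocab_lower = set(w.lower() for w in words)  # case-normalized vocabulary

print(sorted(vocab_raw))
print(sorted(vocab_lower))
print(len(vocab_raw), len(vocab_lower))
```

After normalization, The, the, and THE collapse into a single vocabulary entry.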
Stemmers
NLTK includes several off-the-shelf stemmers, and if you ever need a stemmer, you should use one of these in preference to crafting your own using regular expressions, since NLTK's stemmers handle a wide range of irregular cases. The Porter and Lancaster stemmers follow their own rules for stripping affixes. Observe that the Porter stemmer correctly handles the word lying (mapping it to lie), whereas the Lancaster stemmer does not. (Neither stemmer handles basis well: Porter yields basi and Lancaster yields bas.)
>>> porter = nltk.PorterStemmer()
>>> lancaster = nltk.LancasterStemmer()
>>> [porter.stem(t) for t in tokens]
['DENNI', ':', 'Listen', ',', 'strang', 'women', 'lie', 'in', 'pond',
'distribut', 'sword', 'is', 'no', 'basi', 'for', 'a', 'system', 'of', 'govern',
'.', 'Suprem', 'execut', 'power', 'deriv', 'from', 'a', 'mandat', 'from',
'the', 'mass', ',', 'not', 'from', 'some', 'farcic', 'aquat', 'ceremoni', '.']
>>> [lancaster.stem(t) for t in tokens]
['den', ':', 'list', ',', 'strange', 'wom', 'lying', 'in', 'pond', 'distribut',
'sword', 'is', 'no', 'bas', 'for', 'a', 'system', 'of', 'govern', '.', 'suprem',
'execut', 'pow', 'der', 'from', 'a', 'mand', 'from', 'the', 'mass', ',', 'not',
'from', 'som', 'farc', 'aqu', 'ceremony', '.']
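To see why hand-crafted suffix stripping falls short of the stemmers above, consider a minimal sketch of a regular-expression stemmer. The function name naive_stem and the suffix list are assumptions chosen for illustration; they are not part of NLTK.

```python
import re

# Non-greedy stem followed by at most one suffix from a fixed list.
# The suffix list is an illustrative assumption, not a complete rule set.
SUFFIX_RE = re.compile(r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)?$')

def naive_stem(word):
    """Strip one common English suffix, with no check that the result is sensible."""
    return SUFFIX_RE.match(word).group(1)

for w in ['ponds', 'swords', 'lying', 'women', 'is']:
    print(w, '->', naive_stem(w))
```

The sketch handles ponds and swords, but it reduces lying to ly and is to i, and it leaves women untouched. Irregular cases like these are what the off-the-shelf stemmers are designed to cope with.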
Stemming is not a well-defined process, and we typically pick the stemmer that best suits the application we have in mind. The Porter Stemmer is a good choice if you are indexing some texts and want to support search using alternative forms of words (illustrated in Example 3-1, which uses object-oriented programming techniques that are outside the scope of this book, string formatting techniques to be covered in Section 3.9, and the enumerate() function to be explained in Section 4.2).
Example 3-1. Indexing a text using a stemmer.
class IndexedText(object):

    def __init__(self, stemmer, text):
        self._text = text
        self._stemmer = stemmer
        # Map each stem to the positions in the text where it occurs
        self._index = nltk.Index((self._stem(word), i)
                                 for (i, word) in enumerate(text))

    def concordance(self, word, width=40):
        key = self._stem(word)
        wc = width // 4                # words of context (integer division)
        for i in self._index[key]:
            lcontext = ' '.join(self._text[i-wc:i])
            rcontext = ' '.join(self._text[i:i+wc])
            # Right-justify the left context and left-justify the right context
            ldisplay = '%*s' % (width, lcontext[-width:])
            rdisplay = '%-*s' % (width, rcontext[:width])
            print(ldisplay, rdisplay)

    def _stem(self, word):
        return self._stemmer.stem(word).lower()
>>> porter = nltk.PorterStemmer()
>>> grail = nltk.corpus.webtext.words('grail.txt')
>>> text = IndexedText(porter, grail)
>>> text.concordance('lie')
r king ! DENNIS : Listen , strange women lying in ponds distributing swords is no
beat a very brave retreat . ROBIN : All lies ! MINSTREL : [ singing ] Bravest of
Nay . Nay . Come . Come . You may lie here . Oh , but you are wounded !
doctors immediately ! No , no , please ! Lie down . [ clap clap ] PIGLET : Well
ere is much danger , for beyond the cave lies the Gorge of Eternal Peril , which
you . Oh ... TIM : To the north there lies a cave -- the cave of Caerbannog --
h it and lived ! Bones of full fifty men lie strewn about its lair . So , brave k
not stop our fight ' til each one of you lies dead , and the Holy Grail returns t
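The idea behind Example 3-1 can also be sketched without classes: build a mapping from each stem to the positions where it occurs, then look up the stem of the query word. In the sketch below, toy_stem is a hypothetical stand-in for a real stemmer such as the Porter stemmer, build_index is an illustrative helper name, and a plain defaultdict plays the role that nltk.Index plays in the example.

```python
from collections import defaultdict

def toy_stem(word):
    """Hypothetical stand-in for a real stemmer: lowercase and strip a final 's'."""
    word = word.lower()
    return word[:-1] if word.endswith('s') and len(word) > 3 else word

def build_index(words, stem):
    """Map each stem to the list of token positions where it occurs."""
    index = defaultdict(list)
    for i, word in enumerate(words):
        index[stem(word)].append(i)
    return index

# Made-up token sequence for illustration.
words = "You may lie here . All lies ! Lie down .".split()
index = build_index(words, toy_stem)

# Searching for 'lie' finds 'lie', 'lies', and 'Lie'.
for i in index[toy_stem('lie')]:
    print(i, words[i])
```

Because both the indexed words and the query pass through the same stemming function, a search for one form of a word retrieves its other forms as well.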
Lemmatization
The WordNet lemmatizer removes affixes only if the resulting word is in its dictionary. This additional checking process makes the lemmatizer slower than the stemmers just mentioned. Notice that it doesn’t handle lying, but it converts women to woman.
>>> wnl = nltk.WordNetLemmatizer()
>>> [wnl.lemmatize(t) for t in tokens]
['DENNIS', ':', 'Listen', ',', 'strange', 'woman', 'lying', 'in', 'pond',
'distributing', 'sword', 'is', 'no', 'basis', 'for', 'a', 'system', 'of',
'government', '.', 'Supreme', 'executive', 'power', 'derives', 'from', 'a',
'mandate', 'from', 'the', 'mass', ',', 'not', 'from', 'some', 'farcical',
'aquatic', 'ceremony', '.']
The WordNet lemmatizer is a good choice if you want to compile the vocabulary of some texts and want a list of valid lemmas (or lexicon headwords).
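The principle of removing an affix only when the result is a known word can be sketched with a toy lexicon. Everything in the following snippet (the LEXICON set, the IRREGULAR table, the suffix rules, and the function name toy_lemmatize) is an assumption made for illustration; it is not WordNet's data or the algorithm used by the WordNet lemmatizer.

```python
# Tiny hand-made lexicon and irregular-form table (illustrative assumptions only).
LEXICON = {'woman', 'pond', 'sword', 'mass', 'basis', 'ceremony', 'government'}
IRREGULAR = {'women': 'woman'}

# (suffix to remove, text to put in its place), tried in order.
SUFFIX_RULES = [('ses', 's'), ('ies', 'y'), ('es', 'e'), ('es', ''), ('s', '')]

def toy_lemmatize(word):
    """Return a candidate lemma only if it is in the lexicon; otherwise return the word unchanged."""
    if word in IRREGULAR:
        return IRREGULAR[word]
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix):
            candidate = word[:-len(suffix)] + replacement
            if candidate in LEXICON:
                return candidate
    return word

for w in ['women', 'ponds', 'masses', 'basis', 'lying']:
    print(w, '->', toy_lemmatize(w))
```

Unlike a stemmer, this sketch leaves basis alone, because the candidate basi is not in the lexicon. The extra lookup for every candidate is also the reason a dictionary-checked approach tends to be slower than plain suffix stripping.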
Another normalization task involves identifying non-standard words, including numbers, abbreviations, and dates, and mapping any such tokens to a special vocabulary. For example, every decimal number could be mapped to a single token 0.0, and every acronym could be mapped to AAA. This keeps the vocabulary small and improves the accuracy of many language modeling tasks.
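A minimal sketch of this kind of mapping, using regular expressions, might look as follows. The two patterns and the function name normalize_token are assumptions for illustration. In particular, the acronym pattern treats any token made only of capital letters (optionally followed by periods) as an acronym, so it would also catch a capitalized name such as DENNIS.

```python
import re

DECIMAL_RE = re.compile(r'\d+\.\d+')         # e.g. 3.14, 27.5
ACRONYM_RE = re.compile(r'(?:[A-Z]\.?){2,}')  # e.g. NASA, U.S.A.

def normalize_token(token):
    """Map decimal numbers to '0.0' and acronym-like tokens to 'AAA'."""
    if DECIMAL_RE.fullmatch(token):
        return '0.0'
    if ACRONYM_RE.fullmatch(token):
        return 'AAA'
    return token

# Made-up token list for illustration.
sample_tokens = ['NASA', 'paid', '3.14', 'and', '27.5', 'to', 'the', 'U.S.A.', 'in', 'May']
normalized = [normalize_token(t) for t in sample_tokens]

print(normalized)
print(len(set(sample_tokens)), len(set(normalized)))
```

Here the two decimal numbers share one vocabulary entry and the two acronyms share another, so the vocabulary shrinks from ten distinct tokens to eight.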