1 Basic Objects and Methods
1.1 nltk.text.Text
>>> from nltk.book import *
*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908
>>> text1
>>> type(text1)
>>> dir(text1)
['_CONTEXT_RE', '_COPY_TOKENS', '__class__', '__delattr__', '__dict__', '__doc__', '__format__', '__getattribute__', '__getitem__', '__hash__', '__init__', '__len__', '__module__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__unicode__', '__weakref__', '_context', 'collocations', 'common_contexts', 'concordance', 'count', 'dispersion_plot', 'findall', 'generate', 'index', 'name', 'plot', 'readability', 'similar', 'tokens', 'unicode_repr', 'vocab']
Text.concordance(word)
Finds occurrences of a word and displays some surrounding context.
>>> text1.concordance("affection")
Displaying 3 of 3 matches:
oyously assented ; for besides the affection I now felt for Queequeg , he was a
e enough , yet he had a particular affection for his own harpoon , because it w
ing cobbling jobs . Lord ! what an affection all old women have for tinkers . I
Text.similar(word)
Finds words that appear in contexts similar to those of the given word.
>>> text1.similar("monstrous")
imperial subtly impalpable pitiable curious abundant perilous
trustworthy untoward singular lamentable few determined maddens
horrible tyrannical lazy mystifying christian exasperate
Text.common_contexts([word1, word2, ...])
Finds the contexts shared by all of the given words, i.e. contexts in which word1, word2, ... all appear.
>>> text2.common_contexts(["monstrous", "very"])
a_pretty is_pretty a_lucky am_glad be_glad
Text.dispersion_plot([word1, word2, ...])
Draws a dispersion plot showing the positions at which each of the given words occurs in the text.
>>> text4.dispersion_plot(["citizens", "democracy", "freedom", "duties", "America"])
1.2 nltk.probability.FreqDist
Automatically builds the frequency distribution of the vocabulary in a corpus.
方法 Method | Description |
---|---|
fdist=FreqDist(samples) | Create a frequency distribution from the given samples (samples can be an nltk.text.Text, a whitespace-separated string, a list, etc.) |
fdist.inc(sample) | Increment the count for this sample (older NLTK versions; in NLTK 3 use fdist[sample] += 1) |
fdist[word] | Number of times word occurs in the samples |
fdist.freq(word) | Relative frequency of word in the samples |
fdist.N() | Total number of samples |
fdist.keys() | List of samples |
for sample in fdist: | Iterate over the samples in decreasing order of frequency |
fdist.max() | Sample with the highest count |
fdist.plot() | Plot the frequency distribution |
fdist.plot(cumulative=True) | Plot a cumulative frequency distribution |
>>> fdist = FreqDist(text1)
>>> fdist.plot(50, cumulative=True)
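As a quick sketch of several other methods from the table, continuing with the fdist built from text1 above (outputs omitted; exact values depend on the NLTK version and data):
>>> fdist["whale"] #how many times "whale" occurs in Moby Dick
>>> fdist.freq("whale") #relative frequency, i.e. fdist["whale"] / fdist.N()
>>> fdist.N() #total number of tokens
>>> fdist.max() #the single most frequent sample
>>> fdist.plot(20) #plot only the 20 most frequent samples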
1.3 nltk.util.bigrams
A collocation is a sequence of words that frequently occur together. To find collocations we first need to extract bigrams (pairs of adjacent words) from the text, which is exactly what bigrams does:
>>> list(bigrams(["more", "is", "said", "than", "done"]))
[('more', 'is'), ('is', 'said'), ('said', 'than'), ('than', 'done')]
Apart from cases involving rare words, collocations in English text are essentially bigrams that occur frequently (see the sketch after the collocations() output below). nltk.text.Text provides a collocations(self, num=20, window_size=2) method that extracts common collocations directly from a Text object, as shown here:
>>> text4.collocations()
United States; fellow citizens; four years; years ago; Federal
Government; General Government; American people; Vice President; Old
World; Almighty God; Fellow citizens; Chief Magistrate; Chief Justice;
God bless; every citizen; Indian tribes; public debt; one another;
foreign nations; political parties
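To see what raw bigram counts look like, here is a hedged sketch that simply counts the bigrams of text4 with FreqDist. The most frequent pairs are dominated by function words, which is why collocations() additionally filters out stop words and very short words before ranking the remaining bigrams:
>>> bigram_fd = FreqDist(bigrams(text4))
>>> bigram_fd.max() #the single most frequent bigram
>>> sorted(bigram_fd.items(), key=lambda kv: kv[1], reverse=True)[:10] #the ten most frequent bigrams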
2 Corpora and How to Use Them
This section mainly uses the corpora provided in nltk.corpus.
2.1 nltk.corpus.reader
The corpus readers provide efficient access to large corpora. Their main methods are listed below (a usage sketch follows the table):
Method | Description |
---|---|
fileids() | Files in the corpus |
fileids([categories]) | Files in the corpus belonging to the given categories (not available for every corpus) |
categories() | Categories in the corpus (not available for every corpus) |
categories([fileids]) | Categories in the corpus corresponding to the given files |
raw() | Raw content of the corpus |
raw(fileids=[f1, f2, f3]) | Raw content of the given files |
raw(categories=[c1, c2]) | Raw content of the given categories |
words() | Words of the whole corpus |
words(fileids=[f1, f2, f3]) | Words of the given files |
words(categories=[c1, c2]) | Words of the given categories |
sents() | Sentences of the whole corpus |
sents(fileids=[f1, f2, f3]) | Sentences of the given files |
sents(categories=[c1, c2]) | Sentences of the given categories |
abspath(fileid) | Location of the given file on disk |
encoding(fileid) | Encoding of the given file |
open(fileid) | Open a stream for reading the given corpus file |
root() | Path to the root of the locally installed corpus |
readme() | Contents of the corpus README file |
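As a sketch of how these access methods combine in practice (assuming the gutenberg corpus used in section 2.2 has been downloaded), the following prints the average word length and average sentence length of each Gutenberg file:
>>> from nltk.corpus import gutenberg
>>> for fileid in gutenberg.fileids():
...     num_chars = len(gutenberg.raw(fileid)) #character count via raw()
...     num_words = len(gutenberg.words(fileid)) #token count via words()
...     num_sents = len(gutenberg.sents(fileid)) #sentence count via sents()
...     print fileid, num_chars / num_words, num_words / num_sents
...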
2.2 A Brief Look at Several Corpora
>>> from nltk.corpus import gutenberg #Gutenberg corpus
>>> from nltk.corpus import webtext #web text corpus
>>> from nltk.corpus import nps_chat #chat corpus
>>> from nltk.corpus import brown #Brown corpus
>>> from nltk.corpus import reuters #Reuters corpus
>>> from nltk.corpus import inaugural #inaugural address corpus
fileids()
>>> gutenberg.fileids() #list the files in the corpus
[u'austen-emma.txt', u'austen-persuasion.txt', u'austen-sense.txt', u'bible-kjv.txt', u'blake-poems.txt', u'bryant-stories.txt', u'burgess-busterbrown.txt', u'carroll-alice.txt', u'chesterton-ball.txt', u'chesterton-brown.txt', u'chesterton-thursday.txt', u'edgeworth-parents.txt', u'melville-moby_dick.txt', u'milton-paradise.txt', u'shakespeare-caesar.txt', u'shakespeare-hamlet.txt', u'shakespeare-macbeth.txt', u'whitman-leaves.txt']
categories
>>> brown.categories()
[u'adventure', u'belles_lettres', u'editorial', u'fiction', u'government', u'hobbies', u'humor', u'learned', u'lore', u'mystery', u'news', u'religion', u'reviews', u'romance', u'science_fiction']
words()
>>> nps_chat.words() #all the words in the corpus
[u'now', u'im', u'left', u'with', u'this', u'gay', ...]
>>> gutenberg.words(["austen-emma.txt"]) #words of the given file
[u'[', u'Emma', u'by', u'Jane', u'Austen', u'1816', ...]
>>> brown.words(categories="news") #words in the news category
[u'The', u'Fulton', u'County', u'Grand', u'Jury', ...]
sents()
>>> inaugural.sents() #all the sentences in the corpus
[[u'Fellow', u'-', u'Citizens', u'of', u'the', u'Senate', u'and', u'of', u'the', u'House', u'of', u'Representatives', u':'], [u'Among', u'the', u'vicissitudes', u'incident', u'to', u'life', u'no', u'event', u'could', u'have', u'filled', u'me', u'with', u'greater', u'anxieties', u'than', u'that', u'of', u'which', u'the', u'notification', u'was', u'transmitted', u'by', u'your', u'order', u',', u'and', u'received', u'on', u'the', u'14th', u'day', u'of', u'the', u'present', u'month', u'.'], ...]
>>> gutenberg.sents(["shakespeare-macbeth.txt"])
[[u'[', u'The', u'Tragedie', u'of', u'Macbeth', u'by', u'William', u'Shakespeare', u'1603', u']'], [u'Actus', u'Primus', u'.'], ...]
>>> brown.sents(categories=["mystery"])
[[u'There', u'were', u'thirty-eight', u'patients', u'on', u'the', u'bus', u'the', u'morning', u'I', u'left', u'for', u'Hanover', u',', u'most', u'of', u'them', u'disturbed', u'and', u'hallucinating', u'.'], [u'An', u'interne', u',', u'a', u'nurse', u'and', u'two', u'attendants', u'were', u'in', u'charge', u'of', u'us', u'.'], ...]
raw()
>>> for fileid in webtext.fileids():
... print fileid, webtext.raw(fileid)[:65], "..." #raw content of each file
...
firefox.txt Cookie Manager: "Don't allow sites that set removed cookies to se ...
grail.txt SCENE 1: [wind] [clop clop clop]
KING ARTHUR: Whoa there! [clop ...
overheard.txt White guy: So, do you have any plans for this evening?
Asian girl ...
pirates.txt PIRATES OF THE CARRIBEAN: DEAD MAN'S CHEST, by Ted Elliott & Terr ...
singles.txt 25 SEXY MALE, seeks attrac older single lady, for discreet encoun ...
wine.txt Lovely delicate, fragrant Rhone wine. Polished leather and strawb ...
3 Conditional Frequency Distributions
nltk.probability.ConditionalFreqDist is a conditional frequency distribution, used to study frequency distributions under different conditions. Its main methods are listed below:
Method | Description |
---|---|
cfd=ConditionalFreqDist(pairs) | Create a conditional frequency distribution from a list of pairs |
cfd.conditions() | The conditions, in alphabetical order |
cfd[condition] | The frequency distribution for the given condition (a FreqDist) |
cfd[condition][sample] | Frequency of the given sample under the given condition |
cfd.tabulate() | Tabulate the conditional frequency distribution |
cfd.tabulate(samples, conditions) | Tabulate restricted to the given samples and conditions |
cfd.plot() | Plot the conditional frequency distribution |
cfd.plot(samples, conditions) | Plot restricted to the given samples and conditions |
Unlike FreqDist, ConditionalFreqDist takes a list of pairs as input, for example:
>>> from nltk.corpus import brown
>>> cfd = nltk.ConditionalFreqDist(
... (genre, word)
... for genre in brown.categories()
... for word in brown.words(categories=genre))
Looking at just two of the Brown categories, news and romance, we can organize them into the argument that ConditionalFreqDist expects (a list of pairs) as follows:
>>> genre_word = [(genre, word) for genre in ["news", "romance"] for word in brown.words(categories=genre)]
>>> len(genre_word)
170576
>>> genre_word[:5]
[('news', u'The'), ('news', u'Fulton'), ('news', u'County'), ('news', u'Grand'), ('news', u'Jury')]
>>> genre_word[-5:]
[('romance', u"I'm"), ('romance', u'afraid'), ('romance', u'not'), ('romance', u"''"), ('romance', u'.')]
conditions
>>> cfd = nltk.ConditionalFreqDist(genre_word)
>>> cfd
>>> cfd.conditions()
['romance', 'news']
cfd[condition] & cfd[condition][sample]
>>> cfd["news"]
FreqDist({u'the': 5580, u',': 5188, u'.': 4030, u'of': 2849, u'and': 2146, u'to': 2116, u'a': 1993, u'in': 1893, u'for': 943, u'The': 806, ...})
>>> cfd["romance"]
FreqDist({u',': 3899, u'.': 3736, u'the': 2758, u'and': 1776, u'to': 1502, u'a': 1335, u'of': 1186, u'``': 1045, u"''": 1044, u'was': 993, ...})
>>> cfd["romance"]["love"]
32
tabulate
>>> from nltk.corpus import udhr
>>> languages = ["Chickasaw", "English", "German_Deutsch"]
>>> cfd = nltk.ConditionalFreqDist(
... (lang, len(word))
... for lang in languages
... for word in udhr.words(lang+"-Latin1"))
>>> cfd.tabulate()
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 23
Chickasaw 411 99 41 68 91 89 77 70 49 33 16 28 45 10 6 4 5 3 2 1 1 1
English 185 340 358 114 169 117 157 118 80 63 50 12 11 6 1 0 0 0 0 0 0 0
German_Deutsch 171 92 351 103 177 119 97 103 62 58 53 32 27 29 15 14 3 7 5 2 1 0
>>> cfd.tabulate(conditions=["English", "German_Deutsch"], samples=range(10), cumulative=True)
0 1 2 3 4 5 6 7 8 9
English 0 185 525 883 997 1166 1283 1440 1558 1638
German_Deutsch 0 171 263 614 717 894 1013 1110 1213 1275
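The plot() method from the table works the same way but draws a graph instead. A small sketch (assuming matplotlib is installed and inaugural is imported as in section 2.2): count, for each inaugural address, how many tokens begin with america or citizen, and plot the two curves over time:
>>> cfd = nltk.ConditionalFreqDist(
...     (target, fileid[:4])
...     for fileid in inaugural.fileids()
...     for w in inaugural.words(fileid)
...     for target in ["america", "citizen"]
...     if w.lower().startswith(target))
>>> cfd.plot()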
4 Wordlist Corpora
>>> from nltk.corpus import words
>>> from nltk.corpus import stopwords
>>> from nltk.corpus import names
>>> from nltk.corpus import swadesh
words
A list of English words.
>>> words = words.words()
>>> len(words)
236736
>>> words[:20]
[u'A', u'a', u'aa', u'aal', u'aalii', u'aam', u'Aani', u'aardvark', u'aardwolf', u'Aaron', u'Aaronic', u'Aaronical', u'Aaronite', u'Aaronitic', u'Aaru', u'Ab', u'aba', u'Ababdeh', u'Ababua', u'abac']
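A typical use of this wordlist is to find the tokens of a text that are not in it, such as rare words or misspellings. A sketch (note that the module name words was rebound to the plain list above, so the list is used directly):
>>> english_vocab = set(w.lower() for w in words)
>>> text_vocab = set(w.lower() for w in nltk.corpus.gutenberg.words("austen-sense.txt") if w.isalpha())
>>> unusual = text_vocab - english_vocab
>>> sorted(unusual)[:10] #tokens of the novel that are missing from the wordlist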
stopwords
The stop words corpus.
>>> stopwords.words("english")
[u'i', u'me', u'my', u'myself', u'we', u'our', u'ours', u'ourselves', u'you', u'your', u'yours', u'yourself', u'yourselves', u'he', u'him', u'his', u'himself', u'she', u'her', u'hers', u'herself', u'it', u'its', u'itself', u'they', u'them', u'their', u'theirs', u'themselves', u'what', u'which', u'who', u'whom', u'this', u'that', u'these', u'those', u'am', u'is', u'are', u'was', u'were', u'be', u'been', u'being', u'have', u'has', u'had', u'having', u'do', u'does', u'did', u'doing', u'a', u'an', u'the', u'and', u'but', u'if', u'or', u'because', u'as', u'until', u'while', u'of', u'at', u'by', u'for', u'with', u'about', u'against', u'between', u'into', u'through', u'during', u'before', u'after', u'above', u'below', u'to', u'from', u'up', u'down', u'in', u'out', u'on', u'off', u'over', u'under', u'again', u'further', u'then', u'once', u'here', u'there', u'when', u'where', u'why', u'how', u'all', u'any', u'both', u'each', u'few', u'more', u'most', u'other', u'some', u'such', u'no', u'nor', u'not', u'only', u'own', u'same', u'so', u'than', u'too', u'very', u's', u't', u'can', u'will', u'just', u'don', u'should', u'now', u'd', u'll', u'm', u'o', u're', u've', u'y', u'ain', u'aren', u'couldn', u'didn', u'doesn', u'hadn', u'hasn', u'haven', u'isn', u'ma', u'mightn', u'mustn', u'needn', u'shan', u'shouldn', u'wasn', u'weren', u'won', u'wouldn']
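Stop words are high-frequency function words that are usually filtered out before further processing. A small sketch (assuming the reuters corpus is installed) that measures what fraction of a corpus remains after removing them:
>>> def content_fraction(text):
...     stops = set(stopwords.words("english"))
...     content = [w for w in text if w.lower() not in stops]
...     return len(content) / float(len(text))
...
>>> content_fraction(nltk.corpus.reuters.words())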
names
The English names corpus.
>>> names.words("female.txt")[:10]
[u'Abagael', u'Abagail', u'Abbe', u'Abbey', u'Abbi', u'Abbie', u'Abby', u'Abigael', u'Abigail', u'Abigale']
>>> names.words("male.txt")[:10]
[u'Aamir', u'Aaron', u'Abbey', u'Abbie', u'Abbot', u'Abbott', u'Abby', u'Abdel', u'Abdul', u'Abdulkarim']
swadesh
A comparative wordlist containing lists of about 200 common words in each of several languages.
>>> swadesh.fileids()
[u'be', u'bg', u'bs', u'ca', u'cs', u'cu', u'de', u'en', u'es', u'fr', u'hr', u'it', u'la', u'mk', u'nl', u'pl', u'pt', u'ro', u'ru', u'sk', u'sl', u'sr', u'sw', u'uk']
>>> swadesh.words("en")[:20]
[u'I', u'you (singular), thou', u'he', u'we', u'you (plural)', u'they', u'this', u'that', u'here', u'there', u'who', u'what', u'where', u'when', u'how', u'not', u'all', u'many', u'some', u'few']
The entries() method lets you select the cognate words from several languages at once, as shown below:
>>> swadesh.entries(["fr", "en"])[:10]
[(u'je', u'I'), (u'tu, vous', u'you (singular), thou'), (u'il', u'he'), (u'nous', u'we'), (u'vous', u'you (plural)'), (u'ils, elles', u'they'), (u'ceci', u'this'), (u'cela', u'that'), (u'ici', u'here'), (u'l\xe0', u'there')]
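One natural use of entries() is to build a simple bilingual dictionary; a sketch that turns the French-English pairs into a lookup table:
>>> translate = dict(swadesh.entries(["fr", "en"]))
>>> translate["chien"] #u'dog'
>>> translate["jeter"] #u'throw'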
5 WordNet
WordNet is a semantically oriented dictionary of English (mainly covering English synonyms); it is not covered here for now.
6 Processing Raw Text
6.1 Tokenizing and Converting to a Text Object
Tokenization here means splitting a raw text string into a list of words and punctuation. Simply call nltk.word_tokenize() and then nltk.Text() to convert the raw text into an NLTK text (a Text object):
>>> from urllib import urlopen
>>> url = "http://www.gutenberg.org/files/2554/2554-0.txt"
>>> raw = urlopen(url).read().decode("utf-8")
>>> raw[:100]
u'\ufeffThe Project Gutenberg EBook of Crime and Punishment, by Fyodor Dostoevsky\r\n\r\nThis eBook is for the '
>>> tokens = nltk.word_tokenize(raw)
>>> tokens[:20]
[u'\ufeffThe', u'Project', u'Gutenberg', u'EBook', u'of', u'Crime', u'and', u'Punishment', u',', u'by', u'Fyodor', u'Dostoevsky', u'This', u'eBook', u'is', u'for', u'the', u'use', u'of', u'anyone']
>>> text = nltk.Text(tokens) #convert to an nltk.Text object so that the Text methods above become available
>>> text
>>> text[:20]
[u'\ufeffThe', u'Project', u'Gutenberg', u'EBook', u'of', u'Crime', u'and', u'Punishment', u',', u'by', u'Fyodor', u'Dostoevsky', u'This', u'eBook', u'is', u'for', u'the', u'use', u'of', u'anyone']
6.2 Processing HTML
Pages fetched by a web crawler come as HTML, which is full of tags, JavaScript, forms and other material that has nothing to do with the page content we actually want. Here we use BeautifulSoup to extract the text of the page from the HTML, as follows:
>>> from urllib import urlopen
>>> from bs4 import BeautifulSoup
>>> url = "http://news.bbc.co.uk/2/hi/health/2284783.stm"
>>> html = urlopen(url).read()
>>> html[:60]
'<!doctype html public "-//W3C//DTD HTML 4.0 Transitional//EN'
>>> soup = BeautifulSoup(html, "html.parser") #parse the HTML and extract the page text
>>> raw = soup.text
>>> raw[:60]
u"\n\nBBC NEWS | Health | Blondes 'to die out in 200 years'\n\n\n\n\n"
>>> tokens = nltk.word_tokenize(raw)
>>> text = nltk.Text(tokens)
>>> text.concordance("gene")
Displaying 7 of 7 matches:
hey say too few people now carry the gene for blondes to last beyond the next
blonde hair is caused by a recessive gene . In order for a child to have blond
have blonde hair , it must have the gene on both sides of the family in the g
ere is a disadvantage of having that gene or by chance . They do n't disappear
des would disappear is if having the gene was a disadvantage and I do not thin
er's Polio campaign launched in Iraq Gene defect explains high blood pressure
er's Polio campaign launched in Iraq Gene defect explains high blood pressure
6.3 使用正则表达式
Language processing always involves pattern matching, which calls for Python's regular expression module re. We will not go over the re module in detail here; see the Python documentation on regular expressions for a full introduction. Next, a few examples of string matching with re:
6.3.1 Simple Regular Expression Examples
>>> import re
>>> wsj = sorted(set(nltk.corpus.treebank.words()))
>>> [w for w in wsj if re.search(r'^[0-9]+\.[0-9]+$', w)] #match decimal numbers
[u'0.0085', u'0.05', u'0.1', u'0.16', u'0.2', u'0.25', u'0.28', u'0.3', u'0.4', u'0.5', u'0.50', u'0.54', u'0.56', u'0.60', u'0.7', u'0.82', u'0.84', u'0.9', u'0.95', u'0.99', ...]
>>> [w for w in wsj if re.search(r"^[0-9]{4}$", w)] #匹配四位整数
[u'1614', u'1637', u'1787', u'1901', u'1903', u'1917', u'1925', u'1929', u'1933', u'1934', u'1948', u'1953', u'1955', u'1956', u'1961', u'1965', u'1966', u'1967', u'1968', u'1969', u'1970', u'1971', u'1972', u'1973', u'1975', u'1976', u'1977', u'1979', u'1980', u'1981', u'1982', u'1983', u'1984', u'1985', u'1986', u'1987', u'1988', u'1989', u'1990', u'1991', u'1992', u'1993', u'1994', u'1995', u'1996', u'1997', u'1998', u'1999', u'2000', u'2005', u'2009', u'2017', u'2019', u'2029', u'3057', u'8300']
>>> [w for w in wsj if re.search(r"^[0-9]+-[a-z]{3,5}$", w)] #匹配数字-单词(长度3-5)
[u'10-day', u'10-lap', u'10-year', u'100-share', u'12-point', u'12-year', u'14-hour', u'15-day', u'150-point', u'190-point', u'20-point', u'20-stock', u'21-month', u'237-seat', u'240-page', u'27-year', u'30-day', u'30-point', u'30-share', u'30-year', u'300-day', u'36-day', u'36-store', u'42-year', u'50-state', u'500-stock', u'52-week', u'69-point', u'84-month', u'87-store', u'90-day']
>>> [w for w in wsj if re.search(r"[a-z]{5,}-[a-z]{2,3}-[a-z]{,6}$", w)]
[u'black-and-white', u'bread-and-butter', u'father-in-law', u'machine-gun-toting', u'savings-and-loan']
>>> [w for w in wsj if re.search(r"(ed|ing)$", w)][:20] #匹配ed或者ing结尾的词
[u'62%-owned', u'Absorbed', u'According', u'Adopting', u'Advanced', u'Advancing', u'Alfred', u'Allied', u'Annualized', u'Anything', u'Arbitrage-related', u'Arbitraging', u'Asked', u'Assuming', u'Atlanta-based', u'Baking', u'Banking', u'Beginning', u'Beijing', u'Being', ...]
6.3.2 Extracting Word Pieces with re.findall()
- Extract sequences of two or more vowels
>>> wsj = sorted(set(nltk.corpus.treebank.words()))
>>> fd = nltk.FreqDist(vs for word in wsj
... for vs in re.findall(r"[aeiou]{2,}", word))
>>> fd.items()[:20]
[(u'aa', 3), (u'eo', 39), (u'ei', 86), (u'ee', 217), (u'ea', 476), (u'oui', 6), (u'ao', 6), (u'eu', 18), (u'au', 106), (u'io', 549), (u'ia', 253), (u'ae', 11), (u'ie', 331), (u'iao', 1), (u'iai', 1), (u'uou', 5), (u'ieu', 3), (u'ai', 261), (u'aii', 1), (u'uee', 4)]
- Remove word-internal vowels from English words
English text is highly redundant: it stays quite readable even when word-internal vowels are dropped. In the example below, the regular expression matches, from left to right, a word-initial vowel sequence, a word-final vowel sequence, or any consonant; everything else is discarded.
>>> ptn = re.compile(r"^[AEIOUaeiou]+|[AEIOUaeiou]+$|[^AEIOUaeiou]")
>>> def compress(word):
... pieces = ptn.findall(word)
... return "".join(pieces)
...
>>> english_udhr = nltk.corpus.udhr.words("English-Latin1")
>>> nltk.tokenwrap(compress(w) for w in english_udhr[:100])
u'Unvrsl Dclrtn of Hmn Rghts Prmble Whrs rcgntn of the inhrnt dgnty and\nof the eql and inlnble rghts of all mmbrs of the hmn fmly is the fndtn\nof frdm , jstce and pce in the wrld , Whrs dsrgrd and cntmpt fr hmn\nrghts hve rsltd in brbrs acts whch hve outrgd the cnscnce of mnknd ,\nand the advnt of a wrld in whch hmn bngs shll enjy frdm of spch and\nblf and frdm frm fr and wnt hs bn prclmd as the hghst asprtn of the\ncmmn pple , Whrs it is essntl , if'
- Extract consonant-vowel sequences
Extract all consonant-vowel sequences from the Rotokas vocabulary:
>>> rotokas_words = nltk.corpus.toolbox.words("rotokas.dic")
>>> cvs = [cv for w in rotokas_words for cv in re.findall(r"[ptksvr][aeiou]", w)]
>>> cfg = nltk.ConditionalFreqDist(cvs)
>>> cfg.tabulate()
a e i o u
k 418 148 94 420 173
p 83 31 105 34 51
r 187 63 84 89 79
s 0 0 100 2 1
t 47 8 0 148 37
v 93 27 105 48 49
- Find the words that contain a given consonant-vowel pair
>>> cv_word_pairs = [(cv, w) for w in rotokas_words
... for cv in re.findall(r"[ptksvr][aeiou]", w)]
>>> cv_index = nltk.Index(cv_word_pairs)
>>> cv_index["su"]
[u'kasuari']
>>> cv_index["po"]
[u'kaapo', u'kaapopato', u'kaipori', u'kaiporipie', u'kaiporivira', u'kapo', u'kapoa', u'kapokao', u'kapokapo', u'kapokapo', u'kapokapoa', u'kapokapoa', u'kapokapora', u'kapokapora', u'kapokaporo', u'kapokaporo', u'kapokari', u'kapokarito', u'kapokoa', u'kapoo', u'kapooto', u'kapoovira', u'kapopaa', u'kaporo', u'kaporo', u'kaporopa', u'kaporoto', u'kapoto', u'karokaropo', u'karopo', u'kepo', u'kepoi', u'keposi', u'kepoto']
6.4 Normalizing Text
Normalizing text generally refers to tasks such as stripping affixes, stemming, and lemmatization.
Stemming extracts the stem or root form of a word, which does not necessarily carry the full meaning (e.g. effective is reduced to effect).
Lemmatization reduces any inflected form of a word to its base form, which does carry the full meaning (e.g. a past tense is mapped back to its base form: drove becomes drive).
6.4.1 Stemming
NLTK provides several commonly used stemming interfaces: the Porter stemmer, the Lancaster stemmer, and the Snowball stemmer. Examples:
nltk.PorterStemmer
>>> from nltk import PorterStemmer
>>> porter_stemmer = PorterStemmer()
>>> porter_stemmer.stem('maximum')
'maximum'
>>> porter_stemmer.stem('presumably')
u'presum'
>>> porter_stemmer.stem('multiply')
u'multipli'
>>> porter_stemmer.stem('provision')
u'provis'
>>> porter_stemmer.stem('owed')
u'owe'
nltk.LancasterStemmer
>>> from nltk import LancasterStemmer
>>> lancaster_stemmer = LancasterStemmer()
>>> lancaster_stemmer.stem('maximum')
'maxim'
>>> lancaster_stemmer.stem('presumably')
'presum'
>>> lancaster_stemmer.stem('multiply')
'multiply'
>>> lancaster_stemmer.stem('provision')
u'provid'
>>> lancaster_stemmer.stem('owed')
'ow'
nltk.SnowballStemmer
>>> from nltk import SnowballStemmer
>>> snowball_stemmer = SnowballStemmer("english")
>>> snowball_stemmer.stem("maximum")
u'maximum'
>>> snowball_stemmer.stem("presumably")
u'presum'
>>> snowball_stemmer.stem("provision")
u'provis'
>>> snowball_stemmer.stem("owed")
u'owe'
In general, nltk.PorterStemmer is the usual choice.
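In practice a stemmer is usually applied to an entire token list rather than to single words; a minimal sketch with the Porter stemmer (outputs omitted):
>>> porter = PorterStemmer()
>>> tokens = nltk.word_tokenize("Presumably the provisions were maximally effective")
>>> [porter.stem(t) for t in tokens]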
6.4.2 Lemmatization
NLTK provides a lemmatization interface that reduces word forms to their base form; it works especially well when part-of-speech tags are supplied. Examples:
nltk.WordNetLemmatizer
>>> from nltk import WordNetLemmatizer
>>> lmtzr = WordNetLemmatizer()
>>> lmtzr.lemmatize("cars")
u'car'
>>> lmtzr.lemmatize("feet")
u'foot'
>>> lmtzr.lemmatize("people")
'people'
>>> lmtzr.lemmatize("fantasized", pos="v")
u'fantasize'
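WordNetLemmatizer treats every word as a noun unless a pos tag is given, which is why supplying the part of speech matters; a sketch of the difference (outputs omitted):
>>> lmtzr.lemmatize("driving") #no pos given, treated as a noun and returned unchanged
>>> lmtzr.lemmatize("driving", pos="v") #treated as a verb and reduced to its base form
>>> lmtzr.lemmatize("better", pos="a") #adjectives are reduced as well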