Getting Started with NLTK

from matplotlib import pyplot as plt
from nltk import book
*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908
book.text1

# Find every occurrence of a word and show it with its surrounding context
book.text1.concordance("monstrous")
Displaying 11 of 11 matches:
ong the former , one was of a most monstrous size . ... This came towards us , 
ON OF THE PSALMS . " Touching that monstrous bulk of the whale or ork we have r
ll over with a heathenish array of monstrous clubs and spears . Some were thick
d as you gazed , and wondered what monstrous cannibal and savage could ever hav
that has survived the flood ; most monstrous and most mountainous ! That Himmal
they might scout at Moby Dick as a monstrous fable , or still worse and more de
th of Radney .'" CHAPTER 55 Of the Monstrous Pictures of Whales . I shall ere l
ing Scenes . In connexion with the monstrous pictures of whales , I am strongly
ere to enter upon those still more monstrous stories of them which are to be fo
ght have been rummaged out of this monstrous cabinet there is no telling . But 
of Whale - Bones ; for Whales of a monstrous size are oftentimes cast up dead u
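
To make the idea behind concordance concrete, here is a minimal pure-Python sketch. It is not NLTK's implementation: the function name, the `width` parameter and the toy token list are all made up for illustration, and it matches case-insensitively because the output above lists both "monstrous" and "Monstrous".

```python
def concordance(tokens, word, width=3):
    """Return (left context, match, right context) for each occurrence of `word`."""
    target = word.lower()
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == target:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines

# Hypothetical toy text, not taken from the NLTK corpora.
tokens = "a most monstrous size . The Monstrous pictures of whales".split()
for left, tok, right in concordance(tokens, "monstrous"):
    print("{} [{}] {}".format(left, tok, right))
```

Each match is returned together with up to `width` tokens on either side, which is the same information the aligned display above conveys.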
# Find words that appear in the same contexts as the given word, e.g. "the ___ pictures" and "the ___ size".
book.text1.similar("monstrous")
imperial subtly impalpable pitiable curious abundant perilous
trustworthy untoward singular lamentable few determined maddens
horrible tyrannical lazy mystifying christian exasperate
# common_contexts finds the contexts shared by two or more words. In the output, _ separates the word before from the word after.
book.text2.common_contexts(["monstrous", "very"])
print('-' * 100)
book.text2.common_contexts(["monstrous"])
a_pretty is_pretty a_lucky am_glad be_glad
----------------------------------------------------------------------------------------------------
a_pretty was_happy is_fond a_lucky a_deal am_glad is_pretty be_glad
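
The notion of a "context" used by similar and common_contexts can be sketched in plain Python as the pair (previous word, next word). This is only an illustration under that assumption: the helper names and the toy sentence are hypothetical, and NLTK's own ranking of results is not reproduced.

```python
from collections import defaultdict

def contexts(tokens):
    """Map each lowercased word to the set of (previous, next) pairs around it."""
    ctx = defaultdict(set)
    for prev, word, nxt in zip(tokens, tokens[1:], tokens[2:]):
        ctx[word.lower()].add((prev.lower(), nxt.lower()))
    return ctx

def common_contexts(tokens, words):
    """Return the contexts shared by every word in `words`, formatted as prev_next."""
    ctx = contexts(tokens)
    shared = set.intersection(*(ctx[w.lower()] for w in words))
    return sorted("{}_{}".format(p, n) for p, n in shared)

# Hypothetical toy text, not taken from the NLTK corpora.
tokens = "she is very pretty and he is monstrous pretty but a very lucky man".split()
print(common_contexts(tokens, ["monstrous", "very"]))  # ['is_pretty']
print(common_contexts(tokens, ["very"]))               # ['a_lucky', 'is_pretty']
```

With a single word the function simply lists all of that word's contexts, which mirrors why the second call above returns a longer list than the first.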

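Plot where each of the given words occurs in the text. The x-axis is the position (word offset) of each occurrence, so the plot shows how a word is distributed across the document.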

book.text4.dispersion_plot(["citizens", "democracy", "freedom", "duties", "America"])
plt.show()

[Figure 1: dispersion plot for the words above in text4]
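
The data behind such a plot is just the list of positions at which each word occurs. A rough sketch, assuming exact (case-sensitive) matching and using a made-up helper name and toy text:

```python
def word_offsets(tokens, targets):
    """Map each target word to the list of token positions where it occurs."""
    offsets = {w: [] for w in targets}
    for i, tok in enumerate(tokens):
        if tok in offsets:
            offsets[tok].append(i)
    return offsets

# Hypothetical toy text, not taken from the NLTK corpora.
tokens = "citizens value freedom and freedom needs citizens and duties".split()
print(word_offsets(tokens, ["citizens", "freedom", "duties"]))
# {'citizens': [0, 6], 'freedom': [2, 4], 'duties': [8]}
```

Drawing one row per word and one tick per position gives the dispersion picture.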

Count how many times a word occurs

book.text3.count("smote")
5

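Frequency statistics. The resulting fd is not sorted by frequency; to get the most frequent words, call most_common. In short, fd works like a dictionary that maps each token to its count.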

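from nltk import probability
# Build a frequency distribution over every token in text1
fd = probability.FreqDist(book.text1)
fd
# keys() is not sorted by frequency; wrap it in list() so it can be sliced
words = list(fd.keys())
print(words[0:50])

# Counts of the first 10 keys, joined into one line
w_str = ''
for w in words[0:10]:
    w_str += str(fd[w]) + ' '
print(w_str)
print(fd['whale'])
fd.most_common(50)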
[u'funereal', u'unscientific', u'divinely', u'foul', u'four', u'gag', u'prefix', u'woods', u'clotted', u'Duck', u'hanging', u'plaudits', u'woody', u'Until', u'marching', u'disobeying', u'canes', u'granting', u'advantage', u'Westers', u'insertion', u'DRYDEN', u'formless', u'Untried', u'superficially', u'Western', u'portentous', u'beacon', u'meadows', u'sinking', u'Ding', u'Spurn', u'treasuries', u'churned', u'oceans', u'powders', u'tinkerings', u'tantalizing', u'yellow', u'bolting', u'uncertain', u'stabbed', u'bringing', u'elevations', u'ferreting', u'believers', u'wooded', u'songster', u'uttering', u'scholar']
1 1 2 11 74 2 1 9 2 2 
906





[(u',', 18713),
 (u'the', 13721),
 (u'.', 6862),
 (u'of', 6536),
 (u'and', 6024),
 (u'a', 4569),
 (u'to', 4542),
 (u';', 4072),
 (u'in', 3916),
 (u'that', 2982),
 (u"'", 2684),
 (u'-', 2552),
 (u'his', 2459),
 (u'it', 2209),
 (u'I', 2124),
 (u's', 1739),
 (u'is', 1695),
 (u'he', 1661),
 (u'with', 1659),
 (u'was', 1632),
 (u'as', 1620),
 (u'"', 1478),
 (u'all', 1462),
 (u'for', 1414),
 (u'this', 1280),
 (u'!', 1269),
 (u'at', 1231),
 (u'by', 1137),
 (u'but', 1113),
 (u'not', 1103),
 (u'--', 1070),
 (u'him', 1058),
 (u'from', 1052),
 (u'be', 1030),
 (u'on', 1005),
 (u'so', 918),
 (u'whale', 906),
 (u'one', 889),
 (u'you', 841),
 (u'had', 767),
 (u'have', 760),
 (u'there', 715),
 (u'But', 705),
 (u'or', 697),
 (u'were', 680),
 (u'now', 646),
 (u'which', 640),
 (u'?', 637),
 (u'me', 627),
 (u'like', 624)]
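
In NLTK 3, FreqDist is built on top of the standard library's collections.Counter, so the dictionary-like behaviour can be tried out without loading any corpus. A small sketch on a made-up token list:

```python
from collections import Counter

# Hypothetical toy text, not taken from the NLTK corpora.
tokens = "the whale and the sea and the ship".split()
fd = Counter(tokens)

print(fd["the"])          # 3
print(fd["harpoon"])      # 0, missing keys count as zero instead of raising KeyError
print(fd.most_common(2))  # [('the', 3), ('and', 2)]
```

Iterating over the keys gives no frequency ordering, which is why most_common is the call to reach for when the ranking matters.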
fd.plot(50)
plt.show()

[Figure 2: frequency plot of the 50 most common tokens in text1]

fd.plot(50, cumulative=True)
plt.show()

[Figure 3: cumulative frequency plot of the 50 most common tokens in text1]
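
What a cumulative plot adds up can be shown with a running total over the most common counts. This is a sketch with the standard library only; the toy text and variable names are assumptions for illustration, and it does no plotting.

```python
from collections import Counter
from itertools import accumulate

# Hypothetical toy text, not taken from the NLTK corpora.
tokens = "the whale and the sea and the whale and the ship".split()
fd = Counter(tokens)

top = fd.most_common(3)                 # [('the', 4), ('and', 3), ('whale', 2)]
counts = [count for _, count in top]
cumulative = list(accumulate(counts))   # running total: [4, 7, 9]
total = sum(fd.values())                # 11 tokens overall

for (word, _), running in zip(top, cumulative):
    print("{:<6} {:>2} of {}".format(word, running, total))
```

The last running total divided by the token count tells you how much of the text the top words cover, here 9 of 11 tokens.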

n-gram

Use collocations to extract n-gram data. The call below uses the default window_size=2, i.e. bigrams.

book.text4.collocations(window_size=2)
United States; fellow citizens; four years; years ago; Federal
Government; General Government; American people; Vice President; Old
World; Almighty God; Fellow citizens; Chief Magistrate; Chief Justice;
God bless; every citizen; Indian tribes; public debt; one another;
foreign nations; political parties
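
As a rough illustration of where such pairs come from, the sketch below counts adjacent word pairs and keeps the ones that repeat. It is a deliberate simplification: NLTK's collocations also applies filters and a statistical association score, which are left out here, and the function name, parameters and toy text are hypothetical.

```python
from collections import Counter

def top_bigrams(tokens, num=3, min_count=2):
    """Return up to `num` adjacent word pairs that occur at least `min_count` times."""
    pairs = Counter(zip(tokens, tokens[1:]))
    frequent = [(pair, count) for pair, count in pairs.items() if count >= min_count]
    # Sort by descending count, then alphabetically, so the result is deterministic.
    frequent.sort(key=lambda pc: (-pc[1], pc[0]))
    return [" ".join(pair) for pair, _ in frequent[:num]]

# Hypothetical toy text, not taken from the NLTK corpora.
tokens = ("United States citizens thank the United States and "
          "fellow citizens greet fellow citizens").split()
print(top_bigrams(tokens))  # ['United States', 'fellow citizens']
```

Raw counts favour pairs made of very common words, which is one reason a real collocation finder scores pairs instead of just counting them.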
