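Python Natural Language Processing, Study Notes (1): Getting Started with NLTK

NLTK study notes: Chapter 1

I. Getting Started

1. Importing the nltk package and downloading data packages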

import nltk

nltk.download()  (you can also name the package to download, e.g. nltk.download('punkt'))
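
A small sketch of downloading a data package only when it is missing. nltk.data.find() and nltk.download() are NLTK's own functions; the helper name ensure_resource and its downloader parameter are my own additions, so that the function can be tried out without network access.

```python
import nltk


def ensure_resource(resource_path, package_id, downloader=nltk.download):
    """Download package_id only if resource_path is not installed yet.

    ensure_resource and the downloader parameter are hypothetical helpers
    for these notes; nltk.data.find and nltk.download belong to NLTK.
    Returns True if a download was triggered, False otherwise.
    """
    try:
        nltk.data.find(resource_path)  # raises LookupError when missing
        return False
    except LookupError:
        downloader(package_id)
        return True
```

For example, ensure_resource('tokenizers/punkt', 'punkt') should fetch the punkt tokenizer the first time and do nothing afterwards.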

2. The book module: a collection of sample texts used as data

from nltk.book import *
*** Introductory Examples for the NLTK Book ***
Loading text1, …, text9 and sent1, …, sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908

text1

text2
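
text1 to text9 are nltk.text.Text objects. The same kind of object can be built from any list of tokens, which is handy for experimenting without the book data. A minimal sketch; the sentence and the name "tiny sample" are made up for illustration.

```python
from nltk.text import Text

# A made-up token list; real texts come from a tokenizer or a corpus.
tokens = "Call me Ishmael . Some years ago I went to sea to see the whale .".split()
my_text = Text(tokens, name="tiny sample")  # name is only a display label

total_tokens = len(my_text)        # number of tokens, punctuation included
vocabulary = sorted(set(my_text))  # distinct tokens
to_count = my_text.count("to")     # how often "to" occurs
first_three = my_text[0:3]         # slicing returns a list of tokens

print(my_text)
print(total_tokens, to_count, first_three)
```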

II. Searching Text

1. Concordance (a word shown in its contexts)

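text1.concordance('eg')
Shows every occurrence of 'eg' in text1 together with its surrounding context ('eg' stands for whatever word you want to look up)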

text1.concordance('monstrous')
Displaying 11 of 11 matches:
ong the former , one was of a most monstrous size . … This came towards us ,
ON OF THE PSALMS . " Touching that monstrous bulk of the whale or ork we have r
ll over with a heathenish array of monstrous clubs and spears . Some were thick
d as you gazed , and wondered what monstrous cannibal and savage could ever hav
that has survived the flood ; most monstrous and most mountainous ! That Himmal
they might scout at Moby Dick as a monstrous fable , or still worse and more de
th of Radney .'" CHAPTER 55 Of the Monstrous Pictures of Whales . I shall ere l
ing Scenes . In connexion with the monstrous pictures of whales , I am strongly
ere to enter upon those still more monstrous stories of them which are to be fo
ght have been rummaged out of this monstrous cabinet there is no telling . But
of Whale - Bones ; for Whales of a monstrous size are oftentimes cast up dead u
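
To see what a concordance is built from, here is a sketch on a made-up sentence. ConcordanceIndex and Text.concordance come from nltk.text; the window() helper is my own and only illustrates the idea of printing a few tokens on each side of a hit.

```python
from nltk.text import ConcordanceIndex, Text

# Made-up tokens so the example runs without the book data.
tokens = "the whale was a monstrous size and the monstrous whale swam".split()

# The index maps each (lower-cased) word to the positions where it occurs.
index = ConcordanceIndex(tokens, key=lambda s: s.lower())
positions = index.offsets("monstrous")


def window(tokens, position, width=2):
    """Hypothetical helper: join `width` tokens on each side of a hit."""
    start = max(0, position - width)
    return " ".join(tokens[start:position + width + 1])


contexts = [window(tokens, p) for p in positions]
for line in contexts:
    print(line)

# The same lookup through the Text interface used in these notes.
Text(tokens).concordance("monstrous", width=40)
```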

2. Words that appear in similar contexts

text1.similar('monstrous')
true contemptible christian abundant few part mean careful puzzled
mystifying passing curious loving wise doleful gamesome singular
delightfully perilous fearless

3. Contexts shared by two words

text2.common_contexts(['monstrous', 'very'])
a_pretty am_glad a_lucky is_pretty be_glad
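
Both similar() and common_contexts() rest on the same idea: the context of a word is the pair (previous word, next word). The following is a plain-Python sketch of that idea, not NLTK's implementation; all function names and the sample sentence are my own.

```python
from collections import defaultdict


def build_context_index(tokens):
    """Map each lower-cased word to the set of (previous, next) pairs around it."""
    words = [t.lower() for t in tokens]
    contexts = defaultdict(set)
    for i in range(1, len(words) - 1):
        contexts[words[i]].add((words[i - 1], words[i + 1]))
    return dict(contexts)


def similar_words(contexts, word):
    """Words sharing at least one context with `word`, most shared first."""
    target = contexts.get(word, set())
    scores = {w: len(ctx & target) for w, ctx in contexts.items()
              if w != word and ctx & target}
    return sorted(scores, key=lambda w: (-scores[w], w))


def common_contexts(contexts, words):
    """Contexts shared by all given words, written as left_right."""
    shared = set.intersection(*(contexts.get(w, set()) for w in words))
    return sorted(f"{left}_{right}" for left, right in shared)


tokens = ("she is a very pretty girl and a monstrous pretty one ; "
          "I am very glad and he is monstrous glad").split()
index = build_context_index(tokens)
print(similar_words(index, "monstrous"))
print(common_contexts(index, ["monstrous", "very"]))
```

In this toy sentence "very" and "monstrous" both occur between "a" and "pretty", which is the same a_pretty pattern that text2 reports.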

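4. Generating random text

text3.generate()  (this failed for me with an error about a missing argument; the function was removed in early NLTK 3.x releases, and as far as I know NLTK 3.4 and later provide it again)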

III. Frequencies and Counting

1. Plotting how word usage changes over time (text4 holds the inaugural addresses, so a position in the text roughly corresponds to a point in time)

text4.dispersion_plot(["citizens", "democracy", "freedom", "duties", "America"])
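
A dispersion plot draws one mark per occurrence of a word, at the position of that occurrence in the text. The data behind it can be sketched without matplotlib; dispersion_offsets is my own helper name, and the matching is case-sensitive here, which I believe is also the default of dispersion_plot.

```python
def dispersion_offsets(tokens, words):
    """Return {word: [token positions]} for each requested word (case-sensitive)."""
    offsets = {w: [] for w in words}
    for position, token in enumerate(tokens):
        if token in offsets:
            offsets[token].append(position)
    return offsets


# Made-up sentence standing in for a long corpus such as text4.
tokens = "citizens have duties ; America asks its citizens for freedom".split()
offsets = dispersion_offsets(tokens, ["citizens", "freedom", "America"])
for word, where in offsets.items():
    print(word, where)
```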

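2. Counting the distinct items in a string; the result displays like a dictionary. Note that a FreqDist built from a string counts its characters; pass a list of tokens to count words. Run from nltk.book import * first (or from nltk import FreqDist), otherwise FreqDist is undefined and the call fails

f = FreqDist('str')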
f
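
A quick sketch of the difference between passing a string and passing a token list; the sample string is made up.

```python
from nltk import FreqDist

sample = "hello hello world"

# Iterating over a string yields characters, so this counts characters.
char_counts = FreqDist(sample)

# Splitting first yields words, so this counts words.
word_counts = FreqDist(sample.split())

print(char_counts["l"])      # occurrences of the letter l
print(word_counts["hello"])  # occurrences of the word hello
```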

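3. Other FreqDist functions

fdist = FreqDist(samples)   create a frequency distribution from the given samples

fdist.inc(sample)   increment the count of a sample (NLTK 2 only; in NLTK 3 write fdist[sample] += 1)

fdist['monstrous']   number of times the given sample occurred

fdist.freq('monstrous')   relative frequency of the given sample

fdist.N()   total number of samples

fdist.keys()   list of samples sorted by decreasing frequency (NLTK 2; in NLTK 3 use fdist.most_common())

for sample in fdist:   iterate over the samples in order of decreasing frequency (NLTK 2; NLTK 3 does not guarantee this order)

fdist.max()   sample with the highest count

fdist.tabulate()   print the frequency distribution as a table

fdist.plot()   plot the frequency distribution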

4. Characters that occur only once in the string

f.hapaxes()
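
The FreqDist calls above can be tried on a short made-up token list. This sketch assumes NLTK 3, where FreqDist behaves like collections.Counter, so most_common() gives the frequency ordering and += replaces inc().

```python
from nltk import FreqDist

tokens = "the whale and the sea and the sky".split()
fdist = FreqDist(tokens)

total = fdist.N()                # total number of tokens counted
the_count = fdist["the"]         # count of one sample
the_share = fdist.freq("the")    # count divided by N
top_two = fdist.most_common(2)   # samples by decreasing count
most_frequent = fdist.max()      # sample with the highest count
once_only = fdist.hapaxes()      # samples that occur exactly once

fdist["sky"] += 1                # NLTK 3 replacement for fdist.inc("sky")

print(total, the_count, the_share, top_two, most_frequent, once_only)
```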

IV. Collocations and Bigrams

1. Finding collocations: word pairs that occur together unusually often in a text

text4.collocations()
United States; fellow citizens; four years; years ago; Federal
Government; General Government; American people; Vice President; Old
World; Almighty God; Fellow citizens; Chief Magistrate; Chief Justice;
God bless; every citizen; Indian tribes; public debt; one another;
foreign nations; political parties
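
text4.collocations() needs the book data (and, as far as I know, the stopwords corpus). The same kind of ranking can be sketched on a made-up token list with BigramCollocationFinder; the frequency filter of 2 and the likelihood-ratio score are choices for this sketch, which I believe mirror what collocations() does internally.

```python
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures

# Made-up tokens; a real run would use the tokens of a whole text.
tokens = ("the united states and the people of the united states "
          "thank the people").split()

finder = BigramCollocationFinder.from_words(tokens)
finder.apply_freq_filter(2)  # drop pairs seen fewer than 2 times

# Rank the remaining pairs by the likelihood-ratio association score.
candidates = finder.nbest(BigramAssocMeasures.likelihood_ratio, 3)
print("; ".join(" ".join(pair) for pair in candidates))
```

Unlike collocations(), this sketch does not filter out stopwords, so pairs such as "the united" survive.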

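2. Finding bigrams

bigrams(['s', 'e', 'f'])  (this prints a generator object rather than the pairs; after looking it up, I found that wrapping the call in list() shows them)

list(bigrams(['my', 'name', 'is', 'swt']))
[('my', 'name'), ('name', 'is'), ('is', 'swt')]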
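
A short sketch of why the bare call prints an object: in NLTK 3, bigrams() returns a generator, which produces the pairs lazily and can be consumed only once.

```python
from nltk import bigrams

words = ["my", "name", "is", "swt"]

pairs = bigrams(words)
print(pairs)              # a generator object, not the pairs themselves

pair_list = list(pairs)   # consuming the generator yields the pairs
print(pair_list)

leftover = list(pairs)    # the generator is now exhausted
print(leftover)
```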

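V. Word Sense Disambiguation

Later in this part of the book there is a function, babelize_shell() (the machine translation demo)

Calling it raises an error saying the name is not defined, because current versions of NLTK have removed this function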

VI. Dialogue Systems

nltk.chat.chatbots()

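Chat with a chatbot. This one is fun: you can hold a conversation with it, but it is still not very smart and comes across as a bit of a fool (hee hee, ha ha)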
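
The bundled chatbots are pattern matchers, which explains why they feel so limited. A minimal sketch with nltk.chat.util.Chat; the two patterns and their replies are made up for illustration, and each pattern has a single reply so the result is predictable.

```python
from nltk.chat.util import Chat, reflections

# Each pair is (regular expression, list of possible replies).
# %1 is replaced by the first captured group of the match.
pairs = [
    (r"my name is (.*)", ["Hello %1, nice to meet you."]),
    (r"(.*)", ["Tell me more."]),  # catch-all fallback
]

bot = Chat(pairs, reflections)

reply = bot.respond("My name is swt")
fallback = bot.respond("The weather is nice")
print(reply)
print(fallback)
```

The bot has no understanding beyond these patterns: anything that does not match the first rule falls through to the catch-all reply.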