学习参考书: http://nltk.googlecode.com/svn/trunk/doc/book/
1. 使用代理下载数据
nltk.set_proxy("**.com:80")
nltk.download()
2. 使用sents(fileid)函数时候出现:Resource 'tokenizers/punkt/english.pickle' not found. Please use the NLTK Downloader to obtain the resource:
import nltk
nltk.download()
安装窗口中选择'Models'项,然后'在 'Identifier' 列找 'punkt,点击下载安装该数据包
3. 语料Corpus元素获取函数
from nltk.corpus import webtext
webtext.fileids() #得到语料中所有文件的id集合
webtext.raw(fileid) #给定文件的所有字符集合
webtext.words(fileid) #所有单词集合
webtext.sents(fileid) #所有句子集合
Example | Description |
---|---|
fileids() | the files of the corpus |
fileids([categories]) | the files of the corpus corresponding to these categories |
categories() | the categories of the corpus |
categories([fileids]) | the categories of the corpus corresponding to these files |
raw() | the raw content of the corpus |
raw(fileids=[f1,f2,f3]) | the raw content of the specified files |
raw(categories=[c1,c2]) | the raw content of the specified categories |
words() | the words of the whole corpus |
words(fileids=[f1,f2,f3]) | the words of the specified fileids |
words(categories=[c1,c2]) | the words of the specified categories |
sents() | the sentences of the whole corpus |
sents(fileids=[f1,f2,f3]) | the sentences of the specified fileids |
sents(categories=[c1,c2]) | the sentences of the specified categories |
abspath(fileid) | the location of the given file on disk |
encoding(fileid) | the encoding of the file (if known) |
open(fileid) | open a stream for reading the given corpus file |
root() | the path to the root of locally installed corpus |
readme() | the contents of the README file of the corpus |
4.文本处理的一些常用函数
假若text是单词集合的列表
len(text) #单词个数
set(text) #去重
sorted(text) #排序
text.count('a') #数给定的单词的个数
text.index('a') #给定单词首次出现的位置
FreqDist(text) #单词及频率,keys()为单词,*[key]得到值
FreqDist(text).plot(50,cumulative=True) #画累积图
bigrams(text) #所有的相邻二元组
text.collocations() #找文本中频繁相邻二元组
text.concordance("word") #找给定单词出现的位置及上下文
text.similar("word") #找和给定单词语境相似的所有单词
text.common_context("a“,"b") #找两个单词相似的上下文语境
text.dispersion_plot(['a','b','c',...]) #单词在文本中的位置分布比较图
text.generate() #随机产生一段文本
NLTK's Conditional Frequency Distributions: commonly-used methods and idioms for defining,accessing, and visualizing a conditional frequency distribution.of counters.
Example | Description |
---|---|
cfdist = ConditionalFreqDist(pairs) | create a conditional frequency distribution from a list of pairs |
cfdist.conditions() | alphabetically sorted list of conditions |
cfdist[condition] | the frequency distribution for this condition |
cfdist[condition][sample] | frequency for the given sample for this condition |
cfdist.tabulate() | tabulate the conditional frequency distribution |
cfdist.tabulate(samples, conditions) | tabulation limited to the specified samples and conditions |
cfdist.plot() | graphical plot of the conditional frequency distribution |
cfdist.plot(samples, conditions) | graphical plot limited to the specified samples and conditions |
cfdist1 < cfdist2 | test if samples in cfdist1 occur less frequently than in cfdist2 |
to be continued