Conditional Probability Distributions in NLTK

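To generate text, a program generally needs an existing training set (a seed, so to speak) that tells it how words are distributed and how they are habitually used. Below is a very basic text generation function built on NLTK's conditional frequency distribution: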

import nltk

def generate_model(cfd, word, num=15):
    for i in range(num):
        print(word)  # output the current word
        # pick the word most likely to follow the current one and
        # feed it into the next iteration as the current word
        word = cfd[word].max()

text = nltk.corpus.genesis.words('english-kjv.txt')
bigrams = nltk.bigrams(text)
cfd = nltk.ConditionalFreqDist(bigrams)

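I want to look at nltk.ConditionalFreqDist in some detail, because I find it remarkably useful. It is a collection of frequency distributions, one per condition. For example, suppose we want the frequency distribution of given words in news text and in romance fiction: "news" and "romance" are the two conditions, and the words are the events we observe. A ConditionalFreqDist takes a sequence of (condition, event) pairs as its argument, for example:
cfd = ConditionalFreqDist(pairs)  # pairs: a list of (condition, event) tuples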
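
To make the (condition, event) idea concrete without downloading a corpus, here is a minimal standard-library sketch of roughly what a conditional frequency distribution stores. It is only an illustration of the idea, not NLTK's actual implementation; the toy `pairs` data and the name `toy_cfd` are made up for this example.

```python
from collections import Counter, defaultdict

# Hypothetical toy data: (condition, event) pairs, here (genre, word).
pairs = [
    ("news", "four"), ("news", "the"), ("news", "four"),
    ("romance", "love"), ("romance", "the"), ("romance", "love"),
    ("romance", "you"),
]

# One Counter (a frequency distribution) per condition.
toy_cfd = defaultdict(Counter)
for condition, event in pairs:
    toy_cfd[condition][event] += 1

print(sorted(toy_cfd))                     # the conditions
print(toy_cfd["news"]["four"])             # count of "four" under "news"
print(toy_cfd["romance"].most_common(1))   # most frequent event under "romance"
```

The lookup pattern `toy_cfd[condition][event]` mirrors the way a ConditionalFreqDist is indexed.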

For example, to get the word frequency distributions of two genres in the Brown corpus, news and romance, we can use the following code:

import nltk
from nltk.corpus import brown

# First build the list of (condition, event) pairs:
genre_word = [(genre, word) for genre in ['news', 'romance'] for word in brown.words(categories=genre)]

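# Build the conditional frequency distribution
cfdist = nltk.ConditionalFreqDist(genre_word)
# The result is in effect a defaultdict of Counter-like FreqDist objects, keyed by the conditions 'news' and 'romance'. Part of the output: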
#defaultdict(nltk.probability.FreqDist,
            {'news': Counter({u'sunbonnet': 1,
                      u'Elevated': 1,
                      u'narcotic': 2,
                      u'four': 73,
                      u'woods': 4,
                      u'railing': 1,
                      u'Until': 5,
# We can index further into this result:
news = cfdist['news']

news_four = cfdist['news']['four']  # cfdist[condition][event]
Out[39]: 73

Beyond that, we can also present cfdist in other forms, for example with tabulate or plot:

In[44]: cfdist.tabulate(conditions = ['news'],samples = ['four'])
     four
news   73
In[45]: cfdist.tabulate(samples = ['four'])
        four
   news   73
romance    8
In[46]: cfdist.tabulate(samples = ['I','love','you'])
           I love  you
   news  179    3   55
romance  951   32  456

# We can also have it show proportions instead of counts:

# Compute relative frequencies into new dicts so the counts in cfdist stay intact.
total_news = cfdist['news'].N()
total_romance = cfdist['romance'].N()

news_freq = {w: cfdist['news'][w] / total_news for w in cfdist['news']}

romance_freq = {w: cfdist['romance'][w] / total_romance for w in cfdist['romance']}

print(romance_freq['I'])
Out[78]: 0.013581445831310159
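
One pitfall in this kind of normalization: in Python, a plain assignment such as `b = a` binds a second name to the same object rather than copying it, so dividing in place through the "copy" would also overwrite the original counts. Below is a minimal sketch with a toy `Counter` (the data and variable names are hypothetical) that leaves the counts untouched.

```python
from collections import Counter

# Hypothetical toy counts standing in for cfdist['romance'].
counts = Counter({"I": 3, "love": 1, "you": 4})

alias = counts                 # same object, NOT a copy
assert alias is counts

total = sum(counts.values())   # plays the role of FreqDist.N()

# Build a new dict of relative frequencies; `counts` is not modified.
rel_freq = {word: n / total for word, n in counts.items()}

print(rel_freq["I"])           # 3 / 8
print(counts["I"])             # still the raw count
```

With NLTK itself, `FreqDist.freq(sample)` returns the same ratio directly, for instance `cfdist['romance'].freq('I')`.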

We can also plot the results to make them easier to read:

cfdist.plot(samples=['I', 'love', 'you'])

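Next, we can use a CFD for something more fun, such as generating text automatically, as in the example at the start of this post. Here we use romance fiction to build a more entertaining "computer-written romance novel":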

import nltk
from nltk.corpus import brown

def generate_romance(rcfdist, word, num=100):
    for i in range(num):
        print(word)
        word = rcfdist[word].max()  # most frequent successor of the current word

# keep only purely alphabetic tokens
refined = [w for w in brown.words(categories='romance') if w.isalpha()]
bigrams = nltk.bigrams(refined)
rcfdist = nltk.ConditionalFreqDist(bigrams)

generate_romance(rcfdist, 'love')

output:
love
you
have
to
the
same
time
to
the
same
time
to
the

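As you can see, this program has a serious flaw: once a chain of bigrams forms a fixed cycle, the program keeps going around that cycle and never leaves it. Even so, this use of conditional probability distributions is still instructive.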
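
One common way to soften the cycling problem is to sample the next word in proportion to its count instead of always taking the maximum. The sketch below is a standard-library illustration of that idea, not the method used in this post; the helper names `build_bigram_counts` and `generate_sampled`, and the toy text, are hypothetical.

```python
import random
from collections import Counter, defaultdict

def build_bigram_counts(words):
    """Map each word to a Counter of the words that follow it."""
    following = defaultdict(Counter)
    for first, second in zip(words, words[1:]):
        following[first][second] += 1
    return following

def generate_sampled(following, word, num=10, seed=None):
    """Generate up to `num` words, sampling each next word by its count."""
    rng = random.Random(seed)
    output = []
    for _ in range(num):
        output.append(word)
        candidates = following.get(word)
        if not candidates:          # dead end: no observed successor
            break
        choices = list(candidates)
        weights = [candidates[w] for w in choices]
        word = rng.choices(choices, weights=weights, k=1)[0]
    return output

# Hypothetical toy text standing in for the romance corpus.
toy_words = "to the same time to the same place to the end".split()
following = build_bigram_counts(toy_words)
print(" ".join(generate_sampled(following, "to", num=10, seed=0)))
```

Sampling does not rule out repetition, but a frequent bigram no longer forces the generator into the same loop on every step.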
