Study notes: a news summary extraction algorithm

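The algorithm for extracting a summary from a news article is as follows:

1. Compute an importance score for each word in the text, and keep the words whose scores fall within the thresholds as keywords.

2. Score each sentence by the importance of the words it contains.

3. Rank the sentences by their scores.

4. Take the top-k sentences as the summary.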

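Preparation:

from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from collections import defaultdict
from string import punctuation
from heapq import nlargest

# Treat English stop words and punctuation marks as words to ignore.
# Note: this rebinds the name stopwords from the nltk corpus reader to a set.
stopwords = set(stopwords.words('english') + list(punctuation))

# Cutoffs on normalized frequency: words scoring at or above max_cut,
# or at or below min_cut, are discarded.
max_cut = 0.9
min_cut = 0.1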

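A note on punctuation and nlargest:

punctuation (string.punctuation) is a string containing the ASCII punctuation characters; list(punctuation) turns it into a list of single characters.

nlargest(n, iterable, key=None) returns the n largest elements of an iterable in descending order. It uses a heap internally, which is efficient when n is small relative to the size of the input.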
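As a small illustration of these two helpers, the sketch below uses only the standard library. The scores dictionary is made-up data standing in for per-sentence scores; it is not taken from the original post.

```python
from heapq import nlargest
from string import punctuation

# string.punctuation is a plain str, so membership tests work per character.
print(type(punctuation).__name__)
print(',' in punctuation)

# Hypothetical sentence scores, keyed by sentence index.
scores = {0: 1.5, 1: 0.2, 2: 2.7, 3: 0.9}

# Iterating over a dict yields its keys; key=scores.get ranks them by value.
top_two = nlargest(2, scores, key=scores.get)
print(top_two)  # [2, 0]
```

Passing the dictionary itself together with key=scores.get is what lets nlargest return keys (here, sentence indices) rather than the scores.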

Step 1:

Compute word importance:

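def compute_frequencies(word_sent):
    # Count how often each non-stop word occurs across all sentences.
    freq = defaultdict(int)
    for s in word_sent:
        for word in s:
            if word not in stopwords:
                freq[word] += 1

    # Normalize by the highest count so every score lies in (0, 1].
    m = float(max(freq.values()))

    # Iterate over a copy of the keys: deleting from a dict while
    # iterating over it raises RuntimeError in Python 3.
    for w in list(freq.keys()):
        freq[w] /= m
        # Drop words that are too common or too rare to be informative.
        if freq[w] >= max_cut or freq[w] <= min_cut:
            del freq[w]

    return freq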
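To make the normalization and cutoff concrete, here is a minimal, self-contained sketch of the same idea run on a toy input. The function name normalized_frequencies, its stop_words parameter, and the sample data are assumptions made for illustration; it builds a new dictionary instead of deleting keys in place.

```python
from collections import defaultdict

def normalized_frequencies(word_sent, stop_words, min_cut=0.1, max_cut=0.9):
    # Count non-stop words across all tokenized sentences.
    counts = defaultdict(int)
    for sentence in word_sent:
        for word in sentence:
            if word not in stop_words:
                counts[word] += 1
    if not counts:
        return {}
    # Scale by the largest count, then keep only scores strictly between the cutoffs.
    top = max(counts.values())
    return {w: c / top for w, c in counts.items() if min_cut < c / top < max_cut}

# Hypothetical, already tokenized sentences.
toy = [
    ["rates", "rise", "again"],
    ["rates", "rise"],
    ["rates", "fall"],
    ["markets", "react", "to", "rates"],
]
result = normalized_frequencies(toy, stop_words={"to"})
print(result)
# {'rise': 0.5, 'again': 0.25, 'fall': 0.25, 'markets': 0.25, 'react': 0.25}
```

Here "rates" occurs four times, normalizes to 1.0, and is removed. With this scheme the most frequent word always scores 1.0, so it is always discarded whenever max_cut is below 1.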
Step 2:

Compute sentence importance:

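def summarize(text, n):
    # Split the text into sentences; n cannot exceed the sentence count.
    sents = sent_tokenize(text)
    assert n <= len(sents)

    # Lowercase each sentence and tokenize it into words.
    word_sent = [word_tokenize(s.lower()) for s in sents]
    freq = compute_frequencies(word_sent)

    # Score each sentence as the sum of the frequencies of its keywords.
    ranking = defaultdict(int)
    for i, words in enumerate(word_sent):
        for w in words:
            if w in freq:
                ranking[i] += freq[w]

    # Indices of the n highest-scoring sentences, best first.
    sents_idx = rank(ranking, n)
    return [sents[j] for j in sents_idx]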
 
Step 3:

Ranking:

def rank(ranking, n):
    # Return the keys (sentence indices) with the n largest scores.
    return nlargest(n, ranking, key=ranking.get)
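For readers who want to try the whole pipeline without installing nltk or its data, the following is a rough end-to-end sketch using only the standard library. The regex-based naive_sentences and naive_words are crude stand-ins for sent_tokenize and word_tokenize, and STOP_WORDS, toy_summarize, and the sample text are all hypothetical. Unlike the version above, it returns the selected sentences in document order rather than score order.

```python
import re
from collections import defaultdict
from heapq import nlargest

# Hypothetical, tiny stop-word set used only for this illustration.
STOP_WORDS = {"the", "a", "of", "to", "and", "in", "is", "on", "as", "it"}

def naive_sentences(text):
    # Crude stand-in for sent_tokenize: split on whitespace after ., ! or ?
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def naive_words(sentence):
    # Crude stand-in for word_tokenize: lowercase runs of letters.
    return re.findall(r"[a-z']+", sentence.lower())

def toy_summarize(text, n, min_cut=0.1, max_cut=0.9):
    sents = naive_sentences(text)
    word_sent = [naive_words(s) for s in sents]

    # Step 1: normalized word frequencies, filtered by the two cutoffs.
    counts = defaultdict(int)
    for words in word_sent:
        for w in words:
            if w not in STOP_WORDS:
                counts[w] += 1
    top = max(counts.values())
    freq = {w: c / top for w, c in counts.items() if min_cut < c / top < max_cut}

    # Step 2: a sentence's score is the sum of its keyword frequencies.
    scores = {i: sum(freq.get(w, 0.0) for w in words)
              for i, words in enumerate(word_sent)}

    # Steps 3 and 4: pick the n best indices, then restore document order.
    best = nlargest(n, scores, key=scores.get)
    return [sents[i] for i in sorted(best)]

text = (
    "Rates rose as the bank tightened policy. "
    "The bank said inflation guides policy on rates. "
    "Rates matter to borrowers. "
    "Rates also affect savers. "
    "It rained in the city."
)
summary = toy_summarize(text, 2)
print(summary)
# ['Rates rose as the bank tightened policy.',
#  'The bank said inflation guides policy on rates.']
```

Sorting the chosen indices is a design choice: a summary usually reads more naturally when its sentences appear in their original order, whereas returning them best-first, as the function above does, shows which sentence scored highest.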
