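Background
Anyone who has used NLTK, the well-known NLP toolkit, knows that the 'model' subpackage was removed when NLTK was updated to version 3.0. The interfaces of its various dependencies had changed substantially, the migration of 'model' ran into problems, and the maintainers removed it temporarily but have been slow to merge it back. This is a real pity, because it contained the Ngram model that we use so often!
However, the maintainers do provide a base class for Ngram models, `BaseNgramModel`, on the 'model' branch, and users can build their own models on top of it. Based on this class, the author implemented a recursive NgramCounter and then re-implemented the Katz backoff smoothed Ngram model from version 2.x. The code is hosted on GitHub. The implementation is briefly described below.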
BaseNgramModel
Let's first see what BaseNgramModel looks like:
@compat.python_2_unicode_compatible
class BaseNgramModel(object):
    """An example of how to consume NgramCounter to create a language model.

    This class isn't intended to be used directly, folks should inherit from it
    when writing their own ngram models.
    """

    def __init__(self, ngram_counter):
        self.ngram_counter = ngram_counter
        # for convenient access save top-most ngram order ConditionalFreqDist
        self.ngrams = ngram_counter.ngrams[ngram_counter.order]
        self._ngrams = ngram_counter.ngrams
        self._order = ngram_counter.order
        self._check_against_vocab = self.ngram_counter.check_against_vocab

    def check_context(self, context):
        """Makes sure context not longer than model's ngram order and is a tuple."""
        if len(context) >= self._order:
            raise ValueError("Context is too long for this ngram order: {0}".format(context))
        # ensures the context argument is a tuple
        return tuple(context)

    def score(self, word, context):
        """
        This is a dummy implementation. Child classes should define their own
        implementations.

        :param word: the word to get the probability of
        :type word: str
        :param context: the context the word is in
        :type context: Tuple[str]
        """
        return 0.5

    def logscore(self, word, context):
        """
        Evaluate the log probability of this word in this context.

        This implementation actually works, child classes don't have to
        redefine it.

        :param word: the word to get the probability of
        :type word: str
        :param context: the context the word is in
        :type context: Tuple[str]
        """
        score = self.score(word, context)
        if score == 0.0:
            return NEG_INF
        return log(score, 2)

    def entropy(self, text):
        """
        Calculate the approximate cross-entropy of the n-gram model for a
        given evaluation text.
        This is the average log probability of each word in the text.

        :param text: words to use for evaluation
        :type text: Iterable[str]
        """
        normed_text = (self._check_against_vocab(word) for word in text)
        H = 0.0  # entropy is conventionally denoted by "H"
        processed_ngrams = 0
        for ngram in self.ngram_counter.to_ngrams(normed_text):
            context, word = tuple(ngram[:-1]), ngram[-1]
            H += self.logscore(word, context)
            processed_ngrams += 1
        return - (H / processed_ngrams)

    def perplexity(self, text):
        """
        Calculates the perplexity of the given text.
        This is simply 2 ** cross-entropy for the text.

        :param text: words to calculate perplexity of
        :type text: Iterable[str]
        """
        return pow(2.0, self.entropy(text))
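As we can see, to re-implement NgramModel by inheriting from this class, we have two main tasks:
- Implement the constructor argument ngram_counter
- Override the score method in the derived class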
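To make the two tasks concrete, here is a minimal sketch that does not depend on NLTK. `TinyBase` is a hypothetical stand-in that mirrors only the `score`/`logscore`/`entropy`/`perplexity` logic of the base class shown above (its `entropy` takes ready-made ngram tuples instead of raw text), and `UnigramMLE` is an illustrative subclass with an unsmoothed maximum-likelihood `score`. Neither name comes from NLTK.

```python
from collections import Counter
from math import log

NEG_INF = float("-inf")


class TinyBase:
    """Hypothetical stand-in for the base class; not NLTK's implementation."""

    def score(self, word, context):
        # Dummy value, meant to be overridden by subclasses.
        return 0.5

    def logscore(self, word, context):
        s = self.score(word, context)
        return NEG_INF if s == 0.0 else log(s, 2)

    def entropy(self, ngrams):
        # Negated average log2 probability over the given ngram tuples.
        ngrams = list(ngrams)
        total = sum(self.logscore(ng[-1], ng[:-1]) for ng in ngrams)
        return -total / len(ngrams)

    def perplexity(self, ngrams):
        return pow(2.0, self.entropy(ngrams))


class UnigramMLE(TinyBase):
    """Illustrative subclass: only 'score' is overridden."""

    def __init__(self, tokens):
        tokens = list(tokens)
        self.counts = Counter(tokens)
        self.total = len(tokens)

    def score(self, word, context):
        # Unsmoothed relative frequency; the context is ignored for unigrams.
        return self.counts[word] / self.total


model = UnigramMLE(["a", "b", "a", "c"])
print(model.score("a", ()))                  # relative frequency of "a"
print(model.perplexity([("a",), ("b",)]))    # 2 ** cross-entropy
```

Because `logscore`, `entropy` and `perplexity` are inherited unchanged, overriding `score` is enough to change what the model reports.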
NgramCounter
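From the code above, we can see that the class of the ngram_counter argument must provide the following attributes and methods:
- order: attribute, int, the order of the model
- ngrams: attribute, dict, the collection of conditional frequency distributions for each order
- vocabulary: attribute, set, the ngram vocabulary
- to_ngrams: method, (list) -> yields tuple, generates ngrams from the input text
- check_against_vocab: method, (str) -> str, maps a word according to the vocabulary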
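One way to write this contract down is a `typing.Protocol`. The sketch below is only an illustration: `CounterLike` and `DummyCounter` are hypothetical names, and the type annotations are assumptions inferred from the list above rather than anything NLTK defines.

```python
from typing import Iterable, Iterator, Protocol, Tuple, runtime_checkable


@runtime_checkable
class CounterLike(Protocol):
    """Hypothetical description of what the model expects from a counter."""

    order: int
    ngrams: dict
    vocabulary: set

    def to_ngrams(self, text: Iterable[str]) -> Iterator[Tuple[str, ...]]: ...

    def check_against_vocab(self, word: str) -> str: ...


class DummyCounter:
    """Smallest object that satisfies the contract (illustration only)."""

    def __init__(self):
        self.order = 1
        self.ngrams = {}
        self.vocabulary = set()

    def to_ngrams(self, text):
        # Unigrams only: wrap every token in a 1-tuple.
        return ((token,) for token in text)

    def check_against_vocab(self, word):
        return word


# isinstance only checks that the members exist, not their types.
print(isinstance(DummyCounter(), CounterLike))
```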
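A piece of cake. The one thing to watch is the recursive construction of the lower-order models, because the Katz backoff smoothed model relies on this data structure. As an aside, although Python does not distinguish between public and private attributes, try not to access a class's attributes directly from outside; guard them with @property and @xxx.setter instead. The implementation is as follows: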
from nltk.probability import ConditionalFreqDist
from nltk.util import ngrams


class NgramCounter(object):
    """
    NgramCounter implemented against the model base class 'BaseNgramModel' from NLTK 3.0.

    Required attributes and methods:
    - order: attribute, int, the order of the model
    - ngrams: attribute, dict, the conditional frequency distributions for each order
    - vocabulary: attribute, set, the ngram vocabulary
    - to_ngrams: method, (list) -> yields tuple, generates ngrams from the input text
    - check_against_vocab: method, (str) -> str, maps a word according to the vocabulary
    """

    def __init__(self, order: int, train: list,
                 pad_left: bool = True, pad_right: bool = False,
                 left_pad_symbol: str = '', right_pad_symbol: str = '',
                 recursive: bool = True):
        """
        :param order: order of the model
        :param train: training samples
        :param pad_left: whether to pad on the left
        :param pad_right: whether to pad on the right
        :param left_pad_symbol: left padding symbol
        :param right_pad_symbol: right padding symbol
        :param recursive: whether to build the lower-order models
        """
        self._ngrams = dict()

        # The order of the model must be greater than 0
        assert (order > 0), order

        # Store the order of the model
        self._order = order

        # Padding settings
        assert (isinstance(pad_left, bool))
        assert (isinstance(pad_right, bool))
        self._pad_left = pad_left
        self._pad_right = pad_right
        self._left_pad_symbol = left_pad_symbol
        self._right_pad_symbol = right_pad_symbol

        cfd = ConditionalFreqDist()
        self._vocabulary = set()

        # Input adaptation: if the training data is a single sentence
        # (a list of tokens) rather than a list of sentences, wrap it in a list
        if (train is not None) and isinstance(train[0], str):
            train = [train]

        for sent in train:
            for ngram in self.to_ngrams(sent):
                self._vocabulary.add(ngram)
                context = tuple(ngram[:-1])
                token = ngram[-1]
                # NB, the ConditionalFreqDist interface has changed: the method 'inc'
                # no longer exists, so the count is incremented as follows
                cfd[context][token] += 1
        self._ngrams[self._order] = cfd

        # NB, key step: recursively build the lower-order NgramCounter.
        # When recursing, build the lower-order distributions, and also fetch back
        # the distributions of every order below this one, down to 1
        if recursive and not order == 1:
            self._backoff = NgramCounter(order - 1, train,
                                         pad_left=pad_left, left_pad_symbol=left_pad_symbol,
                                         pad_right=pad_right, right_pad_symbol=right_pad_symbol)
            # Walk down the backoff chain and collect each lower-order distribution
            cursor = self._backoff
            while cursor is not None:
                self._ngrams[cursor.order] = cursor.ngrams[cursor.order]
                cursor = cursor.backoff
        else:
            self._backoff = None

    @property
    def order(self) -> int:
        return self._order

    @property
    def vocabulary(self) -> set:
        return self._vocabulary

    @property
    def ngrams(self) -> dict:
        return self._ngrams

    @property
    def backoff(self) -> 'NgramCounter':
        return self._backoff

    def check_against_vocab(self, word) -> str:
        """
        Unknown words are currently not handled in any way

        :param word:
        """
        return word

    def to_ngrams(self, text) -> tuple:
        return ngrams(text, self._order,
                      pad_left=self._pad_left, pad_right=self._pad_right,
                      left_pad_symbol=self._left_pad_symbol, right_pad_symbol=self._right_pad_symbol)
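The recursive idea can be seen in isolation with plain dictionaries. The following is a simplified sketch, not the author's code: `ngram_tuples` and `count_all_orders` are hypothetical helpers, padding is left out, and `collections.Counter` stands in for NLTK's `ConditionalFreqDist`.

```python
from collections import Counter, defaultdict


def ngram_tuples(tokens, n):
    """Return all n-gram tuples of a token list (no padding in this sketch)."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def count_all_orders(sents, order):
    """Return {n: {context: Counter(word -> count)}} for n = order down to 1."""
    table = defaultdict(Counter)
    for sent in sents:
        for gram in ngram_tuples(sent, order):
            # Split each ngram into its context and its final word.
            table[gram[:-1]][gram[-1]] += 1
    result = {order: dict(table)}
    if order > 1:
        # Recurse: the lower orders are counted from the same sentences.
        result.update(count_all_orders(sents, order - 1))
    return result


counts = count_all_orders([["a", "b", "a", "c"]], 3)
print(sorted(counts))        # the orders that were collected
print(counts[2][("a",)])     # words seen after the context ("a",)
```

After the call, one dictionary holds the counts for every order, which is what the backoff model needs when it falls back from a long context to a shorter one.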
NgramModel
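With a recursive NgramCounter, we can inherit from BaseNgramModel and bring NgramModel back to life. Two points need attention:
- Call the parent class constructor first, because it initializes various attributes
- Mind the recursion into the lower-order models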
Talk is cheap, show me the code:
class NgramModel(BaseNgramModel):
    """
    NgramModel re-implemented by inheriting from the model base class 'BaseNgramModel'.

    Note:
    1. The original methods 'prob' and 'logprob' have been renamed to 'score' and 'logscore'.
    2. The original 'entropy' padded the input text explicitly, whereas 'entropy' in the
       base class 'BaseNgramModel' does not. However, the base-class 'entropy' calls
       'to_ngrams' of 'NgramCounter', which already applies the padding,
       so we do not need to override 'entropy'.
    """

    def __init__(self, ngram_counter, estimator=None, *estimator_args, **estimator_kwargs):
        super(NgramModel, self).__init__(ngram_counter)

        # Set the smoothing estimator; fall back to the default if none is given.
        # '_estimator' is a module-level default estimator that is not shown here.
        if estimator is None:
            estimator = _estimator

        # Build the ngram model with the smoothing estimator
        if not estimator_args and not estimator_kwargs:
            self._model = ConditionalProbDist(self.ngrams, estimator, len(self.ngrams))
        else:
            self._model = ConditionalProbDist(self.ngrams, estimator, *estimator_args, **estimator_kwargs)

        # Left padding used by '_generate_one'. The base class does not set it,
        # so it is derived from the counter's padding settings.
        if ngram_counter._pad_left:
            self._lpad = (ngram_counter._left_pad_symbol,) * (self._order - 1)
        else:
            self._lpad = ()

        # Recursively build the lower-order model
        if self._order > 1 and self.ngram_counter.backoff is not None:
            self._backoff = NgramModel(self.ngram_counter.backoff, estimator, *estimator_args, **estimator_kwargs)
        else:
            self._backoff = None

    def score(self, word, context):
        """
        Evaluate the probability of this word in this context using Katz Backoff.

        :param word: the word to get the probability of
        :type word: str
        :param context: the context the word is in
        :type context: list(str)
        """
        context = tuple(context)
        # NB, the base class 'BaseNgramModel' has already assigned '_ngrams' to the
        # ConditionalFreqDist collection of 'NgramCounter'. The vocabulary is actually
        # the 'vocabulary' attribute of NgramCounter, so the check is changed as follows
        # if (context + (word,) in self._ngrams) or (self._n == 1):
        if (context + (word,) in self.ngram_counter.vocabulary) or (self._order == 1):
            return self[context].prob(word)
        else:
            return self._alpha(context) * self._backoff.score(word, context[1:])

    def _alpha(self, tokens):
        return self._beta(tokens) / self._backoff._beta(tokens[1:])

    def _beta(self, tokens):
        return self[tokens].discount() if tokens in self else 1

    def choose_random_word(self, context):
        """
        Randomly select a word that is likely to appear in this context.

        :param context: the context the word is in
        :type context: list(str)
        """
        return self.generate(1, context)[-1]

    # NB, this will always start with same word if the model
    # was trained on a single text
    def generate(self, num_words, context=()):
        """
        Generate random text based on the language model.

        :param num_words: number of words to generate
        :type num_words: int
        :param context: initial words in generated string
        :type context: list(str)
        """
        text = list(context)
        for i in range(num_words):
            text.append(self._generate_one(text))
        return text

    def _generate_one(self, context):
        # The base class stores the order as '_order' (the 2.x attribute '_n' is gone)
        context = (self._lpad + tuple(context))[-self._order + 1:]
        if context in self:
            return self[context].generate()
        elif self._order > 1:
            return self._backoff._generate_one(context[1:])
        else:
            return '.'

    def __contains__(self, item):
        return tuple(item) in self._model

    def __getitem__(self, item):
        return self._model[tuple(item)]

    def __repr__(self):
        return '<NgramModel with %d %d-grams>' % (len(self.ngram_counter.vocabulary), self._order)
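To see the arithmetic of the backoff rule used by `score`, `_alpha` and `_beta`, here is a small NLTK-free sketch. Everything in it is an assumption made for illustration: `ToyDist` reserves a fixed probability mass (0.1) for unseen words instead of using a real estimator, and `train`, `beta`, `alpha` and `score` are hypothetical helper names that only mirror the shape of the methods above.

```python
from collections import Counter, defaultdict


class ToyDist:
    """Toy distribution: scaled relative frequency plus a fixed reserved mass."""

    def __init__(self, reserved=0.1):
        self.counts = Counter()
        self.reserved = reserved

    def prob(self, word):
        return (1.0 - self.reserved) * self.counts[word] / sum(self.counts.values())

    def discount(self):
        # Probability mass held back for unseen words.
        return self.reserved


def train(tokens, order):
    """Return {n: {context: ToyDist}} for n = 1 .. order (no padding)."""
    models = {}
    for n in range(1, order + 1):
        dists = defaultdict(ToyDist)
        for i in range(len(tokens) - n + 1):
            gram = tuple(tokens[i:i + n])
            dists[gram[:-1]].counts[gram[-1]] += 1
        models[n] = dict(dists)
    return models


def beta(models, n, context):
    dist = models[n].get(context)
    return dist.discount() if dist is not None else 1.0


def alpha(models, n, context):
    return beta(models, n, context) / beta(models, n - 1, context[1:])


def score(models, n, word, context):
    context = tuple(context)
    dist = models[n].get(context)
    seen = dist is not None and dist.counts[word] > 0
    if n == 1 or seen:
        # Seen ngram (or unigram level): use this order's distribution directly.
        return dist.prob(word)
    # Unseen ngram: back off to the shorter context, weighted by alpha.
    return alpha(models, n, context) * score(models, n - 1, word, context[1:])


models = train(["a", "b", "a", "c"], 2)
seen_score = score(models, 2, "b", ("a",))     # bigram (a, b) was observed
backoff_score = score(models, 2, "c", ("b",))  # bigram (b, c) was not
print(seen_score, backoff_score)
```

In the second call the bigram is unseen, so the result is the unigram probability of "c" multiplied by the ratio of the two reserved masses.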
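Conclusion
The revived model produces exactly the same results as the original 2.x model. You can verify this yourself, or simply run the test code on GitHub.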