Reviving NgramModel! Reimplementing It by Inheriting from 'BaseNgramModel'

Background

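Anyone who has used NLTK, the well-known NLP toolkit, knows that the 'model' subpackage was removed when NLTK moved to version 3.0. The interfaces it depended on had changed substantially, the migration of 'model' ran into problems, and the maintainers removed it temporarily but have yet to merge it back. That is a real pity, because it contained the Ngram model we use so often!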

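In its place, though, the maintainers provide a base class for Ngram models, BaseNgramModel, on the 'model' branch, and users can build their own models on top of it. Working from this base class, I implemented a recursive NgramCounter and then reimplemented the Katz backoff smoothed Ngram model from the 2.x releases. The code is on GitHub. Below is a brief walkthrough of the implementation.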

BaseNgramModel

Let's first take a look at BaseNgramModel:

@compat.python_2_unicode_compatible
class BaseNgramModel(object):
    """An example of how to consume NgramCounter to create a language model.
    This class isn't intended to be used directly, folks should inherit from it
    when writing their own ngram models.
    """

    def __init__(self, ngram_counter):

        self.ngram_counter = ngram_counter
        # for convenient access save top-most ngram order ConditionalFreqDist
        self.ngrams = ngram_counter.ngrams[ngram_counter.order]
        self._ngrams = ngram_counter.ngrams
        self._order = ngram_counter.order

        self._check_against_vocab = self.ngram_counter.check_against_vocab

    def check_context(self, context):
        """Makes sure context not longer than model's ngram order and is a tuple."""
        if len(context) >= self._order:
            raise ValueError("Context is too long for this ngram order: {0}".format(context))
        # ensures the context argument is a tuple
        return tuple(context)

    def score(self, word, context):
        """
        This is a dummy implementation. Child classes should define their own
        implementations.
        :param word: the word to get the probability of
        :type word: str
        :param context: the context the word is in
        :type context: Tuple[str]
        """
        return 0.5

    def logscore(self, word, context):
        """
        Evaluate the log probability of this word in this context.
        This implementation actually works, child classes don't have to
        redefine it.
        :param word: the word to get the probability of
        :type word: str
        :param context: the context the word is in
        :type context: Tuple[str]
        """
        score = self.score(word, context)
        if score == 0.0:
            return NEG_INF
        return log(score, 2)

    def entropy(self, text):
        """
        Calculate the approximate cross-entropy of the n-gram model for a
        given evaluation text.
        This is the average log probability of each word in the text.
        :param text: words to use for evaluation
        :type text: Iterable[str]
        """

        normed_text = (self._check_against_vocab(word) for word in text)
        H = 0.0     # entropy is conventionally denoted by "H"
        processed_ngrams = 0
        for ngram in self.ngram_counter.to_ngrams(normed_text):
            context, word = tuple(ngram[:-1]), ngram[-1]
            H += self.logscore(word, context)
            processed_ngrams += 1
        return - (H / processed_ngrams)

    def perplexity(self, text):
        """
        Calculates the perplexity of the given text.
        This is simply 2 ** cross-entropy for the text.
        :param text: words to calculate perplexity of
        :type text: Iterable[str]
        """

        return pow(2.0, self.entropy(text))

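As we can see, reimplementing NgramModel by inheriting from this class comes down to two main tasks:

  1. Implement the class of the constructor argument ngram_counter
  2. Override the score method in the derived class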
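
To make the relationship between score, logscore, entropy and perplexity concrete, here is a minimal standalone sketch that mirrors the arithmetic of the base class without depending on NLTK. The free functions and the sample bigrams are illustrative assumptions, not part of NLTK. With the dummy score of 0.5 from the base class, every ngram contributes a log probability of -1, so the cross-entropy is 1 bit and the perplexity is 2.

```python
from math import log2

NEG_INF = float("-inf")


def logscore(score, word, context):
    """Base-2 log probability, with the same zero guard as the base class."""
    p = score(word, context)
    return NEG_INF if p == 0.0 else log2(p)


def cross_entropy(score, ngrams):
    """Negated average log probability over the given ngrams."""
    total = 0.0
    count = 0
    for ngram in ngrams:
        context, word = tuple(ngram[:-1]), ngram[-1]
        total += logscore(score, word, context)
        count += 1
    return -(total / count)


def perplexity(score, ngrams):
    """Perplexity is 2 raised to the cross-entropy."""
    return 2.0 ** cross_entropy(score, ngrams)


def dummy_score(word, context):
    # Same constant as the placeholder 'score' in BaseNgramModel.
    return 0.5


# Hypothetical evaluation data: three bigrams.
bigrams = [("a", "b"), ("b", "c"), ("c", "d")]
print(cross_entropy(dummy_score, bigrams))  # 1.0
print(perplexity(dummy_score, bigrams))     # 2.0
```

A real model only needs to replace dummy_score with a meaningful probability estimate; the other two quantities follow from it unchanged, which is why the derived class has to override nothing but score.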

NgramCounter

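From the code above we can see that the class of the ngram_counter argument must provide the following attributes and methods:

  • order: attribute, int, the model order
  • ngrams: attribute, dict, the conditional frequency distributions of each order
  • vocabulary: attribute, set, the ngram vocabulary
  • to_ngrams: method, (list) -> yields tuple, generates ngrams from the input text
  • check_against_vocab: method, (str) -> str, maps a word according to the vocabulary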
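
As a rough illustration of what the to_ngrams method is expected to yield, the following sketch generates padded ngrams in plain Python. The function name padded_ngrams and the "<s>" symbol are assumptions made for this example; the actual implementation below delegates to NLTK's ngrams helper instead.

```python
def padded_ngrams(tokens, n, pad_left=True, pad_right=False,
                  left_pad_symbol="", right_pad_symbol=""):
    """Yield n-gram tuples over tokens, optionally padded on either side."""
    seq = list(tokens)
    if pad_left:
        seq = [left_pad_symbol] * (n - 1) + seq
    if pad_right:
        seq = seq + [right_pad_symbol] * (n - 1)
    # Slide a window of width n over the (padded) sequence.
    for i in range(len(seq) - n + 1):
        yield tuple(seq[i:i + n])


for gram in padded_ngrams(["a", "b", "c"], 2, left_pad_symbol="<s>"):
    # The last element is the word, everything before it is the context.
    print(gram[:-1], gram[-1])
```

Left padding gives the first real token a context of its own, so it is counted like every other token.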

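Piece of cake. The one thing to watch is the recursive construction of the lower-order models, because the Katz backoff smoothed model relies on that data structure. As an aside, although Python attributes have no public/private distinction, try not to access them directly from outside the class; guard them with @property and @xxx.setter instead. The implementation is as follows: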

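from nltk import compat
from nltk.probability import ConditionalFreqDist
from nltk.util import ngrams


class NgramCounter(object):
    """
    NgramCounter implemented against the model base class 'BaseNgramModel' from NLTK 3.0

    Required attributes and methods
    - order: attribute, int, the model order
    - ngrams: attribute, dict, the conditional frequency distributions of each order
    - vocabulary: attribute, set, the ngram vocabulary
    - to_ngrams: method, (list) -> yields tuple, generates ngrams from the input text
    - check_against_vocab: method, (str) -> str, maps a word according to the vocabulary

    """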
    def __init__(self, order: int, train: list,
                 pad_left: bool = True, pad_right: bool = False,
                 left_pad_symbol: str = '', right_pad_symbol: str = '',
                 recursive: bool = True):
        """

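        :param order: the model order
        :param train: the training samples
        :param pad_left: whether to pad on the left
        :param pad_right: whether to pad on the right
        :param left_pad_symbol: the left padding symbol
        :param right_pad_symbol: the right padding symbol
        :param recursive: whether to build the lower-order models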
        """
        self._ngrams = dict()

        # the model order must be greater than 0
        assert (order > 0), order
        # store the model order
        self._order = order

        # padding settings
        assert (isinstance(pad_left, bool))
        assert (isinstance(pad_right, bool))
        self._pad_left = pad_left
        self._pad_right = pad_right
        self._left_pad_symbol = left_pad_symbol
        self._right_pad_symbol = right_pad_symbol
    
        cfd = ConditionalFreqDist()
        self._vocabulary = set()

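        # input adaptation: if the training data is a single sentence (a list of strings)
        # rather than a list of sentences, wrap it in a list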
        if (train is not None) and isinstance(train[0], compat.string_types):
            train = [train]

        for sent in train:
            for ngram in self.to_ngrams(sent):
                self._vocabulary.add(ngram)
                context = tuple(ngram[:-1])
                token = ngram[-1]
                # NB, the ConditionalFreqDist interface has changed and no longer has
                # the method 'inc'; use the following statement instead
                cfd[context][token] += 1

        self._ngrams[self._order] = cfd

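        # NB, key code: recursively build the lower-order NgramCounter
        # when recursing, build the lower-order distributions, and remember to
        # collect the distributions of orders order-2 down to 1 as well
        if recursive and not order == 1:
            self._backoff = NgramCounter(order - 1, train,
                                         pad_left=pad_left, left_pad_symbol=left_pad_symbol,
                                         pad_right=pad_right, right_pad_symbol=right_pad_symbol)
            # walk down the backoff chain and collect each lower-order distribution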
            cursor = self._backoff
            while cursor is not None:
                self._ngrams[cursor.order] = cursor.ngrams[cursor.order]
                cursor = cursor.backoff
        else:
            self._backoff = None

    @property
    def order(self) -> int:
        return self._order

    @property
    def vocabulary(self) -> set:
        return self._vocabulary

    @property
    def ngrams(self) -> dict:
        return self._ngrams

    @property
    def backoff(self) -> 'NgramCounter':
        return self._backoff

    def check_against_vocab(self, word) -> str:
        """
        Unknown words are currently returned unchanged, with no special handling
        :param word: the word to check
        """
        return word

    def to_ngrams(self, text) -> tuple:
        return ngrams(text, self._order,
                      pad_left=self._pad_left, pad_right=self._pad_right,
                      left_pad_symbol=self._left_pad_symbol, right_pad_symbol=self._right_pad_symbol)
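
The data structure that NgramCounter ends up exposing through ngrams is a dict keyed by order, where each value maps a context tuple to the counts of the words that follow it. The sketch below builds the same shape with the standard library only. It is an assumption-laden stand-in: it uses a simple loop over orders rather than the recursive constructor above, collections.Counter in place of ConditionalFreqDist, and a hypothetical function name count_all_orders.

```python
from collections import Counter, defaultdict


def count_all_orders(sentences, order, left_pad_symbol=""):
    """Return {n: {context: Counter(word -> count)}} for n = 1..order."""
    tables = {}
    for n in range(1, order + 1):
        table = defaultdict(Counter)
        for sent in sentences:
            # Left-pad so that the first token also gets a context.
            padded = [left_pad_symbol] * (n - 1) + list(sent)
            for i in range(len(padded) - n + 1):
                gram = tuple(padded[i:i + n])
                table[gram[:-1]][gram[-1]] += 1
        tables[n] = table
    return tables


tables = count_all_orders([["a", "b", "a", "b"]], order=2)
print(sorted(tables))            # the orders that were counted
print(tables[2][("a",)]["b"])    # how often "b" follows "a"
print(tables[1][()]["a"])        # unigram count of "a" (empty context)
```

Keeping every order in one dict is what lets the backoff model look up a shorter context whenever the full context has not been seen.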

NgramModel

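With a recursive NgramCounter in hand, we can inherit from BaseNgramModel and revive NgramModel. Two points need attention:

  1. Call the parent class constructor first, because it initializes the various attributes
  2. Mind the recursion into the lower-order models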

Talk is cheap, show me the code:


class NgramModel(BaseNgramModel):
    """
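    NgramModel reimplemented by inheriting from the model base class 'BaseNgramModel'

    Note:
        1. The original methods 'prob' and 'logprob' have been renamed to 'score' and 'logscore'
        2. The original method 'entropy' explicitly padded the input text, whereas 'entropy'
            in the base class 'BaseNgramModel' does not. However, the base class 'entropy'
            calls 'to_ngrams' of 'NgramCounter', which already applies the padding.
            So we do not need to override 'entropy'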
    """

    def __init__(self, ngram_counter, estimator=None, *estimator_args, **estimator_kwargs):

        super(NgramModel, self).__init__(ngram_counter)

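        # set the frequency smoothing estimator; fall back to the default if none is given
        # ('_estimator' is the module-level default estimator; its definition is not shown here)
        if estimator is None:
            estimator = _estimator

        # build the ngram model with the estimator
        if not estimator_args and not estimator_kwargs:
            self._model = ConditionalProbDist(self.ngrams, estimator, len(self.ngrams))
        else:
            self._model = ConditionalProbDist(self.ngrams, estimator, *estimator_args, **estimator_kwargs)

        # recursively build the lower-order model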
        if self._order > 1 and self.ngram_counter.backoff is not None:
            self._backoff = NgramModel(self.ngram_counter.backoff, estimator, *estimator_args, **estimator_kwargs)

    def score(self, word, context):
        """
        Evaluate the probability of this word in this context using Katz Backoff.
        :param word: the word to get the probability of
        :type word: str
        :param context: the context the word is in
        :type context: list(str)
        """

        context = tuple(context)
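        # NB, the attribute '_ngrams' has already been assigned in the base class 'BaseNgramModel'
        # to the collection of ConditionalFreqDist objects from 'NgramCounter'.
        # The vocabulary is actually the attribute 'vocabulary' of NgramCounter, so the 2.x check
        # is changed as follows
        # if (context + (word,) in self._ngrams) or (self._n == 1):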
        if (context + (word,) in self.ngram_counter.vocabulary) or (self._order == 1):
            return self[context].prob(word)
        else:
            return self._alpha(context) * self._backoff.score(word, context[1:])

    def _alpha(self, tokens):
        return self._beta(tokens) / self._backoff._beta(tokens[1:])

    def _beta(self, tokens):
        return self[tokens].discount() if tokens in self else 1

    def choose_random_word(self, context):
        """
        Randomly select a word that is likely to appear in this context.
        :param context: the context the word is in
        :type context: list(str)
        """

        return self.generate(1, context)[-1]

    # NB, this will always start with same word if the model
    # was trained on a single text
    def generate(self, num_words, context=()):
        """
        Generate random text based on the language model.
        :param num_words: number of words to generate
        :type num_words: int
        :param context: initial words in generated string
        :type context: list(str)
        """

        text = list(context)
        for i in range(num_words):
            text.append(self._generate_one(text))
        return text

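    def _generate_one(self, context):
        # NB: '_lpad' (the left-padding tuple of the 2.x model) is not set in the
        # '__init__' shown above and must be defined before 'generate' is called.
        # The 2.x attribute '_n' corresponds to '_order' in 'BaseNgramModel'.
        context = (self._lpad + tuple(context))[-self._order + 1:]
        if context in self:
            return self[context].generate()
        elif self._order > 1:
            return self._backoff._generate_one(context[1:])
        else:
            return '.'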

    def __contains__(self, item):
        return tuple(item) in self._model

    def __getitem__(self, item):
        return self._model[tuple(item)]

    def __repr__(self):
        return '<NgramModel with %d %d-grams>' % (len(self.ngram_counter.vocabulary), self._order)
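
The backoff idea behind score can be seen in isolation with a small bigram example. The sketch below is not the code above: it assumes absolute discounting in place of the estimator passed to ConditionalProbDist, backs off from bigrams to plain unigram frequencies only, and uses hypothetical names throughout. Seen bigrams get a discounted probability, and the mass freed by the discount is redistributed over unseen words in proportion to their unigram probability, which plays the role of the alpha factor.

```python
from collections import Counter, defaultdict


def train_bigram_backoff(tokens, discount=0.5):
    """Build a bigram scorer that backs off to unigram frequencies."""
    unigrams = Counter(tokens)
    total = sum(unigrams.values())
    bigrams = defaultdict(Counter)
    for prev, word in zip(tokens, tokens[1:]):
        bigrams[prev][word] += 1

    def p_unigram(word):
        return unigrams[word] / total

    def score(word, context):
        followers = bigrams.get(context)
        if not followers:
            # Context never seen as a bigram prefix: use the unigram model.
            return p_unigram(word)
        context_count = sum(followers.values())
        if word in followers:
            # Seen bigram: discounted relative frequency.
            return (followers[word] - discount) / context_count
        # Unseen bigram: share the reserved mass among unseen words.
        reserved = discount * len(followers) / context_count
        unseen_mass = sum(p_unigram(w) for w in unigrams if w not in followers)
        if unseen_mass == 0.0:
            return 0.0
        return reserved * p_unigram(word) / unseen_mass

    return score, sorted(unigrams)


tokens = "the cat sat on the mat the cat ate".split()
score, vocab = train_bigram_backoff(tokens)
print(score("cat", "the"))   # seen bigram, discounted estimate
print(score("sat", "the"))   # unseen bigram, backed-off estimate
print(sum(score(w, "the") for w in vocab))  # should be close to 1
```

The final sum is a useful sanity check for any backoff model: the probabilities of all words after a given context should add up to one.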

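Conclusion

The revived model produces exactly the same results as the original model in 2.x. You can test this yourself, or simply run the tests in the code on GitHub.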
