Computing Sentence Perplexity (PPL) with BERT and GPT-2

Definition

Perplexity measures how well a language model predicts a sequence: for a sentence of N tokens, PPL = exp(-(1/N) * sum_i log p(w_i | context)), i.e. the exponential of the average negative log-likelihood. Lower is better, so a fluent sentence should score a lower PPL than a scrambled one.
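As a minimal numeric illustration (the per-token log-probabilities below are made up for the example):

import numpy as np

log_probs = [-1.2, -0.4, -2.3]  # hypothetical log p(w_i | context) for a 3-token sentence
ppl = np.exp(-np.mean(log_probs))  # exp of the average negative log-likelihood
print(round(ppl, 2))  # 3.67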

BERT

Useful references:

- A GitHub project that uses BERT to compute PPL
- The Stack Overflow question "how-do-i-use-bertformaskedlm-or-bertmodel-to-calculate-perplexity-of-a-sentence"
- The Chinese-BERT-wwm repository

For a given sentence, mask one token at a time, in order; compute the negative log-likelihood (NLL) loss of the predicted token at each masked position; sum the per-token losses and take the mean; and finally raise e to that mean. The result is the sentence's PPL (strictly speaking a pseudo-perplexity, since BERT is not a left-to-right language model).

A test implementation:

import numpy as np
import torch
from transformers import BertTokenizer, BertForMaskedLM

# Load pre-trained model (weights) and tokenizer (vocabulary)
model = BertForMaskedLM.from_pretrained('hfl/chinese-bert-wwm-ext')
model.eval()
tokenizer = BertTokenizer.from_pretrained('hfl/chinese-bert-wwm-ext')

sentence = "我不会忘记和你一起奋斗的时光。"
tokenize_input = tokenizer.tokenize(sentence)
tensor_input = torch.tensor([tokenizer.convert_tokens_to_ids(tokenize_input)])
sen_len = len(tokenize_input)
sentence_loss = 0.

with torch.no_grad():
    for i, word in enumerate(tokenize_input):
        # Replace the i-th token of the sentence with [MASK]
        tokenize_input[i] = '[MASK]'
        mask_input = torch.tensor([tokenizer.convert_tokens_to_ids(tokenize_input)])

        output = model(mask_input)
        prediction_scores = output[0]  # shape: (batch, seq_len, vocab_size)

        # Log-probability the model assigns to the original token at position i
        log_probs = torch.log_softmax(prediction_scores[0, i], dim=0)
        sentence_loss += log_probs[tensor_input[0, i]].item()

        # Restore the original token before masking the next position
        tokenize_input[i] = word

# PPL = exp of the average negative log-likelihood over all tokens
ppl = np.exp(-sentence_loss / sen_len)
print(ppl)
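Each loop iteration runs a full forward pass, so this version needs one forward pass per token. The vectorized version below batches all masked copies into a single forward pass instead.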

A vectorized, tensor-style implementation (note that tokenizer.encode adds [CLS] and [SEP], which the loop above omits, so the two versions can differ slightly):

def score(model, tokenizer, sentence):
    # Encode with special tokens: [CLS] w_1 ... w_n [SEP]
    tensor_input = tokenizer.encode(sentence, return_tensors='pt')
    seq_len = tensor_input.size(-1)
    # One copy of the sentence per maskable token ([CLS] and [SEP] excluded)
    repeat_input = tensor_input.repeat(seq_len - 2, 1)
    # Row k has a 1 on the first superdiagonal, i.e. at position k + 1
    mask = torch.ones(seq_len - 1).diag(1)[:-2]
    masked_input = repeat_input.masked_fill(mask == 1, tokenizer.mask_token_id)
    # Labels are -100 (ignored by the loss) everywhere except the masked position
    labels = repeat_input.masked_fill(masked_input != tokenizer.mask_token_id, -100)
    with torch.no_grad():
        loss = model(masked_input, labels=labels).loss
    return np.exp(loss.item())


s = score(model, tokenizer, '我不会忘记和你一起奋斗的时光。')
print(s)
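The trade-off is memory: score builds one masked copy of the sentence per token, so the batch grows linearly with sentence length. For long inputs, splitting repeat_input into mini-batches is a reasonable workaround.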

GPT-2

Reference: the GPT2-Chinese repository.

The official GPT-2 does not support Chinese and uses BPE tokenization. For Chinese, NLP practitioners have trained GPT-2 models whose tokenization follows the BERT tokenizer instead (one token per character).
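A quick way to see this character-level behavior (using the same uer/gpt2-chinese-cluecorpussmall checkpoint as in the code below):

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("uer/gpt2-chinese-cluecorpussmall")
print(tokenizer.tokenize("今天是个好日子。"))
# Each Chinese character (and the punctuation mark) becomes its own token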

For a given sentence of length n, shift it left by one position to serve as the labels and drop its last token to serve as the input; compute the cross-entropy loss between GPT-2's output logits and the labels, then raise e to that loss to obtain the sentence's PPL.

import torch
from transformers import BertTokenizer, GPT2LMHeadModel
from torch.nn import CrossEntropyLoss


def cal_ppl_bygpt2():
    # Fluent sentences mixed with scrambled versions for comparison
    sens = ["今天是个好日子。", "天今子日。个是好", "这个婴儿有900000克呢。", "我不会忘记和你一起奋斗的时光。",
            "我不会记忘和你一起奋斗的时光。", "会我记忘和你斗起一奋的时光。"]
    # This Chinese GPT-2 checkpoint uses a BERT-style (character-level) tokenizer
    tokenizer = BertTokenizer.from_pretrained("uer/gpt2-chinese-cluecorpussmall")
    model = GPT2LMHeadModel.from_pretrained("uer/gpt2-chinese-cluecorpussmall")
    model.eval()
    inputs = tokenizer(sens, padding='max_length', max_length=50, truncation=True, return_tensors="pt")
    bs, sl = inputs['input_ids'].size()
    with torch.no_grad():
        outputs = model(**inputs, labels=inputs['input_ids'])
    logits = outputs[1]
    # Shift so that tokens < n predict n; the model's built-in loss is a single
    # average over the whole batch, so we recompute it per sentence here
    shift_logits = logits[:, :-1, :].contiguous()
    shift_labels = inputs['input_ids'][:, 1:].contiguous()
    shift_attentions = inputs['attention_mask'][:, 1:].contiguous()
    # Flatten the tokens; ignore_index=0 skips [PAD] positions
    loss_fct = CrossEntropyLoss(ignore_index=0, reduction="none")
    loss = loss_fct(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1)).reshape(bs, -1)
    # Mean NLL per sentence over its real (non-padding) tokens, then exponentiate
    meanloss = loss.sum(1) / shift_attentions.sum(1)
    ppl = torch.exp(meanloss).numpy().tolist()
    return ppl


if __name__ == '__main__':
    print(cal_ppl_bygpt2())
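For a single sentence, a simpler variant is possible. This is a sketch assuming a recent transformers version, where passing labels=input_ids makes the model shift the labels internally and return the mean token NLL:

import torch
from transformers import BertTokenizer, GPT2LMHeadModel

tokenizer = BertTokenizer.from_pretrained("uer/gpt2-chinese-cluecorpussmall")
model = GPT2LMHeadModel.from_pretrained("uer/gpt2-chinese-cluecorpussmall")
model.eval()


def single_sentence_ppl(sentence):
    input_ids = tokenizer(sentence, return_tensors="pt")["input_ids"]
    with torch.no_grad():
        # Internally the model shifts the labels by one and averages the token NLL
        loss = model(input_ids, labels=input_ids).loss
    return torch.exp(loss).item()


print(single_sentence_ppl("今天是个好日子。"))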
