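The article opens with a quote: "We tend to look through language and not realize how much power language has." Text summarization, text generation, and text autocompletion all depend on a language model (LM); in fact, the LM is the foundation of most NLP tasks. The article works from the basics upward and uses hands-on practice to show both the breadth and the depth of language modeling.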
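A language model learns to predict the probability of a sequence of words. What does the probability of a word sequence mean? Take machine translation: you are given one word sequence and asked to convert it into another, so you need to estimate the probability distribution over candidate output sequences, and the candidate with the highest probability is the preferred translation. For example, of the two sequences "the cat is small" and "small the is cat", the first is clearly far more likely. Once a model has learned the regularities of a language (the probability distribution over word sequences), it can be applied to many NLP tasks.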
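Language models come in two main types:
1. Statistical Language Model. This kind of LM uses traditional statistical techniques such as N-grams and HMMs, or specific statistical rules, to learn the probability distribution of words.
2. Neural Language Model. This kind of LM uses a neural network to model language.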
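An N-gram is a sequence of N words. The following example makes this concrete:
“I love reading blogs about data science on Analytics Vidhya.”
From this sentence, the 1-grams are single-word units such as "I", "love", and "reading", and the 2-grams are sequences of two consecutive words such as "I love", "love reading", and "reading blogs". An N-gram model estimates the probability of a length-N word sequence occurring in natural language.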
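The extraction described above can be sketched in a few lines of Python. This is a minimal illustration that assumes naive whitespace tokenization (so the final period stays attached to "Vidhya."); the helper name `extract_ngrams` is hypothetical and not part of the original article.

```python
def extract_ngrams(tokens, n):
    """Return all contiguous n-grams (as tuples) from a list of tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


sentence = "I love reading blogs about data science on Analytics Vidhya."
tokens = sentence.split()  # naive whitespace tokenization (assumption)

one_grams = extract_ngrams(tokens, 1)
two_grams = extract_ngrams(tokens, 2)

print(one_grams[:3])  # [('I',), ('love',), ('reading',)]
print(two_grams[:3])  # [('I', 'love'), ('love', 'reading'), ('reading', 'blogs')]
```

A sentence with 10 tokens yields 10 one-grams and 9 two-grams, since each n-gram needs n consecutive tokens.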
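To predict the probability of a word sequence of length n, that is, to build the N-gram model, apply the chain rule to obtain the joint probability of the n words:
p(w1 ... wn) = p(w1) · p(w2 | w1) · p(w3 | w1 w2) · p(w4 | w1 w2 w3) · ... · p(wn | w1 ... wn-1)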
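When the history is long, conditioning on every previous word, p(wn | w1 ... wn-1), makes the model space too large and the model too complex, and it also runs into data sparsity. The usual remedy is the Markov assumption: the next word depends only on the current word and not on any earlier history, so the conditional probability simplifies to p(wn | wn-1).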
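To make the chain rule under the Markov assumption concrete, here is a small sketch that estimates bigram probabilities from a toy corpus and scores the two example sequences from earlier. The three-sentence corpus and the `<s>` / `</s>` boundary markers are assumptions made for illustration; they are not from the original article.

```python
from collections import Counter

# Toy corpus (an assumption for illustration only)
corpus = [
    "the cat is small",
    "the dog is small",
    "the cat is cute",
]

unigram_counts = Counter()
bigram_counts = Counter()
for line in corpus:
    # <s> and </s> mark the sentence start and end
    words = ["<s>"] + line.split() + ["</s>"]
    unigram_counts.update(words)
    bigram_counts.update(zip(words, words[1:]))


def bigram_sentence_prob(sentence):
    """p(w1 ... wn) approximated as the product of p(wi | wi-1)."""
    words = ["<s>"] + sentence.split() + ["</s>"]
    prob = 1.0
    for prev, cur in zip(words, words[1:]):
        if unigram_counts[prev] == 0:
            return 0.0  # history word never seen in the corpus
        prob *= bigram_counts[(prev, cur)] / unigram_counts[prev]
    return prob


print(round(bigram_sentence_prob("the cat is small"), 4))  # 0.4444
print(bigram_sentence_prob("small the is cat"))            # 0.0
```

The well-formed sentence gets probability 4/9, while the scrambled one gets 0 because no training sentence starts with "small".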
Next, build an N-gram language model in practice. The Reuters corpus used here contains 10,788 news documents and 1,300,000 words in total. The following code builds the language model:
# code courtesy of https://nlpforhackers.io/language-models/
# Assumes the Reuters corpus data is available locally, e.g. via nltk.download('reuters')
from nltk.corpus import reuters
from nltk import bigrams, trigrams
from collections import Counter, defaultdict
# Create a placeholder for model
model = defaultdict(lambda: defaultdict(lambda: 0))
# Count frequency of co-occurrence
for sentence in reuters.sents():
    for w1, w2, w3 in trigrams(sentence, pad_right=True, pad_left=True):
        model[(w1, w2)][w3] += 1
# Let's transform the counts to probabilities
for w1_w2 in model:
    total_count = float(sum(model[w1_w2].values()))
    for w3 in model[w1_w2]:
        model[w1_w2][w3] /= total_count
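The logic is straightforward: count every trigram (w1, w2, w3) in the corpus, then, for each pair (w1, w2), convert the counts into the probability of w3 given that pair. With this model you can keep predicting the next word. Note that this differs from the first-order Markov assumption described earlier: it conditions on one more history word, so p(wn | w1 ... wn-1) is approximated by p(wn | wn-2 wn-1).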
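Starting from the two seed words "today the", sentences generated with different random thresholds are reasonably readable. This N-gram model rests on the same basic principle that Google, Alexa, and Apple use for language modeling.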
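The generation step itself is not shown in these notes. As a rough sketch, assuming it works by drawing a random threshold and walking the cumulative next-word distribution until the threshold is reached, it could look like the following. The toy sentences stand in for the Reuters corpus, and `padded_trigrams`, `trigram_probs`, and `generate` are hypothetical names chosen for this example.

```python
import random
from collections import defaultdict

# Toy sentences standing in for the Reuters corpus (assumption)
toy_sentences = [
    "today the market rose sharply".split(),
    "today the company said profits fell".split(),
    "the company said the market rose".split(),
]


def padded_trigrams(words):
    """Yield trigrams with None padding on both sides."""
    padded = [None, None] + list(words) + [None, None]
    return zip(padded, padded[1:], padded[2:])


counts = defaultdict(lambda: defaultdict(int))
for sent in toy_sentences:
    for w1, w2, w3 in padded_trigrams(sent):
        counts[(w1, w2)][w3] += 1

# Normalize counts into p(w3 | w1, w2)
trigram_probs = {
    history: {w: c / sum(nexts.values()) for w, c in nexts.items()}
    for history, nexts in counts.items()
}


def generate(seed_words, rng, max_words=30):
    """Extend seed_words one word at a time by sampling from the trigram model."""
    text = list(seed_words)
    while len(text) < max_words:
        candidates = trigram_probs.get(tuple(text[-2:]))
        if not candidates:
            break  # unseen history: nothing to sample from
        threshold = rng.random()
        cumulative = 0.0
        chosen = None
        for word, prob in candidates.items():
            chosen = word
            cumulative += prob
            if cumulative >= threshold:
                break
        if chosen is None:  # None is the end-of-sentence padding
            break
        text.append(chosen)
    return " ".join(text)


print(generate(["today", "the"], random.Random(0)))
```

Passing a seeded `random.Random` keeps runs reproducible; a different seed draws different thresholds and may produce a different sentence.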
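N-gram models have two main limitations:
1. An N-gram model usually performs better as N grows, but the computational cost also rises sharply with N, and memory consumption grows exponentially.
2. An N-gram model treats language as discrete symbols, so any word combination that never co-occurs in the corpus gets a joint probability of 0.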
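The zero-probability problem in the second point is easy to reproduce. The sketch below compares the plain count-based estimate with add-one (Laplace) smoothing, a common remedy that the original article does not cover; it is shown here only as one possible workaround, on a toy token stream chosen for illustration.

```python
from collections import Counter

# Toy token stream (an assumption for illustration only)
tokens = "the cat is small the dog is small".split()
vocab = sorted(set(tokens))

bigram_counts = Counter(zip(tokens, tokens[1:]))
history_counts = Counter(tokens[:-1])  # how often each word appears as a history


def mle_prob(prev, cur):
    """Plain relative-frequency estimate: unseen pairs get exactly 0."""
    return bigram_counts[(prev, cur)] / history_counts[prev]


def laplace_prob(prev, cur):
    """Add-one smoothing: every pair gets a small non-zero probability."""
    return (bigram_counts[(prev, cur)] + 1) / (history_counts[prev] + len(vocab))


print(mle_prob("cat", "small"))                # 0.0, the pair never occurs
print(round(laplace_prob("cat", "small"), 4))  # 0.1667
```

Smoothing takes probability mass away from observed pairs to cover unseen ones, so the smoothed estimates for each history still sum to 1 over the vocabulary.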
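Deep learning has performed well on many NLP tasks, such as summarization and machine translation. These tasks all build on an LM, so a great deal of research has gone into modeling LMs with deep neural networks. A neural LM can work at the character level or the word level; the example below uses a character-level LM.
First, the problem statement: train an LM on a given corpus, then generate a continuation of a given text so that it matches the style of the corpus and is grammatical.
Now build the neural LM. The corpus is the Declaration of Independence. Import the required packages and read the text: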
import numpy as np
import pandas as pd
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, GRU, Embedding
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint
file_name = "Declaration_of_Independence.txt"
# Join lines with a space so that words at line breaks are not glued together
with open(file_name) as f:
    data_text = " ".join(line.strip() for line in f)
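Next, apply some simple cleaning: 1) convert uppercase letters to lowercase; 2) strip the trailing 's from words; 3) remove punctuation and other non-letter characters; 4) drop words shorter than three characters: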
import re
def text_cleaner(text):
    # lower case text
    newString = text.lower()
    # strip possessive 's
    newString = re.sub(r"'s\b", "", newString)
    # remove punctuations
    newString = re.sub("[^a-zA-Z]", " ", newString)
    # remove short words (fewer than 3 characters)
    long_words = []
    for i in newString.split():
        if len(i) >= 3:
            long_words.append(i)
    return (" ".join(long_words)).strip()
# preprocess the text
data_new = text_cleaner(data_text)
# build overlapping windows of 30 input characters plus 1 target character
def create_seq(text):
    length = 30
    sequences = list()
    for i in range(length, len(text)):
        # select sequence of tokens
        seq = text[i-length:i+1]
        # store
        sequences.append(seq)
    print('Total Sequences: %d' % len(sequences))
    return sequences
# create sequences
sequences = create_seq(data_new)
# create a character mapping index
chars = sorted(list(set(data_new)))
mapping = dict((c, i) for i, c in enumerate(chars))
def encode_seq(seq):
    sequences = list()
    for line in seq:
        # integer encode line
        encoded_seq = [mapping[char] for char in line]
        # store
        sequences.append(encoded_seq)
    return sequences
# encode the sequences
sequences = encode_seq(sequences)
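The windowing and encoding steps above can be traced on a short string. This sketch uses a window of 4 input characters instead of 30 so that the result is easy to read; the sample string and the names `make_windows` and `char_to_id` are hypothetical and chosen only for this illustration.

```python
def make_windows(text, length):
    """Slide over text, returning chunks of `length` inputs plus 1 target character."""
    return [text[i - length:i + 1] for i in range(length, len(text))]


sample = "hello world"
windows = make_windows(sample, 4)

# Map each distinct character (including the space) to an integer id
chars = sorted(set(sample))
char_to_id = {c: i for i, c in enumerate(chars)}

encoded = [[char_to_id[c] for c in w] for w in windows]
inputs = [seq[:-1] for seq in encoded]   # first 4 ids of each window
targets = [seq[-1] for seq in encoded]   # the id the model should predict

print(windows[0])   # hello
print(inputs[0])    # [3, 2, 4, 4]
print(targets[0])   # 5
```

An 11-character string with a window of 4 yields 7 training examples, one for each position that has a full history before it.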
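The input is now a list of integer-id sequences, each of length 31 (30 input characters plus the target character), with ids running from 0 to len(mapping) - 1, one id per distinct character including the space. Next, split the data into training and validation sets, holding out 10% for validation: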
from sklearn.model_selection import train_test_split
# vocabulary size
vocab = len(mapping)
sequences = np.array(sequences)
# create X and y
X, y = sequences[:,:-1], sequences[:,-1]
# one hot encode y
y = to_categorical(y, num_classes=vocab)
# create train and validation sets
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.1, random_state=42)
print('Train shape:', X_tr.shape, 'Val shape:', X_val.shape)
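The output is Train shape: (6345, 30) Val shape: (706, 30). Now build the model, which has three simple layers: an embedding layer of dimension 50, a GRU layer with 150 hidden units, and a fully connected layer with softmax activation: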
# define model
model = Sequential()
model.add(Embedding(vocab, 50, input_length=30, trainable=True))
model.add(GRU(150, recurrent_dropout=0.1, dropout=0.1))
model.add(Dense(vocab, activation='softmax'))
# compile the model
model.compile(loss='categorical_crossentropy', metrics=['acc'], optimizer='adam')
# fit the model
model.fit(X_tr, y_tr, epochs=100, verbose=2, validation_data=(X_val, y_val))
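To show what the softmax output layer and the categorical_crossentropy loss in the model above compute, here is a plain-Python sketch for a single prediction over a three-character vocabulary. The logit values are made up for illustration, and the functions are simplified stand-ins, not the Keras implementations.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def categorical_crossentropy(one_hot_target, probs):
    """Negative log-probability assigned to the true class."""
    return -sum(t * math.log(p) for t, p in zip(one_hot_target, probs) if t > 0)


logits = [2.0, 0.5, -1.0]   # made-up scores for 3 characters
probs = softmax(logits)
target = [1, 0, 0]          # one-hot: the true next character is class 0
loss = categorical_crossentropy(target, probs)

print([round(p, 3) for p in probs])
print(round(loss, 3))
```

The loss shrinks toward 0 as the probability assigned to the true character approaches 1, which is what training pushes the network toward.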
# generate a sequence of characters with a language model
def generate_seq(model, mapping, seq_length, seed_text, n_chars):
    in_text = seed_text
    # generate a fixed number of characters
    for _ in range(n_chars):
        # encode the characters as integers
        encoded = [mapping[char] for char in in_text]
        # truncate sequences to a fixed length
        encoded = pad_sequences([encoded], maxlen=seq_length, truncating='pre')
        # predict character: predict_classes was removed from recent TensorFlow
        # releases, so take the argmax of the predicted probabilities instead
        yhat = int(np.argmax(model.predict(encoded, verbose=0), axis=-1)[0])
        # reverse map integer to character
        out_char = ''
        for char, index in mapping.items():
            if index == yhat:
                out_char = char
                break
        # append the predicted character to the input
        in_text += out_char
    return in_text
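The control flow of the generation loop above can be seen in isolation by swapping the trained network for a stub. In this sketch, `predict_next` is a hypothetical callable that returns one character for a given context; `next_letter` is a toy stand-in that simply returns the following letter of the alphabet, so the output is easy to verify by hand.

```python
def generate_greedy(predict_next, seed_text, seq_length, n_chars):
    """Repeatedly predict one character and append it to the running text."""
    text = seed_text
    for _ in range(n_chars):
        context = text[-seq_length:]  # keep only the most recent characters
        text += predict_next(context)
    return text


def next_letter(context):
    """Toy stand-in for a trained model: return the letter after the last one."""
    last = context[-1]
    return "a" if last == "z" else chr(ord(last) + 1)


print(generate_greedy(next_letter, "abc", seq_length=2, n_chars=4))  # abcdefg
```

Each predicted character becomes part of the context for the next prediction, which is why one early mistake can affect everything generated after it.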
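In February 2019, OpenAI released GPT-2, a Transformer-based language model trained on a large corpus. GPT-2 is a generative language model built on the Transformer decoder and trained on 40GB of text from the internet; for details, see the GPT-2 paper.
The following uses GPT-2 through PyTorch-Transformers, which includes many SOTA pretrained models. The article recommends running the example code in Google Colab: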
# Import required libraries
import torch
from pytorch_transformers import GPT2Tokenizer, GPT2LMHeadModel
# Load pre-trained model tokenizer (vocabulary)
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
# Encode a text input
text = "What is the fastest car in the"
indexed_tokens = tokenizer.encode(text)
# Convert indexed tokens to a PyTorch tensor
tokens_tensor = torch.tensor([indexed_tokens])
# Load pre-trained model (weights)
model = GPT2LMHeadModel.from_pretrained('gpt2')
# Set the model in evaluation mode to deactivate the DropOut modules
model.eval()
# If you have a GPU, put everything on cuda; otherwise stay on the CPU
device = 'cuda' if torch.cuda.is_available() else 'cpu'
tokens_tensor = tokens_tensor.to(device)
model.to(device)
# Predict all tokens
with torch.no_grad():
    outputs = model(tokens_tensor)
    predictions = outputs[0]
# Get the predicted next sub-word
predicted_index = torch.argmax(predictions[0, -1, :]).item()
predicted_text = tokenizer.decode(indexed_tokens + [predicted_index])
# Print the predicted word
print(predicted_text)
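The code uses the pretrained GPT-2 model to predict the next word in "What is the fastest car in the __". Google's suggested completion is "world", and the model's prediction matches that query suggestion, which indicates that GPT-2 performs well on this kind of completion.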
Two roads diverged in a yellow wood,
And sorry I could not travel both
And be one traveler, long I stood
And looked down one as far as I could
To where it bent in the undergrowth;
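This passage is the first stanza of the poem "The Road Not Taken". Next, use the ready-made script from PyTorch-Transformers to generate the stanzas that follow. Run the following commands directly in Google Colab: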
!git clone https://github.com/huggingface/pytorch-transformers.git
!python pytorch-transformers/examples/pytorch/text-generation/run_generation.py \
--model_type=gpt2 \
--length=100 \
--model_name_or_path=gpt2
Two roads diverged in a yellow wood, And sorry I could not travel both And be one traveler, long I stood And looked down one as far as I could To where it bent in the undergrowth; He was no man who could resist, Nor had any one in his own reach To laugh at his vanity; The only thing which could lull him, for he could not remember, The complete unspoken confession of the suffering of Mr. Ford's face, and his mournful agony. And glad I saw them yet. In a darkness beneath the fell moon I at last heard the sigh, Until the moon lifted from the earth's right bethought. Every man rushes before death, When death brings about
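These notes do take the reader from the basics to a deeper understanding of what a language model is, and they show concretely how to build a statistical LM and a neural LM by hand. I will follow up with more detailed introductions to some common LMs.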