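I recently used PyTorch to build an RNN language model (RNNLM). The goal is to start from the One-Hot encoding of every word in the vocabulary (a high-dimensional sparse vector) and produce dense vectors. This post does not cover how RNNs work or why one would use an RNN language model; it only walks through how the PyTorch code is used.
There is still relatively little material on PyTorch, so I got started mainly by reading the PyTorch documentation and using the official PyTorch forum.
The full code is as follows: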
import torch
from torch import nn, optim
from torch.utils.data import DataLoader

from mydataset import MyDataset

BATCH_SIZE = 5
HIDDEN_UNITS = 200

sentence_set = """When forty winters shall besiege thy brow,
And dig deep trenches in thy beauty's field,
Thy youth's proud livery so gazed on now,
Will be a totter'd weed of small worth held:
Then being asked, where all thy beauty lies,
Where all the treasure of thy lusty days;
To say, within thine own deep sunken eyes,
Were an all-eating shame, and thriftless praise.
How much more praise deserv'd thy beauty's use,
If thou couldst answer 'This fair child of mine
Shall sum my count, and make my old excuse,'
Proving his beauty by succession thine!
This were to be new made when thou art old,
And see thy blood warm when thou feel'st it cold.""".split()

# Build the vocabulary: map each distinct word to an integer index.
word_to_ix = {}
for word in sentence_set:
    if word not in word_to_ix:
        word_to_ix[word] = len(word_to_ix)
print(word_to_ix)

# One-Hot dimension: vocabulary size plus one slot for unknown words.
EMBDDING_DIM = len(word_to_ix) + 1


def make_word_to_ix(word, word_to_ix):
    """Return the One-Hot vector (a float tensor) for `word`."""
    vec = torch.zeros(EMBDDING_DIM)
    if word in word_to_ix:
        vec[word_to_ix[word]] = 1
    else:
        vec[len(word_to_ix)] = 1  # extra slot for out-of-vocabulary words
    return vec


# Each sample is a word; its label is the word that follows it.
data_words = []
data_labels = []
for i in range(len(sentence_set) - 1):
    word = sentence_set[i]
    label = sentence_set[i + 1]
    data_words.append(make_word_to_ix(word, word_to_ix))
    data_labels.append(make_word_to_ix(label, word_to_ix))

dataset = MyDataset(data_words, data_labels)
# drop_last keeps every batch at BATCH_SIZE so the carried hidden state always fits.
train_loader = DataLoader(dataset, batch_size=BATCH_SIZE, drop_last=True)


class RNNModel(nn.Module):
    def __init__(self, embdding_size, hidden_size):
        super(RNNModel, self).__init__()
        self.rnn = nn.RNN(embdding_size, hidden_size, num_layers=1, nonlinearity='relu')
        self.linear = nn.Linear(hidden_size, embdding_size)

    def forward(self, x, hidden):
        # nn.RNN expects (seq_len, batch, input_size); each word is a sequence of length 1.
        output1, h_n = self.rnn(x.unsqueeze(0), hidden)
        # Return raw scores: nn.CrossEntropyLoss applies log-softmax itself.
        output2 = self.linear(output1.squeeze(0))
        return output2, h_n


rnnmodel = RNNModel(EMBDDING_DIM, HIDDEN_UNITS)
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(rnnmodel.parameters(), lr=1e-3)

for epoch in range(3):
    print('epoch: {}'.format(epoch + 1))
    print('*' * 10)
    running_loss = 0.0
    # Initial hidden state: (num_layers, batch, hidden_size).
    input_hidden = torch.randn(1, BATCH_SIZE, HIDDEN_UNITS)
    for x, y in train_loader:
        # forward
        out, input_hidden = rnnmodel(x, input_hidden)
        # Detach so gradients do not flow back into earlier batches.
        input_hidden = input_hidden.detach()
        trgt = torch.max(y, 1)[1]  # One-Hot labels -> class indices
        loss = criterion(out, trgt)
        running_loss += loss.item()
        # backward
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    # Average loss per batch.
    print('Loss: {:.6f}'.format(running_loss / len(train_loader)))

# rnn.weight_ih_l0 has shape (hidden_size, EMBDDING_DIM);
# column j is the dense vector of the word with index j.
alpha = rnnmodel.state_dict()['rnn.weight_ih_l0']
with open("res-0104-rnn.txt", "w") as f:
    for word in word_to_ix:  # one line per distinct word
        line = word + " " + str(alpha[:, word_to_ix[word]].tolist()) + "\n"
        f.write(line)
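Preprocessing here mainly means splitting the string into a list of words, building the vocabulary, and generating a One-Hot encoding for each word in it. Corpora of other kinds or for other purposes would need a tokenizer; here the text is simply split into a list, which is simple enough that I will not go into it further.
#### Building the vocabulary
word_to_ix = {}
for word in sentence_set:
    if word not in word_to_ix:
        word_to_ix[word] = len(word_to_ix)
print(word_to_ix)
First we declare a dict; then, for every word in the list, we add it to the dict if it is not there yet. The value stored for a word is the size of the dict at the moment the word is added, so each distinct word gets a consecutive index starting at 0.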
def make_word_to_ix(word, word_to_ix):
    # Start from an all-zero float vector of length EMBDDING_DIM.
    vec = torch.zeros(EMBDDING_DIM)
    if word in word_to_ix:
        vec[word_to_ix[word]] = 1
    else:
        vec[len(word_to_ix)] = 1  # extra slot for words not in the vocabulary
    return vec
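EMBDDING_DIM is the dimension of the One-Hot encoding. It is set to the vocabulary size plus 1, reserving one slot for words that are not in the vocabulary.
For example, with the vocabulary {Apple: 0, Banana: 1, Orange: 2}:
Apple's One-Hot vector is [1,0,0,0]
Banana's is [0,1,0,0]
Orange's is [0,0,1,0]
Lemon, which is not in the vocabulary, is [0,0,0,1]
The returned value is a torch.FloatTensor. In PyTorch, vectors and matrices are represented as tensors (Tensor).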
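The fruit example above can be reproduced with a short sketch. The names `toy_vocab`, `toy_dim` and `one_hot` are hypothetical and used only for illustration; the logic mirrors make_word_to_ix.

```python
import torch

# Hypothetical toy vocabulary taken from the fruit example.
toy_vocab = {"Apple": 0, "Banana": 1, "Orange": 2}
toy_dim = len(toy_vocab) + 1  # one extra slot for unknown words


def one_hot(word, vocab, dim):
    """Return a float One-Hot tensor; unknown words use the last slot."""
    vec = torch.zeros(dim)
    vec[vocab.get(word, len(vocab))] = 1
    return vec


banana = one_hot("Banana", toy_vocab, toy_dim)
lemon = one_hot("Lemon", toy_vocab, toy_dim)
print(banana.tolist())  # [0.0, 1.0, 0.0, 0.0]
print(lemon.tolist())   # [0.0, 0.0, 0.0, 1.0]
print(banana.dtype)     # torch.float32
```

Note that every unknown word shares the same last slot, so the model cannot tell two out-of-vocabulary words apart.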
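torch provides an abstract Dataset class: you subclass torch.utils.data.Dataset to implement your own dataset and then load it with a DataLoader.
For details on Dataset and DataLoader, see the PyTorch English documentation and the PyTorch Chinese documentation.
In short, we wrap the samples (data plus labels) in a Dataset and use a DataLoader to read them for training.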
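As a rough illustration of how the two classes fit together, here is a minimal sketch. `PairDataset` is a hypothetical class and the seven all-zero samples are made up; the point is only to show the batch shapes a DataLoader yields.

```python
import torch
from torch.utils.data import Dataset, DataLoader


class PairDataset(Dataset):
    """Hypothetical minimal dataset holding (input, label) tensor pairs."""

    def __init__(self, inputs, labels):
        self.inputs = inputs
        self.labels = labels

    def __getitem__(self, index):
        return self.inputs[index], self.labels[index]

    def __len__(self):
        return len(self.inputs)


# Seven made-up samples, each a 4-dimensional vector.
inputs = [torch.zeros(4) for _ in range(7)]
labels = [torch.zeros(4) for _ in range(7)]
loader = DataLoader(PairDataset(inputs, labels), batch_size=3)

# The DataLoader stacks individual samples into (batch, 4) tensors.
batch_shapes = [tuple(x.shape) for x, _ in loader]
print(batch_shapes)  # [(3, 4), (3, 4), (1, 4)]
```

The last batch is smaller because 7 is not a multiple of 3. Passing drop_last=True to DataLoader discards such an incomplete batch, which matters when later code assumes a fixed batch size.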
data_words = []
data_labels = []
# The last word has no successor, so there are len(sentence_set) - 1 pairs.
for i in range(len(sentence_set) - 1):
    word = sentence_set[i]
    label = sentence_set[i + 1]
    data_words.append(make_word_to_ix(word, word_to_ix))
    data_labels.append(make_word_to_ix(label, word_to_ix))
dataset = MyDataset(data_words, data_labels)
# drop_last keeps every batch at exactly BATCH_SIZE samples.
train_loader = DataLoader(dataset, batch_size=BATCH_SIZE, drop_last=True)
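The samples are generated from the One-Hot encodings above: each word in the list is the sample's data, and the word that follows it is the sample's label.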
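To make the pairing concrete, here is a small sketch on a made-up five-word sentence (the sentence and the name `pairs` are illustrative only), before any One-Hot encoding is applied:

```python
# A made-up five-word sentence, only for illustration.
tokens = "thy beauty lies within thine".split()

# Pair every word with its successor; the last word has no successor.
pairs = [(tokens[i], tokens[i + 1]) for i in range(len(tokens) - 1)]
print(pairs)
```

A list of n words therefore yields n - 1 (word, next word) samples.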
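### Neural network model
torch provides torch.nn.Module as the base class for all neural network models. A custom network should subclass nn.Module and implement the __init__() and forward() methods: __init__() defines the structure of the network, and forward() defines how the forward pass is computed.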
class RNNModel(nn.Module):
    def __init__(self, embdding_size, hidden_size):
        super(RNNModel, self).__init__()
        self.rnn = nn.RNN(embdding_size, hidden_size, num_layers=1, nonlinearity='relu')
        self.linear = nn.Linear(hidden_size, embdding_size)

    def forward(self, x, hidden):
        # nn.RNN expects (seq_len, batch, input_size); each word is a sequence of length 1.
        output1, h_n = self.rnn(x.unsqueeze(0), hidden)
        # Return raw scores: nn.CrossEntropyLoss applies log-softmax itself.
        output2 = self.linear(output1.squeeze(0))
        return output2, h_n


rnnmodel = RNNModel(EMBDDING_DIM, HIDDEN_UNITS)
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(rnnmodel.parameters(), lr=1e-3)
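This defines the network's loss function and the optimizer. The optimizer comes from torch.optim; its job is to update the parameters using the computed gradients. When creating the optimizer we pass it the parameters to update and the learning rate lr. SGD means the stochastic gradient descent algorithm is used.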
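The update that plain SGD applies (no momentum, no weight decay) is w <- w - lr * grad. The sketch below applies that rule by hand to a made-up one-parameter loss, f(w) = (w - 3)^2. It is not the torch.optim implementation, only an illustration of the rule, and the learning rate and step count are arbitrary.

```python
# Plain gradient descent on f(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
lr = 0.1   # learning rate (illustrative value)
w = 0.0    # initial parameter

for step in range(50):
    grad = 2 * (w - 3)   # gradient of the loss at the current w
    w = w - lr * grad    # the update rule: w <- w - lr * grad

print(round(w, 4))
```

Each step shrinks the distance to the minimum at w = 3 by a factor of 0.8, so w ends up very close to 3. In real training the gradient comes from loss.backward() on one minibatch, which is where the word "stochastic" comes from.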
for epoch in range(3):
    print('epoch: {}'.format(epoch + 1))
    print('*' * 10)
    running_loss = 0.0
    # Initial hidden state: (num_layers, batch, hidden_size).
    input_hidden = torch.randn(1, BATCH_SIZE, HIDDEN_UNITS)
    for x, y in train_loader:
        # forward
        out, input_hidden = rnnmodel(x, input_hidden)
        # Detach so gradients do not flow back into earlier batches.
        input_hidden = input_hidden.detach()
        trgt = torch.max(y, 1)[1]  # One-Hot labels -> class indices
        loss = criterion(out, trgt)
        running_loss += loss.item()
        # backward
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    # Average loss per batch.
    print('Loss: {:.6f}'.format(running_loss / len(train_loader)))
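epoch is the number of passes over the whole dataset; the code makes 3 passes.
torch.autograd.Variable was the basic unit of graph computation in early PyTorch, when every tensor used in a computation had to be wrapped in a Variable. Since PyTorch 0.4, Variable has been merged into Tensor, so the wrapper is no longer needed. The loss function deserves the most explanation here: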
criterion = nn.CrossEntropyLoss()
…
trgt = torch.max(y, 1)[1]
loss = criterion(out, trgt)
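When we created the DataLoader we specified batch_size, the batch size. This means training uses minibatches: instead of feeding a single sample to the network and immediately computing the loss, we feed a batch of samples and then compute the loss.
For example, if our data (One-Hot vectors) is 100-dimensional and batch_size is 3, the x actually fed into the RNN is a 3 x 100 tensor.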
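With its default batch_first=False, nn.RNN expects input of shape (seq_len, batch, input_size) and a hidden state of shape (num_layers, batch, hidden_size). The sketch below feeds one 3 x 100 minibatch by treating each word as a sequence of length 1. That treatment is an assumption of this sketch, and the hidden size of 8 is arbitrary.

```python
import torch
from torch import nn

batch_size, input_dim, hidden_dim = 3, 100, 8  # illustrative sizes

rnn = nn.RNN(input_dim, hidden_dim, num_layers=1, nonlinearity='relu')

x = torch.zeros(batch_size, input_dim)        # one 3 x 100 minibatch
h0 = torch.zeros(1, batch_size, hidden_dim)   # (num_layers, batch, hidden)

# Add a leading sequence dimension of length 1: (1, 3, 100).
output, h_n = rnn(x.unsqueeze(0), h0)
print(tuple(output.shape))  # (1, 3, 8)
print(tuple(h_n.shape))     # (1, 3, 8)
```

Calling output.squeeze(0) gives back a (batch, hidden) tensor that a following nn.Linear layer can map to one score per vocabulary entry.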
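The cross-entropy loss, CrossEntropyLoss, is commonly used for classification into C classes. Its two arguments have the following shapes:
input (N, C), where N is batch_size and C is the number of classes
target (N), with 0 <= target[i] <= C-1; that is, target is a vector of length N holding the class index of each sample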
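To show what these two arguments mean numerically, here is a hand-written sketch of the quantity the loss computes with its default mean reduction: the average over the batch of -log(softmax(input[i])[target[i]]). The function `cross_entropy` and the scores are made up for illustration; this is not the PyTorch implementation.

```python
import math


def cross_entropy(scores, targets):
    """Mean of -log(softmax(row)[target]) over the batch."""
    total = 0.0
    for row, target in zip(scores, targets):
        log_sum_exp = math.log(sum(math.exp(s) for s in row))
        total += log_sum_exp - row[target]
    return total / len(scores)


# N = 2 samples, C = 3 classes (made-up scores).
scores = [[2.0, 0.0, 0.0],
          [0.0, 0.0, 0.0]]
targets = [0, 2]  # one class index per sample, each in [0, C-1]

loss = cross_entropy(scores, targets)
print(round(loss, 4))
```

Because the softmax is part of the loss, the network output passed as input should be raw scores; applying log-softmax beforehand is redundant.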
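So here we need torch.max(y, 1), which picks the maximum of each row of y and also returns the column index of that maximum. Because each label is a One-Hot vector, that index is exactly the class index.
For example: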
>>> a = torch.randn(4, 4)
>>> a
 0.0692  0.3142  1.2513 -0.5428
 0.9288  0.8552 -0.2073  0.6409
 1.0695 -0.0101 -2.4507 -1.2230
 0.7426 -0.7666  0.4862 -0.6628
[torch.FloatTensor of size 4x4]
>>> torch.max(a, 1)
(
1.2513
0.9288
1.0695
0.7426
[torch.FloatTensor of size 4]
,
2
0
0
0
[torch.LongTensor of size 4]
)
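The second argument of torch.max(input, dim), dim, is the dimension to reduce over. To take the maximum along the last dimension, use dim = 1 for a two-dimensional input and dim = 2 for a three-dimensional input.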
In[2]: import torch
In[3]: a = torch.randn(2,2,2)
In[4]: a
Out[4]:
(0 ,.,.) =
0.4905 -0.2557
-0.4251 0.1878
(1 ,.,.) =
-0.4327 0.0734
-1.2723 -0.1210
[torch.FloatTensor of size 2x2x2]
In[5]: torch.max(a,2)
Out[5]:
(
0.4905 0.1878
0.0734 -0.1210
[torch.FloatTensor of size 2x2],
0 1
1 1
[torch.LongTensor of size 2x2])
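Applied to One-Hot labels, the same call recovers the class indices that the loss expects. A minimal sketch with two made-up labels over four classes:

```python
import torch

# Two One-Hot labels over four classes (made-up values).
y = torch.tensor([[0., 0., 1., 0.],
                  [1., 0., 0., 0.]])

values, indices = torch.max(y, 1)   # maximum of each row and its column
print(indices.tolist())             # [2, 0]
print(values.tolist())              # [1.0, 1.0]

# dim=-1 always means the last dimension, whatever the number of dimensions.
same = torch.max(y, -1)[1]
print(torch.equal(indices, same))   # True
```

The indices come back as a LongTensor, which is the dtype CrossEntropyLoss requires for its target.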
Next comes the backward pass. Note that the gradients must be zeroed each time to keep them from accumulating. optimizer.step() then updates the parameters.
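The accumulation is easy to observe on a single made-up parameter. In this sketch `w` is a hypothetical scalar and the "loss" is simply 2 * w, whose gradient with respect to w is 2:

```python
import torch

w = torch.tensor(1.0, requires_grad=True)

(2 * w).backward()
first = w.grad.item()        # gradient of 2*w with respect to w

(2 * w).backward()           # no zeroing in between: the gradient is added
accumulated = w.grad.item()

w.grad.zero_()               # clear the gradient, the role optimizer.zero_grad() plays
(2 * w).backward()
after_zero = w.grad.item()

print(first, accumulated, after_zero)  # 2.0 4.0 2.0
```

Without the reset, optimizer.step() would use the sum of the gradients from every batch seen so far rather than the gradient of the current batch.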
Addendum: I later found this code in my GitHub. The mydataset module is as follows:
import torch.utils.data as data


class MyDataset(data.Dataset):
    def __init__(self, words, labels):
        self.words = words
        self.labels = labels

    def __getitem__(self, index):  # returns a pair of tensors
        input, target = self.words[index], self.labels[index]
        return input, target

    def __len__(self):
        return len(self.words)