

Install torchtext and spaCy beforehand by running the commands below (download link for the model archive: https://pan.baidu.com/s/1_syic9B-SXKQvkvHlEf78w extraction code: ahh3):

pip install torchtext

pip install spacy

pip install your_path\en_core_web_md-2.2.5.tar.gz
  • When spaCy is used through torchtext, a Field defaults to tokenizer_language='en'.

  • To use en_core_web_md instead, change the Field default in field.py to tokenizer_language='en_core_web_md', and also pass tokenizer_language='en_core_web_md' to data.Field().

1. Load the data

1.1 Split into training and test sets

import numpy as np
import torch
from torch import nn, optim
from torchtext import data, datasets

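# Set the random seed for the CPU (the value 123 is an arbitrary choice, not taken from the original)
torch.manual_seed(123)

# Two Field objects define how each field is processed (text field, label field)
TEXT = data.Field(tokenize='spacy', tokenizer_language='en_core_web_md')  # tokenize with spaCy
LABEL = data.LabelField(dtype=torch.float)

# IMDB has 50,000 movie reviews in two classes, positive and negative; the Fields above process the data
# Split into a training set and a test set according to (TEXT, LABEL)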
train_data, test_data = datasets.IMDB.splits(TEXT, LABEL)

print('len of train data:', len(train_data))        # 25000
print('len of test data:', len(test_data))          # 25000

# torchtext.data.Example represents one sample: data + label
print(train_data.examples[15].text)                 # text: the sentence as a list of tokens
print(train_data.examples[15].label)                # label: pos (positive)
len of train data: 25000
len of test data: 25000
['Like', 'one', 'of', 'the', 'previous', 'commenters', 'said', ',', 'this', 'had', 'the', 'foundations', 'of', 'a', 'great', 'movie', 'but', 'something', 'happened', 'on', 'the', 'way', 'to', 'delivery', '.', 'Such', 'a', 'waste', 'because', 'Collette', "'s", 'performance', 'was', 'eerie', 'and', 'Williams', 'was', 'believable', '.', 'I', 'just', 'kept', 'waiting', 'for', 'it', 'to', 'get', 'better', '.', 'I', 'do', "n't", 'think', 'it', 'was', 'bad', 'editing', 'or', 'needed', 'another', 'director', ',', 'it', 'could', 'have', 'just', 'been', 'the', 'film', '.', 'It', 'came', 'across', 'as', 'a', 'Canadian', 'movie', ',', 'something', 'like', 'the', 'first', 'few', 'seasons', 'of', 'X', '-', 'Files', '.', 'Not', 'cheap', ',', 'just', 'hokey', '.', 'Also', ',', 'it', 'needed', 'a', 'little', 'more', 'suspense', '.', 'Something', 'that', 'makes', 'you', 'jump', 'off', 'your', 'seat', '.', 'The', 'movie', 'reached', 'that', 'moment', 'then', 'faded', 'away', ';', 'kind', 'of', 'like', 'a', 'false', 'climax', '.', 'I', 'can', 'see', 'how', 'being', 'too', 'suspenseful', 'would', 'have', 'taken', 'away', 'from', 'the', '"', 'reality', '"', 'of', 'the', 'story', 'but', 'I', 'thought', 'that', 'part', 'was', 'reached', 'when', 'Gabriel', 'was', 'in', 'the', 'hospital', 'looking', 'for', 'the', 'boy', '.', 'This', 'movie', 'needs', 'to', 'have', 'a', 'Director', "'s", 'cut', 'that', 'tries', 'to', 'fix', 'these', 'problems', '.']
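  • Sentences are fed to the model one batch at a time, and every sentence in a batch must have the same length.

  • To make the lengths equal, TorchText pads the shorter sentences to the length of the longest sentence in the batch.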
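
The padding step described above can be sketched in plain Python. This is only an illustration of the idea, not TorchText's implementation; `pad_batch` and `PAD` are names made up for this example.

```python
# Illustrative sketch of batch padding; pad_batch and PAD are hypothetical
# names, not part of the torchtext API.
PAD = '<pad>'

def pad_batch(sentences, pad_token=PAD):
    """Right-pad every token list to the length of the longest one."""
    max_len = max(len(s) for s in sentences)
    return [s + [pad_token] * (max_len - len(s)) for s in sentences]

batch = [['a', 'great', 'movie'], ['terrible'], ['not', 'bad']]
padded = pad_batch(batch)
for row in padded:
    print(row)
```

After padding, every row has the length of the longest sentence, so the batch can be stacked into one rectangular tensor.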

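1.2 Build the vocabulary

  • The vocabulary maps each word one-to-one to an integer. The 10k most frequent words are used to build it (set with the max_size parameter); all other words are represented by <unk>.

  • The vocabulary should contain 10002 entries (the 10000 words plus <unk> and <pad>), and there are two labels. Both can be inspected through TEXT.vocab and LABEL.vocab, and stoi (string to int) or itos (int to string) give direct access to the word table.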

TEXT.build_vocab(train_data, max_size=10000, vectors='glove.6B.100d')

print(len(TEXT.vocab))             # 10002
print(TEXT.vocab.itos[:12])        # ['<unk>', '<pad>', 'the', ',', '.', 'and', 'a', 'of', 'to', 'is', 'in', 'I']
print(TEXT.vocab.stoi['and'])      # 5
print(LABEL.vocab.stoi)            # defaultdict(None, {'neg': 0, 'pos': 1})
['<unk>', '<pad>', 'the', ',', '.', 'and', 'a', 'of', 'to', 'is', 'in', 'I']
defaultdict(, {'neg': 0, 'pos': 1})
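
To make the word-to-index mapping concrete, here is a small plain-Python sketch of a frequency-capped vocabulary. It assumes the same two special tokens and breaks frequency ties alphabetically; `build_vocab` is a hypothetical helper written for this example, not the torchtext function.

```python
from collections import Counter

def build_vocab(token_lists, max_size, specials=('<unk>', '<pad>')):
    """Keep the max_size most frequent tokens, placed after the special tokens."""
    counter = Counter(tok for toks in token_lists for tok in toks)
    # Most frequent first; ties are broken alphabetically in this sketch
    ranked = sorted(counter.items(), key=lambda kv: (-kv[1], kv[0]))[:max_size]
    itos = list(specials) + [word for word, _ in ranked]
    stoi = {word: idx for idx, word in enumerate(itos)}
    return itos, stoi

corpus = [['the', 'film', 'is', 'good'],
          ['the', 'film', 'is', 'bad'],
          ['the', 'end']]
itos, stoi = build_vocab(corpus, max_size=3)

# Words outside the vocabulary fall back to the index of <unk>
unk = stoi['<unk>']
encoded = [stoi.get(tok, unk) for tok in ['the', 'film', 'was', 'good']]
print(itos)
print(encoded)
```

With max_size=3 the table has 3 + 2 entries, which mirrors why max_size=10000 gives 10002 entries above.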

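1.3 Create the iterators

  • Each batch produced by an iterator has two parts, the text (.text) and the label (.label); the text has already been converted to integers.

  • BucketIterator puts sentences of similar length into the same batch, so that no batch contains too much padding.

  • Because there is little padding here, the <pad> tokens are also fed to the model as input during training.

  • With a GPU, you can also have every iteration return its tensors on the GPU.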

batchsz = 30
# Use the GPU when one is available, otherwise fall back to the CPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
train_iterator, test_iterator = data.BucketIterator.splits(
                                (train_data, test_data),
                                batch_size = batchsz,
                                device = device
)

2. Define the model

class RNN(nn.Module):

  def __init__(self, vocab_size, embedding_dim, hidden_dim):
    super(RNN, self).__init__()

    # [0-10001] => [100]
    # arg 1: number of embeddings (vocabulary size), arg 2: embedding dimension (word-vector size)
    self.embedding = nn.Embedding(vocab_size, embedding_dim)
    # [100] => [256]
    # bidirectional LSTM, so the FC layer below takes hidden_dim*2
    self.rnn = nn.LSTM(embedding_dim, hidden_dim, num_layers=2,
                       bidirectional=True, dropout=0.5)
    # [256*2] => [1]
    self.fc = nn.Linear(hidden_dim*2, 1)
    self.dropout = nn.Dropout(0.5)

  def forward(self, x):
    # x: [seq_len, b] (compare with [b, 3, 28, 28] for a batch of images)
    # [seq_len, b, 1] => [seq_len, b, 100]
    embedding = self.dropout(self.embedding(x))

    # output: [seq, b, hid_dim*2]
    # hidden/h: [num_layers*2, b, hid_dim]
    # cell/c: [num_layers*2, b, hid_dim]
    output, (hidden, cell) = self.rnn(embedding)
    # [num_layers*2, b, hid_dim] => 2 of [b, hid_dim] => [b, hid_dim*2]
    # bidirectional, so concatenate the final forward and backward states of the top layer
    hidden = torch.cat([hidden[-2], hidden[-1]], dim=1)
    # [b, hid_dim*2] => [b, 1]
    hidden = self.dropout(hidden)
    out = self.fc(hidden)

    return out
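
The shape comments in forward can be checked on a tiny bidirectional LSTM. The sizes below are arbitrary values chosen for this sketch, and dropout is left out so the check is deterministic.

```python
import torch
from torch import nn

torch.manual_seed(0)
seq_len, b, emb_dim, hid_dim = 5, 2, 4, 3   # small illustrative sizes
lstm = nn.LSTM(emb_dim, hid_dim, num_layers=2, bidirectional=True)

x = torch.randn(seq_len, b, emb_dim)
output, (hidden, cell) = lstm(x)
print(output.shape)   # [seq_len, b, hid_dim*2]
print(hidden.shape)   # [num_layers*2, b, hid_dim]

# hidden[-2] is the final forward state of the top layer,
# hidden[-1] is the final backward state of the top layer
final = torch.cat([hidden[-2], hidden[-1]], dim=1)
print(final.shape)    # [b, hid_dim*2]

# The same states appear in output: the forward half at the last time step,
# the backward half at the first time step
same_fwd = torch.allclose(output[-1, :, :hid_dim], hidden[-2])
same_bwd = torch.allclose(output[0, :, hid_dim:], hidden[-1])
print(same_fwd, same_bwd)
```

This is why the model concatenates hidden[-2] and hidden[-1] instead of simply taking output[-1]: the last time step of output holds the backward direction after it has seen only one token.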
  • Replace the randomly initialized embedding with the pretrained one.

  • Tip: functions with a trailing underscore, such as .copy_(), operate in place.

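rnn = RNN(len(TEXT.vocab), 100, 256)                          # vocabulary size, embedding dim, hidden dim

pretrained_embedding = TEXT.vocab.vectors
print('pretrained_embedding:', pretrained_embedding.shape)    # torch.Size([10002, 100])

# Replace the randomly initialized embedding weights with the pretrained vectors
rnn.embedding.weight.data.copy_(pretrained_embedding)
print('embedding layer inited.')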
pretrained_embedding: torch.Size([10002, 100])
embedding layer inited.
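
As a small illustration of the in-place tip, the sketch below contrasts an out-of-place call with its underscore variant and then copies a stand-in matrix into an embedding layer. The 4x3 tensor is a made-up stand-in for real pretrained vectors.

```python
import torch
from torch import nn

t = torch.zeros(3)
u = t.add(1)      # out-of-place: returns a new tensor, t is unchanged
t_unchanged = bool(torch.equal(t, torch.zeros(3)))
t.add_(1)         # in-place: t itself is modified

embedding = nn.Embedding(4, 3)
# Made-up stand-in for pretrained vectors, shape [vocab_size, embedding_dim]
pretrained = torch.arange(12, dtype=torch.float).reshape(4, 3)

weight_before = embedding.weight
embedding.weight.data.copy_(pretrained)

# The parameter object is the same one as before; only its values changed
print(embedding.weight is weight_before)
print(torch.equal(embedding.weight.data, pretrained))
```

Because the existing parameter is overwritten rather than replaced, an optimizer created from rnn.parameters() still refers to the right tensor.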

3. Train the model

  • First define the optimizer and the loss function.
optimizer = optim.Adam(rnn.parameters(), lr=1e-3)
# BCEWithLogitsLoss is the cross-entropy loss for binary classification, applied to raw logits
criteon = nn.BCEWithLogitsLoss().to(device)
# Move the model to the same device as the data
rnn.to(device)
RNN(
  (embedding): Embedding(10002, 100)
  (rnn): LSTM(100, 256, num_layers=2, dropout=0.5, bidirectional=True)
  (fc): Linear(in_features=512, out_features=1, bias=True)
  (dropout): Dropout(p=0.5, inplace=False)
)
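
What the loss computes can be written out by hand: a sigmoid on each logit followed by binary cross-entropy, averaged over the batch. The plain-Python version below is a direct transcription of that formula for illustration only; PyTorch's own implementation is documented as more numerically stable than this naive form.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def bce_with_logits(logits, targets):
    """Mean binary cross-entropy computed from raw logits (naive form)."""
    losses = []
    for x, y in zip(logits, targets):
        p = sigmoid(x)
        losses.append(-(y * math.log(p) + (1 - y) * math.log(1 - p)))
    return sum(losses) / len(losses)

logits = [2.0, -1.0, 0.0]     # raw model outputs, before any sigmoid
targets = [1.0, 0.0, 1.0]     # 1.0 = pos, 0.0 = neg
loss = bce_with_logits(logits, targets)
print(loss)
```

Since the sigmoid is part of the loss, the model's fc layer outputs raw logits and applies no activation of its own.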
  • Define a function that computes the accuracy
def binary_acc(preds, y):

    preds = torch.round(torch.sigmoid(preds))
    correct = torch.eq(preds, y).float()
    acc = correct.sum() / len(correct)
    return acc
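
For intuition, the same accuracy calculation can be traced on four made-up logits in plain Python. Rounding the sigmoid output gives the same predictions as checking the sign of the logit, since sigmoid(x) > 0.5 exactly when x > 0. `binary_acc_plain` is a hypothetical name used only here.

```python
import math

def binary_acc_plain(logits, labels):
    """Fraction of examples whose thresholded sigmoid matches the label."""
    preds = [1.0 if 1.0 / (1.0 + math.exp(-x)) > 0.5 else 0.0 for x in logits]
    correct = [float(p == y) for p, y in zip(preds, labels)]
    return sum(correct) / len(correct)

logits = [2.3, -0.7, 0.4, -1.5]   # made-up model outputs
labels = [1.0, 0.0, 0.0, 0.0]

acc = binary_acc_plain(logits, labels)
by_sign = [1.0 if x > 0 else 0.0 for x in logits]
print(acc, by_sign)
```

Three of the four predictions match the labels here, so the accuracy is 0.75.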
  • Define a training function
def train(rnn, iterator, optimizer, criteon):
    avg_acc = []
    rnn.train()   # switch to training mode

    for i, batch in enumerate(iterator):
        # [seq, b] => [b, 1] => [b]
        # batch.text is the argument x of forward above; squeeze so the shape matches batch.label
        pred = rnn(batch.text).squeeze(1)

        loss = criteon(pred, batch.label)
        # accuracy of this batch
        acc = binary_acc(pred, batch.label).item()
        avg_acc.append(acc)

        optimizer.zero_grad()  # clear the old gradients
        loss.backward()        # backpropagate
        optimizer.step()       # update the parameters

        if i % 10 == 0:
            print(i, acc)

    avg_acc = np.array(avg_acc).mean()
    print('avg acc:', avg_acc)

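4. Evaluate the model

  • Define an evaluation function; it overlaps heavily with the training function.

  • The differences are that rnn.train() becomes rnn.eval() and no backpropagation is needed.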

def evaluate(rnn, iterator, criteon):
    avg_acc = []
    rnn.eval()         # switch to evaluation mode

    with torch.no_grad():
        for batch in iterator:
            pred = rnn(batch.text).squeeze(1)      # [b, 1] => [b]
            loss = criteon(pred, batch.label)
            acc = binary_acc(pred, batch.label).item()
            avg_acc.append(acc)

    avg_acc = np.array(avg_acc).mean()

    print('test acc:', avg_acc)
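
One detail of averaging per-batch accuracies: when the last batch is smaller than the rest, the plain mean of batch accuracies differs slightly from the accuracy over all samples. The plain-Python sketch below shows the gap on made-up counts; both helper names are hypothetical.

```python
def mean_of_batch_acc(batch_correct, batch_sizes):
    """Average of per-batch accuracies; every batch counts equally."""
    accs = [c / n for c, n in zip(batch_correct, batch_sizes)]
    return sum(accs) / len(accs)

def sample_weighted_acc(batch_correct, batch_sizes):
    """Accuracy over all samples; every sample counts equally."""
    return sum(batch_correct) / sum(batch_sizes)

batch_sizes = [30, 30, 10]      # made-up sizes, last batch is short
batch_correct = [27, 24, 5]     # made-up number of correct predictions

unweighted = mean_of_batch_acc(batch_correct, batch_sizes)
weighted = sample_weighted_acc(batch_correct, batch_sizes)
print(unweighted, weighted)
```

With 25000 test reviews and a batch size of 30 only the final batch is short, so the difference in this tutorial's numbers should be small.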

5. Run

for epoch in range(10):
    # train the model
    train(rnn, train_iterator, optimizer, criteon)
    # evaluate the model
    evaluate(rnn, test_iterator, criteon)
test acc: 0.8775779855051201
test acc: 0.8886890964542361
test acc: 0.8872902161068768
test acc: 0.890008042184569
test acc: 0.8848521674422624
test acc: 0.8718625588668621
test acc: 0.8822142779827118
test acc: 0.8769784666222634
test acc: 0.8815348212667506
test acc: 0.8754996503714463

6. Predict

  • The predicted output is the index of the label string ('pos': 1, 'neg': 0).
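for batch in test_iterator:
    # one prediction per example in the batch
    preds = rnn(batch.text).squeeze(1)
    # turn the logits into 0/1 class indices, the same conversion as in binary_acc
    preds = torch.round(torch.sigmoid(preds))
    # print(preds)

    i = 0
    # batch.text is [seq_len, b], so each row holds the tokens at one position across the batch
    for text in batch.text:
        for word in text:
            print(TEXT.vocab.itos[word], end=' ')
        print()
        # stop after the first 4 token positions
        if i == 3:
            break
        i = i + 1

    i = 0
    for pred in preds:
        idx = int(pred.item())
        print(idx, LABEL.vocab.itos[idx])
        # stop after the first 4 predictions
        if i == 3:
            break
        i = i + 1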
Anyone  Great A If  Without The Brilliant This  This If This Ten Absolutely For A This One Add a Just This I More What Brilliant Read  
who Classic story great you hires a  . movie it is you is minutes fantastic pure touching is of this mesmerizing love is hope suspenseful a and the  
gives Waters , film 've a doubt mixed  is with the like quite of !  movie a the little film the a this , script moving book  
this ! great in ever psychopath , with along terrible all greatest  possibly people Whatever vampire . good funniest gem that interplay great group more , performances , interpretation 
1 pos
1 pos
1 pos
1 pos
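
Because batch.text has shape [seq_len, b], iterating over it walks over token positions, which is why each printed row above mixes the words of different reviews. To read whole reviews, the batch has to be transposed first. The sketch below shows the idea with nested lists and a made-up vocabulary; on a real 2-D tensor, batch.text.t() performs the same transpose.

```python
# Made-up vocabulary and batch, only to illustrate the layout
itos = ['<unk>', '<pad>', 'this', 'movie', 'is', 'great', 'terrible']

# Shape [seq_len=4, b=2]: each row is one token position across the batch
batch_text = [
    [2, 2],
    [3, 4],
    [4, 6],
    [5, 1],
]

# Transpose to [b, seq_len] so that each row is one complete sentence
sentences = [list(col) for col in zip(*batch_text)]

# Map indices back to words and drop the padding
decoded = [' '.join(itos[i] for i in s if itos[i] != '<pad>') for s in sentences]
for line in decoded:
    print(line)
```

Each decoded line then lines up with one entry of preds.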
