基于深度学习的文本分类 3

基于深度学习的文本分类


Transformer

Transformer是一种完全基于Attention机制来加速深度学习训练过程的算法模型,其最大的优势在于其在并行化处理上做出的贡献。换句话说,Transformer就是一个带有self-attention机制的seq2seq 模型,即输入是一个sequence,输出也是一个sequence的模型。如下图所示:

self-attention的架构

假设有 x 1 、 x 2 、 x 3 、 x 4 x1、x2、x3、x4 x1x2x3x4四个序列,首先进行带权乘法 a i = W   x i ai = W \, xi ai=Wxi,得到新的序列 a 1 、 a 2 、 a 3 、 a 4 a1、a2、a3、a4 a1a2a3a4

然后将 a i ai ai分别乘以三个不同的权重矩阵得到 q i 、 k i 、 v i qi、ki、vi qikivi三个向量, q i = W q   a i qi=Wq \, ai qi=Wqai k i = W q   a i ki=Wq \, ai ki=Wqai v i = W v   a i vi=Wv \, ai vi=Wvai q q q表示的是query,需要match其他的向量, k k k表示的是key,是需要被 q q q来match的, v v v表示value,表示需要被抽取出来的信息。

接下来让每一个 q q q对每一个 k k k做attention操作。attention操作的目的是输入两个向量,输出一个数,这里使用scaled点积来实现。 q 1 q1 q1 k i ki ki做attention得到 α 1 , i {\alpha}_{1,i} α1,i

α 1 , i = q 1 ⋅ k i / d {\alpha}_{1,i} = q1 \cdot ki / \sqrt{d} α1,i=q1ki/d

其中 d d d表示 q q q v v v的维数。

最后将 α 1 , i {\alpha}_{1,i} α1,i带入softmax函数,写成矩阵形式即为:

A t t e n t i o n ( Q , K , V ) = s o f t m a x ( Q K T d V ) Attention(Q,K,V) = softmax \left( \frac{QK^T}{\sqrt{d}} V \right) Attention(Q,K,V)=softmax(d QKTV)

然后求出$b1 = \sum \hat{\alpha}_{1,i} \cdot vi $。整个过程如下图所示:

由上图可知,求 b 1 b1 b1的时候需要用到序列中的所有值 a 1 , a 2 , a 3 , a 4 a1,a2,a3,a4 a1,a2,a3,a4,但是对序列的每个部分的关注程度有所不同,通过改变 α ^ 1 , i \hat{\alpha}_{1,i} α^1,i前的权重 v i vi vi可以对序列的每一部分赋予不同的关注度,对重点关注的部分赋予较大的权重,不太关注的部分赋予较小的权重。
用同样的方法可以求出 b 2 , b 3 , b 4 b2,b3,b4 b2,b3,b4,而且 b 1 , b 2 , b 3 , b 4 b1,b2,b3,b4 b1,b2,b3,b4是平行地被计算出来的,相互之间没有影响。整个过程可以看作是一个self-attention layer,输入 x 1 , x 2 , x 3 , x 4 x1,x2,x3,x4 x1,x2,x3,x4,输出 b 1 , b 2 , b 3 , b 4 b1,b2,b3,b4 b1,b2,b3,b4.

Transformer所使用的注意力机制的核心思想是去计算一句话中的每个词对于这句话中所有词的相互关系,然后认为这些词与词之间的相互关系在一定程度上反应了这句话中不同词之间的关联性以及重要程度。因此再利用这些相互关系来调整每个词的重要性(权重)就可以获得每个词新的表达。这个新的表征不但蕴含了该词本身,还蕴含了其他词与这个词的关系,因此和单纯的词向量相比是一个更加全局的表达。使用了Attention机制,将序列中的任意两个位置之间的距离缩小为一个常量。

乘积部分的表示如下图:

整个transformer的结构如下图:

基于深度学习的文本分类 3_第1张图片

左边的部分是encoder,右边的部分是decoder。

encoder部分的步骤如下所示:

  • 对input进行embedding操作,将单词表示成长度为embedding size的向量

  • 对embedding之后的词向量进行positional encoding,即在生成 q 、 k 、 v q、k、v qkv的时候,给每一个 a i ai ai加上一个相同维数的向量 e i ei ei,如下图:

    e i ei ei表示了位置的信息,每一个位置对应不同的 e i ei ei,id为p的位置对应的位置向量的公式为:

    e 2 i = s i n ( p / 1000 0 2 i / d ) e_{2i} = sin \left( p/10000^{2i/d} \right) e2i=sin(p/100002i/d)

    e 2 i + 1 = c o s ( p / 1000 0 2 i / d ) e_{2i+1} = cos \left( p/10000^{2i/d} \right) e2i+1=cos(p/100002i/d)

    对于NLP中的任务来说,顺序是很重要的信息,它代表着局部甚至是全局的结构,学习不到顺序信息,那么效果将会大打折扣。通过结合位置向量和词向量,给每个词都引入了一定的位置信息,这样Attention就可以分辨出不同位置的词

  • 进行muti-head attention操作,同时生成多个 q 、 k 、 v q、k、v qkv分别进行attention(参数不同),然后把结果拼接起来

  • add & norm操作,将muti-head attention的input和output进行相加,然后进行layer normalization操作

    layer normalization(LN) 和batch normalization(BN) 的过程相反,BN表示在一个batch里面的数据相同的维度进行normalization,而LN表示在对每一个数据所有的维度进行normalization操作。假设数据的规模是10行3列,即batchsize = 10,每一行数据有三个特征,假设这三个特征是 [身高、体重、年龄]。那么BN是针对每一列(特征)进行缩放,例如算出[身高]的均值与方差,再对身高这一列的10个数据进行缩放。体重和年龄同理。这是一种“列缩放”。而layer方向相反,它针对的是每一行进行缩放。即只看一条记录,算出这条记录所有特征的均值与方差再缩放。这是一种“行缩放

  • 然后进行一个feed forwardadd & norm操作,feed forward会对每一个输入的序列进行操作,再进行一个add & norm操作

至此encoder的操作就完成了,接下来看decoder操作:

  • 进行positional encoding
  • masked muti-head attention操作,即对之前产生的序列进行attention
  • 进行add & norm操作
  • 将encoder的输出和上一轮add & norm操作的结构进行muti-head attention和add & norm操作
  • feed forward 和 add & norm,整个的过程可以重复N次
  • 经过线性层和softmax层得到最终输出

Bert

Bert(Bidirectional Encoder Representation form Transformers),即双向Transformer的Encoder,其中“双向”表示模型在处理某一个词时,它能同时利用前面的词和后面的词两部分信息。Bert的模型架构基于多层双向转换解码,通过执行一系列预训练,进而得到深层的上下文表示。Bert的基本思想和Word2Vec、CBOW一样,都是给定context,来预测下一个词。BERT的结构和ELMo相似都是双向结构。BERT模型结构如下图所示:

基于深度学习的文本分类 3_第2张图片

Bert的实现分为两个阶段:第一个阶段叫做:Pre-training,跟WordEmbedding类似,利用现有无标记的语料训练一个语言模型。第二个阶段叫做:Fine-tuning,利用预训练好的语言模型,完成具体的NLP下游任务。

Bert的输入:

Bert的输入包含三个部分:Token Embeddings、Segment Embeddings、Position Embeddings。这三个部分在整个过程中是可以学习的。

基于深度学习的文本分类 3_第3张图片
  • CLS:Classification Token,用来区分不同的类
  • SEP:Special Token,用来分隔两个不同的句子
  • Token Embedding:对输入的单词进行Embedding
  • Segment Embedding:标记输入的单词是属于句子A还是句子B
  • Position Embedding:标记每一个Token的位置

Pre-training:

Bert的预训练有两个任务:Masked Language Model(MLM)Next Sentence Predicition(NSP)。在训练Bert的时候,这两个任务是同时训练的,Bert的损失函数是把这两个任务的损失函数加起来的,是一个多任务训练。

Masked Language Model的灵感来源于完形填空,将15%的Tokens掩盖。被掩盖的15%的Tokens又分为三种情况:80%的字符用字符“MASK”替换,10%的字符用另外的字符替换;10%的字符是保持不动。然后模型尝试基于序列中其他未被掩盖的单词的上下文来预测被掩盖的原单词。最后在计算损失时,只计算被掩盖的15%的Tokens。

Next Sentence Prediction,即给出两个句子A和B,B有一半的可能性是A的下一句话,训练模型来预测B是不是A的下一句话。通过训练,使模型具备理解长序列上下文的联系的能力。

Fine-tuning:

  • 分类任务:输入端,可以是句子A和句子B也可以是单一句子,CLS后接softmax用于分类

  • 答案抽取,比如SQuAd v1.0,训练一个start和end向量分别为S,E,bert的每个输出向量和S或E计算dot product,之后对所有输出节点的点乘结果进行softmax得到该节点对应位置的起始概率或者或终止概率, 假设Bert的输出向量为 T T T,则用 S ⋅ T i + E ⋅ T j S·Ti + E·Tj STi+ETj表示从 i i i位置起始, j j j位置终止的概率,最大的概率对应 i i i j ( i < j ) j(ij(i<j)即为预测的answer span的起点终点,训练的目标是最大化正确的起始,终点的概率

  • SQuAD v2.0和SQuAD 1.1的区别在于可以有答案不存在给定文本中,因此增加CLS的节点输出为 C C C,当最大的分数对应 i , j i,j i,j所在的CLS的时候,即 S ⋅ T i + E ⋅ T j S·Ti + E·Tj STi+ETj的最大值小于 S ⋅ C + E ⋅ C S·C + E·C SC+EC时,不存在答案

代码实现

参考代码:Datawhale零基础入门NLP赛事 - Task6 基于深度学习的文本分类3-BERT

初始化:

# 初始化
import logging
import random
import numpy as np
import torch

logging.basicConfig(level=logging.INFO, format='%(asctime)-15s %(levelname)s: %(message)s')

# 设置随机种子
seed = 666
random.seed(seed)
np.random.seed(seed)
torch.cuda.manual_seed(seed)
torch.manual_seed(seed)

# 设cuda
gpu = 0
use_cuda = gpu >= 0 and torch.cuda.is_available() # 判断是否可以使用GPU
if use_cuda:
	torch.cuda.set_device(gpu)
	device = torch.device("cuda", gpu)
else:
	device = torch.device("cpu")
logging.info("Use cuda: %s, gpu id: %d.", use_cuda, gpu)

读取数据并划分验证集:

# 按照label分层采样,把数据分成10折
fold_num = 10
data_file = './data/train.pkl'
import pandas as pd

def all_data2fold(fold_num, num=10000): 
# 读取10000条数据,分成10折
	fold_data = []
	f = pd.read_pickle(data_file)
	texts = f['text'].tolist()[:num] # 文本列表
	labels = f['label'].tolist()[:num] # 标签列表

	total = len(labels) # 标签总数

	index = list(range(total))
	np.random.shuffle(index)

	all_texts = []
	all_labels = []
	for i in index:
    	all_texts.append(texts[i]) # 打乱顺序后的文本和标签
    	all_labels.append(labels[i])

	label2id = {} 
	# 字典 key:label  value:id
	# {'0': [3, 12, 13, 15, 18, 24, 25],'1':[0, 8, 14, 20, 29,]...}
	for i in range(total):
    	label = str(all_labels[i])
    	if label not in label2id:
        	label2id[label] = [i]
    	else:
        	label2id[label].append(i)

	all_index = [[] for _ in range(fold_num)] # fold_num个元素为空列表的列表
	batch_size = int(total / fold_num)
	for label, data in label2id.items(): # 分层采样
    	batch_size = int(len(data) / fold_num) # 把每个label分成fold_num个batch
    	other = len(data) - batch_size * fold_num # 均分成fold_num个batch后剩下的,0 <= other < fold_num
    	for i in range(fold_num):
        	cur_batch_size = batch_size + 1 if i < other else batch_size # 保证cur_batch_size加起来等于len(data)
        	batch_data = [data[i * batch_size + b] for b in range(cur_batch_size)]
        	all_index[i].extend(batch_data) # [ [长度为cur_batch_size,元素为标签的id],[],...,[] ]

	batch_size = int(total / fold_num)
	other_texts = []
	other_labels = []
	other_num = 0
	start = 0
	for fold in range(fold_num): # 把分层采样后的数据拼起来
    	num = len(all_index[fold])
    	texts = [all_texts[i] for i in all_index[fold]]
    	labels = [all_labels[i] for i in all_index[fold]]

    	if num > batch_size:
    	    fold_texts = texts[:batch_size]
    	    other_texts.extend(texts[batch_size:])
    	    fold_labels = labels[:batch_size]
    	    other_labels.extend(labels[batch_size:])
    	    other_num += num - batch_size
    	elif num < batch_size:
    	    end = start + batch_size - num
    	    fold_texts = texts + other_texts[start: end]
    	    fold_labels = labels + other_labels[start: end]
    	    start = end
    	else:
    	    fold_texts = texts
    	    fold_labels = labels

    	assert batch_size == len(fold_labels)

    	# 在每个batch内再次打乱顺序
    	index = list(range(batch_size))
    	np.random.shuffle(index)

    	shuffle_fold_texts = []
    	shuffle_fold_labels = []
    	for i in index:
        	shuffle_fold_texts.append(fold_texts[i])
        	shuffle_fold_labels.append(fold_labels[i])

    	data = {'label': shuffle_fold_labels, 'text': shuffle_fold_texts}
    	fold_data.append(data)

	logging.info("Fold lens %s", str([len(data['label']) for data in fold_data]))

	return fold_data

fold_data = all_data2fold(10)

建立训练集、验证集以及测试集:

# 建立训练集、验证集以及测试集

fold_id = 9 # 验证集的id

# 验证集
dev_data = fold_data[fold_id]

# 训练集
train_texts = []
train_labels = []
for i in range(0, fold_id):
	data = fold_data[i]
	train_texts.extend(data['text'])
	train_labels.extend(data['label'])

train_data = {'label': train_labels, 'text': train_texts}

# 测试集
num=10000
data_file = './data/train.pkl'
f = pd.read_pickle(data_file)
texts = f['text'].tolist()[num:12000]
true_label = f['label'].tolist()[num:12000]
test_data = {'label': true_label, 'text': texts}

建立词典:

# 建立词典

from collections import Counter
from transformers import BasicTokenizer

# BasicTokenizer的主要是进行unicode转换、标点符号分割、小写转换、中文字符分割、去除重音符号等操作,最后返回的是关于词的数组
basic_tokenizer = BasicTokenizer() 

class Vocab():
	def __init__(self, train_data):
    	self.min_count = 5 # 最小词频
    	self.pad = 0
    	self.unk = 1
    	self._id2word = ['[PAD]', '[UNK]']
    	self._id2extword = ['[PAD]', '[UNK]']

    	self._id2label = []
    	self.target_names = []

    	self.build_vocab(train_data)

    	reverse = lambda x: dict(zip(x, range(len(x))))
    	self._word2id = reverse(self._id2word)
    	self._label2id = reverse(self._id2label)

    	logging.info("Build vocab: words %d, labels %d." % (self.word_size, self.label_size))

	def build_vocab(self, data):
    	self.word_counter = Counter()

    	for text in data['text']:
        	words = text.split() # 以空格为分隔符读取文本
        	for word in words:
            	self.word_counter[word] += 1 # 统计每个单词出现的次数

    	for word, count in self.word_counter.most_common():
        	if count >= self.min_count:
            	self._id2word.append(word)

    	label2name = {0: '科技', 1: '股票', 2: '体育', 3: '娱乐', 4: '时政', 5: '社会', 6: '教育', 7: '财经',
                  	8: '家居', 9: '游戏', 10: '房产', 11: '时尚', 12: '彩票', 13: '星座'}

    	self.label_counter = Counter(data['label']) # 统计每个标签出现的次数

    	for label in range(len(self.label_counter)):
        	count = self.label_counter[label]
        	self._id2label.append(label)
        	self.target_names.append(label2name[label]) # 把数字标签换成文字标签

	def load_pretrained_embs(self, embfile): # 加载预训练模型
    	with open(embfile, encoding='utf-8') as f:
    	    lines = f.readlines()
    	    items = lines[0].split()
    	    word_count, embedding_dim = int(items[0]), int(items[1])

    	index = len(self._id2extword)
    	embeddings = np.zeros((word_count + index, embedding_dim))
    	for line in lines[1:]:
    	    values = line.split()
    	    self._id2extword.append(values[0])
    	    vector = np.array(values[1:], dtype='float64')
    	    embeddings[self.unk] += vector
    	    embeddings[index] = vector
    	    index += 1

    	embeddings[self.unk] = embeddings[self.unk] / word_count
    	embeddings = embeddings / np.std(embeddings)

    	reverse = lambda x: dict(zip(x, range(len(x))))
    	self._extword2id = reverse(self._id2extword)

    	assert len(set(self._id2extword)) == len(self._id2extword)

    	return embeddings

	def word2id(self, xs): # 返回单词在词典中的索引 eg:vocab.word2id('3750')
    	if isinstance(xs, list):
    	    return [self._word2id.get(x, self.unk) for x in xs]
    	return self._word2id.get(xs, self.unk)

	def extword2id(self, xs):
    	if isinstance(xs, list):
    	    return [self._extword2id.get(x, self.unk) for x in xs]
    	return self._extword2id.get(xs, self.unk)

	def label2id(self, xs):
    	if isinstance(xs, list):
    	    return [self._label2id.get(x, self.unk) for x in xs]
    	return self._label2id.get(xs, self.unk)

	@property
	def word_size(self):
    	return len(self._id2word)

	@property
	def extword_size(self):
    	return len(self._id2extword)

	@property
	def label_size(self):
    	return len(self._id2label)


vocab = Vocab(train_data)

建立各个模块:

# 建立各个模块

import torch.nn as nn
import torch.nn.functional as F

# Attention部分的实现
class Attention(nn.Module):
	def __init__(self, hidden_size):
    	super(Attention, self).__init__()
    	self.weight = nn.Parameter(torch.Tensor(hidden_size, hidden_size))
    	self.weight.data.normal_(mean=0.0, std=0.05)

    	self.bias = nn.Parameter(torch.Tensor(hidden_size))
    	b = np.zeros(hidden_size, dtype=np.float32)
    	self.bias.data.copy_(torch.from_numpy(b))

    	self.query = nn.Parameter(torch.Tensor(hidden_size))
    	self.query.data.normal_(mean=0.0, std=0.05)

	def forward(self, batch_hidden, batch_masks):
    	# batch_hidden: b x len x hidden_size (2 * hidden_size of lstm)
    	# batch_masks:  b x len

    	# linear
    	key = torch.matmul(batch_hidden, self.weight) + self.bias  # b x len x hidden

    	# compute attention
    	outputs = torch.matmul(key, self.query)  # b x len

    	masked_outputs = outputs.masked_fill((1 - batch_masks).bool(), float(-1e32))

    	attn_scores = F.softmax(masked_outputs, dim=1)  # b x len

    	# 对于全零向量,-1e32的结果为 1/len, -inf为nan, 额外补0
    	masked_attn_scores = attn_scores.masked_fill((1 - batch_masks).bool(), 0.0)

    	# sum weighted sources
    	batch_outputs = torch.bmm(masked_attn_scores.unsqueeze(1), key).squeeze(1)  # b x hidden

    	return batch_outputs, attn_scores


# word encoder的实现
bert_path = './data/'
dropout = 0.15

from transformers import BertModel


class WordBertEncoder(nn.Module):
	def __init__(self):
    	super(WordBertEncoder, self).__init__()
    	self.dropout = nn.Dropout(dropout)

    	self.tokenizer = WhitespaceTokenizer()
    	self.bert = BertModel.from_pretrained(bert_path)

    	self.pooled = False
    	logging.info('Build Bert encoder with pooled {}.'.format(self.pooled))

	def encode(self, tokens):
    	tokens = self.tokenizer.tokenize(tokens)
    	return tokens

	def get_bert_parameters(self):
    	no_decay = ['bias', 'LayerNorm.weight']
    	optimizer_parameters = [
    	    {'params': [p for n, p in self.bert.named_parameters() if not any(nd in n for nd in no_decay)],
         	'weight_decay': 0.01},
        	{'params': [p for n, p in self.bert.named_parameters() if any(nd in n for nd in no_decay)],
         	'weight_decay': 0.0}
    	]
    	return optimizer_parameters

	def forward(self, input_ids, token_type_ids):
    	# input_ids: sen_num x bert_len
    	# token_type_ids: sen_num  x bert_len

    	# sen_num x bert_len x 256, sen_num x 256
    	sequence_output, pooled_output = self.bert(input_ids=input_ids, token_type_ids=token_type_ids)

    	if self.pooled:
    	    reps = pooled_output
    	else:
    	    reps = sequence_output[:, 0, :]  # sen_num x 256

    	if self.training:
    	    reps = self.dropout(reps)

    	return reps


class WhitespaceTokenizer():
	"""WhitespaceTokenizer with vocab."""

	def __init__(self):
	    vocab_file = bert_path + 'vocab.txt'
	    self._token2id = self.load_vocab(vocab_file)
	    self._id2token = {v: k for k, v in self._token2id.items()}
	    self.max_len = 256
	    self.unk = 1

	    logging.info("Build Bert vocab with size %d." % (self.vocab_size))

	def load_vocab(self, vocab_file):
	    f = open(vocab_file, 'r')
	    lines = f.readlines()
	    lines = list(map(lambda x: x.strip(), lines))
	    vocab = dict(zip(lines, range(len(lines))))
	    return vocab

	def tokenize(self, tokens):
	    assert len(tokens) <= self.max_len - 2
	    tokens = ["[CLS]"] + tokens + ["[SEP]"]
	    output_tokens = self.token2id(tokens)
	    return output_tokens

	def token2id(self, xs):
	    if isinstance(xs, list):
	        return [self._token2id.get(x, self.unk) for x in xs]
	    return self._token2id.get(xs, self.unk)

	@property
	def vocab_size(self):
	    return len(self._id2token)


# build sent encoder
sent_hidden_size = 256
sent_num_layers = 2


class SentEncoder(nn.Module):
	def __init__(self, sent_rep_size):
	    super(SentEncoder, self).__init__()
	    self.dropout = nn.Dropout(dropout)

	    self.sent_lstm = nn.LSTM(
	        input_size=sent_rep_size,
	        hidden_size=sent_hidden_size,
	        num_layers=sent_num_layers,
	        batch_first=True,
	        bidirectional=True
	    )

	def forward(self, sent_reps, sent_masks):
	    # sent_reps:  b x doc_len x sent_rep_size
	    # sent_masks: b x doc_len

	    sent_hiddens, _ = self.sent_lstm(sent_reps)  # b x doc_len x hidden*2
	    sent_hiddens = sent_hiddens * sent_masks.unsqueeze(2)

	    if self.training:
	        sent_hiddens = self.dropout(sent_hiddens)

	    return sent_hiddens

建立优化器:

# 建立优化器
learning_rate = 2e-4
bert_lr = 5e-5
decay = .75
decay_step = 1000
from transformers import AdamW, get_linear_schedule_with_warmup


class Optimizer:
	def __init__(self, model_parameters, steps):
    	self.all_params = []
    	self.optims = []
    	self.schedulers = []

    	for name, parameters in model_parameters.items():
        	if name.startswith("basic"):
        	    optim = torch.optim.Adam(parameters, lr=learning_rate)
        	    self.optims.append(optim)

        	    l = lambda step: decay ** (step // decay_step)
        	    scheduler = torch.optim.lr_scheduler.LambdaLR(optim, lr_lambda=l)
        	    self.schedulers.append(scheduler)
        	    self.all_params.extend(parameters)
        	elif name.startswith("bert"):
        	    optim_bert = AdamW(parameters, bert_lr, eps=1e-8)
        	    self.optims.append(optim_bert)

        	    scheduler_bert = get_linear_schedule_with_warmup(optim_bert, 0, steps)
        	    self.schedulers.append(scheduler_bert)

        	    for group in parameters:
        	        for p in group['params']:
        	            self.all_params.append(p)
        	else:
        	    Exception("no nameed parameters.")

    	self.num = len(self.optims)

	def step(self):
    	for optim, scheduler in zip(self.optims, self.schedulers):
    	    optim.step()
    	    scheduler.step()
    	    optim.zero_grad()

	def zero_grad(self):
    	for optim in self.optims:
    	    optim.zero_grad()

	def get_lr(self):
    	lrs = tuple(map(lambda x: x.get_lr()[-1], self.schedulers))
    	lr = ' %.5f' * self.num
    	res = lr % lrs
    	return res

建立数据库:

# 建立数据库
def sentence_split(text, vocab, max_sent_len=256, max_segment=16):
	words = text.strip().split()
	document_len = len(words)

	index = list(range(0, document_len, max_sent_len))
	index.append(document_len)

	segments = []
	for i in range(len(index) - 1):
	    segment = words[index[i]: index[i + 1]]
	    assert len(segment) > 0
	    segment = [word if word in vocab._id2word else '' for word in segment]
	    segments.append([len(segment), segment])

	assert len(segments) > 0
	if len(segments) > max_segment:
	    segment_ = int(max_segment / 2)
	    return segments[:segment_] + segments[-segment_:]
	else:
	    return segments


def get_examples(data, word_encoder, vocab, max_sent_len=256, max_segment=8):
	label2id = vocab.label2id
	examples = []

	for text, label in zip(data['text'], data['label']):
	    # label
	    id = label2id(label)

	    # words
	    sents_words = sentence_split(text, vocab, max_sent_len-2, max_segment)
	    doc = []
	    for sent_len, sent_words in sents_words:
	        token_ids = word_encoder.encode(sent_words)
	        sent_len = len(token_ids)
	        token_type_ids = [0] * sent_len
	        doc.append([sent_len, token_ids, token_type_ids])
	    examples.append([id, len(doc), doc])

	logging.info('Total %d docs.' % len(examples))
	return examples

划分Batch:

# Batch划分

def batch_slice(data, batch_size):
	batch_num = int(np.ceil(len(data) / float(batch_size)))
	for i in range(batch_num):
	    cur_batch_size = batch_size if i < batch_num - 1 else len(data) - batch_size * i
	    docs = [data[i * batch_size + b] for b in range(cur_batch_size)]

	    yield docs


def data_iter(data, batch_size, shuffle=True, noise=1.0):
	"""
	randomly permute data, then sort by source length, and partition into batches
	ensure that the length of  sentences in each batch
	"""

	batched_data = []
	if shuffle:
	    np.random.shuffle(data)

	    lengths = [example[1] for example in data]
	    noisy_lengths = [- (l + np.random.uniform(- noise, noise)) for l in lengths]
	    sorted_indices = np.argsort(noisy_lengths).tolist()
	    sorted_data = [data[i] for i in sorted_indices]
	else:
	    sorted_data =data
    
	batched_data.extend(list(batch_slice(sorted_data, batch_size)))

	if shuffle:
	    np.random.shuffle(batched_data)

	for batch in batched_data:
	    yield batch

其他用到的函数:

# 其他的一些函数
from sklearn.metrics import f1_score, precision_score, recall_score

# 获得评价函数
def get_score(y_ture, y_pred):
	y_ture = np.array(y_ture)
	y_pred = np.array(y_pred)
	f1 = f1_score(y_ture, y_pred, average='macro') * 100
	p = precision_score(y_ture, y_pred, average='macro') * 100
	r = recall_score(y_ture, y_pred, average='macro') * 100

	return str((reformat(p, 2), reformat(r, 2), reformat(f1, 2))), reformat(f1, 2)

def reformat(num, n):
	return float(format(num, '0.' + str(n) + 'f'))

训练函数:

# 建立训练器

import time
from sklearn.metrics import classification_report

clip = 5.0
epochs = 1
early_stops = 3
log_interval = 50

test_batch_size = 16
train_batch_size = 16

save_model = './bert.bin' # 保存训练模型
save_test = './bert.csv' # 保存训练结果(预测的label)

class Trainer():
	def __init__(self, model, vocab):
	    self.model = model
	    self.report = True
    
	    self.train_data = get_examples(train_data, model.word_encoder, vocab)
	    self.batch_num = int(np.ceil(len(self.train_data) / float(train_batch_size)))
	    self.dev_data = get_examples(dev_data, model.word_encoder, vocab)
	    self.test_data = get_examples(test_data, model.word_encoder, vocab)

	    # criterion
	    self.criterion = nn.CrossEntropyLoss()

	    # label name
	    self.target_names = vocab.target_names

	    # optimizer
	    self.optimizer = Optimizer(model.all_parameters, steps=self.batch_num * epochs)

	    # count
	    self.step = 0
	    self.early_stop = -1
	    self.best_train_f1, self.best_dev_f1 = 0, 0
	    self.last_epoch = epochs

	def train(self):
	    logging.info('Start training...')
	    for epoch in range(1, epochs + 1):
	        train_f1 = self._train(epoch)

	        dev_f1 = self._eval(epoch)

	        if self.best_dev_f1 <= dev_f1:
	            logging.info(
	                "Exceed history dev = %.2f, current dev = %.2f" % (self.best_dev_f1, dev_f1))
	            torch.save(self.model.state_dict(), save_model)

	            self.best_train_f1 = train_f1
	            self.best_dev_f1 = dev_f1
	            self.early_stop = 0
	        else:
	            self.early_stop += 1
	            if self.early_stop == early_stops:
	                logging.info(
	                    "Eearly stop in epoch %d, best train: %.2f, dev: %.2f" % (
	                        epoch - early_stops, self.best_train_f1, self.best_dev_f1))
	                self.last_epoch = epoch
	                break
	def test(self):
	    self.model.load_state_dict(torch.load(save_model))
	    self._eval(self.last_epoch + 1, test=True)

	def _train(self, epoch):
	    self.optimizer.zero_grad()
	    self.model.train()

	    start_time = time.time()
	    epoch_start_time = time.time()
	    overall_losses = 0
	    losses = 0
	    batch_idx = 1
	    y_pred = []
	    y_true = []
	    for batch_data in data_iter(self.train_data, train_batch_size, shuffle=True):
	        torch.cuda.empty_cache()
	        batch_inputs, batch_labels = self.batch2tensor(batch_data)
	        batch_outputs = self.model(batch_inputs)
	        loss = self.criterion(batch_outputs, batch_labels)
	        loss.backward()

	        loss_value = loss.detach().cpu().item()
	        losses += loss_value
	        overall_losses += loss_value

	        y_pred.extend(torch.max(batch_outputs, dim=1)[1].cpu().numpy().tolist())
	        y_true.extend(batch_labels.cpu().numpy().tolist())

	        nn.utils.clip_grad_norm_(self.optimizer.all_params, max_norm=clip)
	        for optimizer, scheduler in zip(self.optimizer.optims, self.optimizer.schedulers):
	            optimizer.step()
	            scheduler.step()
	        self.optimizer.zero_grad()

	        self.step += 1

	        if batch_idx % log_interval == 0:
	            elapsed = time.time() - start_time

	            lrs = self.optimizer.get_lr()
	            logging.info(
	                '| epoch {:3d} | step {:3d} | batch {:3d}/{:3d} | lr{} | loss {:.4f} | s/batch {:.2f}'.format(
	                    epoch, self.step, batch_idx, self.batch_num, lrs,
	                    losses / log_interval,
	                    elapsed / log_interval))

	            losses = 0
	            start_time = time.time()

	        batch_idx += 1

	    overall_losses /= self.batch_num
	    during_time = time.time() - epoch_start_time

	    # reformat
	    overall_losses = reformat(overall_losses, 4)
	    score, f1 = get_score(y_true, y_pred)

	    logging.info(
	        '| epoch {:3d} | score {} | f1 {} | loss {:.4f} | time {:.2f}'.format(epoch, score, f1,
                                                                              	overall_losses,
                                                                              	during_time))
	    if set(y_true) == set(y_pred) and self.report:
	        report = classification_report(y_true, y_pred, digits=4, target_names=self.target_names)
	        logging.info('\n' + report)

	    return f1

	def _eval(self, epoch, test=False):
	    self.model.eval()
	    start_time = time.time()
	    data = self.test_data if test else self.dev_data
	    y_pred = []
	    y_true = []
	    with torch.no_grad():
	        for batch_data in data_iter(data, test_batch_size, shuffle=False):
	            torch.cuda.empty_cache()
	            batch_inputs, batch_labels = self.batch2tensor(batch_data)
	            batch_outputs = self.model(batch_inputs)
	            y_pred.extend(torch.max(batch_outputs, dim=1)[1].cpu().numpy().tolist())
	            y_true.extend(batch_labels.cpu().numpy().tolist())

	        score, f1 = get_score(y_true, y_pred)

	        during_time = time.time() - start_time
	        
	        if test:
	            df = pd.DataFrame({'label': y_pred})
	            df.to_csv(save_test, index=False, sep=',')
	        else:
	            logging.info(
	                '| epoch {:3d} | dev | score {} | f1 {} | time {:.2f}'.format(epoch, score, f1,
                                                                          	during_time))
            	if set(y_true) == set(y_pred) and self.report:
                	report = classification_report(y_true, y_pred, digits=4, target_names=self.target_names)
                	logging.info('\n' + report)

    	return f1

	def batch2tensor(self, batch_data):
    	'''
    	    [[label, doc_len, [[sent_len, [sent_id0, ...], [sent_id1, ...]], ...]]
    	'''
    	batch_size = len(batch_data)
    	doc_labels = []
    	doc_lens = []
    	doc_max_sent_len = []
    	for doc_data in batch_data:
    	    doc_labels.append(doc_data[0])
    	    doc_lens.append(doc_data[1])
    	    sent_lens = [sent_data[0] for sent_data in doc_data[2]]
    	    max_sent_len = max(sent_lens)
    	    doc_max_sent_len.append(max_sent_len)

    	max_doc_len = max(doc_lens)
    	max_sent_len = max(doc_max_sent_len)

    	batch_inputs1 = torch.zeros((batch_size, max_doc_len, max_sent_len), dtype=torch.int64)
    	batch_inputs2 = torch.zeros((batch_size, max_doc_len, max_sent_len), dtype=torch.int64)
    	batch_masks = torch.zeros((batch_size, max_doc_len, max_sent_len), dtype=torch.float32)
    	batch_labels = torch.LongTensor(doc_labels)

    	for b in range(batch_size):
    	    for sent_idx in range(doc_lens[b]):
    	        sent_data = batch_data[b][2][sent_idx]
    	        for word_idx in range(sent_data[0]):
    	            batch_inputs1[b, sent_idx, word_idx] = sent_data[1][word_idx]
    	            batch_inputs2[b, sent_idx, word_idx] = sent_data[2][word_idx]
    	            batch_masks[b, sent_idx, word_idx] = 1

    	if use_cuda:
    	    batch_inputs1 = batch_inputs1.to(device)
    	    batch_inputs2 = batch_inputs2.to(device)
    	    batch_masks = batch_masks.to(device)
    	    batch_labels = batch_labels.to(device)

    	return (batch_inputs1, batch_inputs2, batch_masks), batch_labels

进行训练:

# 进行训练
trainer = Trainer(model, vocab)
trainer.train()

运行结果如下:

基于深度学习的文本分类 3_第4张图片

参考文章:

Transformer详解
一文读懂「Attention is All You Need」| 附代码实现

你可能感兴趣的:(NLP学习,自然语言处理,pytorch,tensorflow,深度学习,机器学习)