While learning how to do multi-class classification with huggingface, I used the dataset from the Kaggle competition Feedback Prize - Predicting Effective Arguments.
Feedback Prize - Predicting Effective Arguments/Dataset
The goal of the competition is to classify argumentative elements in student writing as "effective", "adequate", or "ineffective".
The dataset provided contains argumentative essays written by U.S. students in grades 6-12. The essays were annotated by expert raters for the discourse elements commonly found in argumentative writing:
Lead - an introduction that begins with a statistic, a quotation, a description, or some other device
Position - an opinion or conclusion on the main question
Claim - a claim that supports the position
Counterclaim - a claim that refutes another claim or gives an opposing reason to the position
Rebuttal - a claim that refutes a counterclaim
Evidence - ideas or examples that support claims, counterclaims, or rebuttals
Concluding Statement - a concluding statement that restates the claims
The task for competitors is to predict the quality rating of each discourse element. Human readers rated each rhetorical or argumentative element, in order of increasing quality, as one of: Ineffective, Adequate, Effective.
Here we mainly use the train.csv and test.csv files from this dataset; their contents look like this:
train.csv ...
discourse_id essay_id discourse_text discourse_type discourse_effectiveness
0 0013cc385424 007ACE74B050 Hi, i'm Isaac, i'm going to be writing about h... Lead Adequate
1 9704a709b505 007ACE74B050 On my perspective, I think that the face is a ... Position Adequate
2 c22adee811b6 007ACE74B050 I think that the face is a natural landform be... Claim Adequate
3 a10d361e54e4 007ACE74B050 If life was on Mars, we would know by now. The... Evidence Adequate
4 db3e453ec4e2 007ACE74B050 People thought that the face was formed by ali... Counterclaim Adequate
test.csv ...
discourse_id essay_id discourse_text discourse_type
0 a261b6e14276 D72CB1C11673 Making choices in life can be very difficult. ... Lead
1 5a88900e7dc1 D72CB1C11673 Seeking multiple opinions can help a person ma... Position
2 9790d835736b D72CB1C11673 it can decrease stress levels Claim
3 75ce6d68b67b D72CB1C11673 a great chance to learn something new Claim
4 93578d946723 D72CB1C11673 can be very helpful and beneficial. Claim
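The previews above are just the first five rows of each file. For reference, a minimal way to reproduce them with pandas (the paths are placeholders for wherever the competition data is downloaded):
import pandas as pd
train_df = pd.read_csv('train.csv')   # placeholder path
test_df = pd.read_csv('test.csv')     # placeholder path
print(train_df.head())   # discourse_id, essay_id, discourse_text, discourse_type, discourse_effectiveness
print(test_df.head())    # same columns minus discourse_effectiveness
print(train_df['discourse_effectiveness'].value_counts())  # how the three labels are distributed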
Of course, this competition is not something a single pretrained model can solve on its own; we are simply borrowing the dataset for a quick experiment with BERT-style multi-class classification.
One approach is to fine-tune BERT: take the [CLS] representation from the encoder output (the classification token) and map it through an MLP layer. For this task we want a three-way classification, so each input sentence should end up labelled as Ineffective, Adequate, or Effective.
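Before the full script, here is a minimal sketch of that idea: pull the [CLS] vector out of last_hidden_state and project it to three classes (the example sentence is borrowed from test.csv, and 768 is the hidden size of the base model):
from transformers import AutoModel, AutoTokenizer
import torch
checkpoint = 'microsoft/deberta-base'
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
encoder = AutoModel.from_pretrained(checkpoint)
enc = tokenizer("Seeking multiple opinions can help a person make a better choice.", return_tensors="pt")
with torch.no_grad():
    out = encoder(**enc)
cls_vec = out.last_hidden_state[:, 0]  # [CLS] vector, shape (1, 768)
head = torch.nn.Linear(768, 3)         # untrained 3-way head; fine-tuning it is what the script below does
print(head(cls_vec).softmax(dim=1))    # three class probabilities (random until trained)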
Now for the full code. The pretrained model I use is microsoft/deberta-base; since the point here is not to compare DeBERTa with BERT, just treat it as if it were BERT.
from transformers import AutoConfig, AutoModel, AutoTokenizer
import torch
import time
from transformers import get_cosine_schedule_with_warmup
from d2l import torch as d2l
import pandas as pd
# Downstream model: pretrained encoder plus a linear head on the [CLS] vector
class Model(torch.nn.Module):
    def __init__(self, checkpoint, config):
        super().__init__()
        self.pretrained = AutoModel.from_pretrained(checkpoint, config=config)
        self.fc = torch.nn.Sequential(torch.nn.Linear(768, 3))
    def forward(self, input_ids, attention_mask, token_type_ids):
        out = self.pretrained(input_ids=input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids)
        logits = out.last_hidden_state[:, 0]  # [CLS] representation
        logits = self.fc(logits)
        # note: train() feeds these probabilities to CrossEntropyLoss, which applies log-softmax again;
        # this still trains, but passing raw logits to the loss and softmaxing only at inference is more standard
        logits = logits.softmax(dim=1)
        return logits
# Training dataset: pre-tokenized tensors plus the effectiveness labels
class myDataset(torch.utils.data.Dataset):
    def __init__(self, sentences, attention_mask, token_type_ids, label):
        super(myDataset, self).__init__()
        self.sentences = torch.tensor(sentences)
        self.attention_mask = torch.tensor(attention_mask)
        self.token_type_ids = torch.tensor(token_type_ids)
        self.label = torch.tensor(label)
    def __len__(self):
        return self.sentences.shape[0]
    def __getitem__(self, idx):
        return self.sentences[idx], self.attention_mask[idx], self.token_type_ids[idx], self.label[idx]
# Test dataset: same fields, but no labels
class testDataset(torch.utils.data.Dataset):
    def __init__(self, sentences, attention_mask, token_type_ids):
        super(testDataset, self).__init__()
        self.sentences = torch.tensor(sentences)
        self.attention_mask = torch.tensor(attention_mask)
        self.token_type_ids = torch.tensor(token_type_ids)
    def __len__(self):
        return self.sentences.shape[0]
    def __getitem__(self, idx):
        return self.sentences[idx], self.attention_mask[idx], self.token_type_ids[idx]
# Read the csv and tokenize every discourse_text to a fixed length of 512
def load_data(file_path, tokenizer):
    df = pd.read_csv(file_path)
    sentences = df['discourse_text'].tolist()
    label_effectiveness = df['discourse_effectiveness'].replace({'Adequate': 0, 'Effective': 1, 'Ineffective': 2}).tolist()
    token_type_ids, attention_mask, input_ids = [], [], []
    for sentence in sentences:
        encode_dict = tokenizer.encode_plus(sentence, max_length=512, padding="max_length", truncation=True)
        input_ids.append(encode_dict["input_ids"])
        token_type_ids.append(encode_dict["token_type_ids"])
        attention_mask.append(encode_dict["attention_mask"])
    return input_ids, label_effectiveness, token_type_ids, attention_mask
# Training loop: CrossEntropyLoss + AdamW with a cosine schedule and warmup
def train(net, train_iter, lr, weight_decay, num_epochs, devices):
    total_time = 0
    train_len = len(Inputid_train)  # number of training samples (loaded at module level below)
    train_loss, train_acc = [], []
    net = torch.nn.DataParallel(net.to(devices[0]))
    loss = torch.nn.CrossEntropyLoss()
    optimizer = torch.optim.AdamW(net.parameters(), lr=lr, weight_decay=weight_decay)
    schedule = get_cosine_schedule_with_warmup(
        optimizer, num_warmup_steps=len(train_iter), num_training_steps=num_epochs * len(train_iter)
    )
    for epoch in range(num_epochs):
        start_of_epoch = time.time()
        cor = 0
        loss_sum = 0
        net.train()
        for idx, (ids, att_mask, type, y) in enumerate(train_iter):
            optimizer.zero_grad()
            ids, att_mask, type, y = ids.to(devices[0]), att_mask.to(devices[0]), type.to(devices[0]), y.to(devices[0])
            out_train = net(ids, att_mask, type)
            l = loss(out_train, y)
            l.backward()
            optimizer.step()
            schedule.step()
            loss_sum += l.item()
            if (idx + 1) % 20 == 0:
                print("Epoch {:04d} | Step {:06d}/{:06d} | Loss {:.4f} | Time {:.0f}".format(
                    epoch + 1, idx + 1, len(train_iter), loss_sum / (idx + 1), time.time() - start_of_epoch))
            out_train = out_train.argmax(dim=1)
            cor += (out_train == y).sum()
        cor = float(cor)
        acc = float(cor / train_len)
        print(acc)
        if epoch % 1 == 0:
            print(f'epoch {epoch + 1}, train_loss {loss_sum / (len(train_iter))}, train_acc {acc}')
            train_loss.append(loss_sum / len(train_iter))
            train_acc.append(acc)
        end_of_epoch = time.time()
        print("epoch {} duration:".format(epoch + 1), end_of_epoch - start_of_epoch)
        total_time += end_of_epoch - start_of_epoch
    print("total training time: ", total_time)
# Inference on test.csv: returns class probabilities in the original row order
def eval(test_path, net, devices, test_batch_size):
    df = pd.read_csv(test_path)
    sentences = df['discourse_text'].tolist()
    token_type_ids, attention_mask, input_ids = [], [], []
    for sentence in sentences:
        encode_dict = tokenizer.encode_plus(sentence, max_length=512, padding="max_length", truncation=True)
        input_ids.append(encode_dict["input_ids"])
        token_type_ids.append(encode_dict["token_type_ids"])
        attention_mask.append(encode_dict["attention_mask"])
    # shuffle must stay off here, otherwise the predictions no longer line up with the submission rows
    test_iter = torch.utils.data.DataLoader(testDataset(input_ids, attention_mask, token_type_ids), test_batch_size, shuffle=False)
    net.eval()
    outputs = []
    with torch.no_grad():
        for ids, att, tpe in test_iter:
            ids, att, tpe = ids.to(devices[0]), att.to(devices[0]), tpe.to(devices[0])
            outputs.append(net(ids, att, tpe))
    return torch.cat(outputs, dim=0)
checkpoint = 'microsoft/deberta-base'
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
config = AutoConfig.from_pretrained(checkpoint)
train_path = '/home/cjw/kaggle/feedback/train.csv'
test_path = '/home/cjw/kaggle/feedback/test.csv'
Inputid_train, Labelid_train, typeids_train, inputmask_train = load_data(train_path, tokenizer)
batch_size = 8
dataset = myDataset(Inputid_train, inputmask_train, typeids_train, Labelid_train)
train_iter = torch.utils.data.DataLoader(dataset, batch_size, shuffle=True)
net = Model(checkpoint, config)
num_epochs, lr, weight_decay, devices = 10, 2e-5, 1e-4, d2l.try_all_gpus()
print("baseline:", checkpoint)
print("training...")
train(net, train_iter, lr, weight_decay, num_epochs, devices)
print("evaling...")
predictions = eval(test_path, net, devices, 10).cpu()
submission = pd.read_csv('/home/cjw/kaggle/feedback/sample_submission.csv')
# the column order follows the label mapping in load_data: Adequate=0, Effective=1, Ineffective=2
submission['Adequate'] = predictions[:, 0]
submission['Effective'] = predictions[:, 1]
submission['Ineffective'] = predictions[:, 2]
print(submission)
submission.to_csv('submission.csv', index=False)
I only ran 10 epochs here, because the texts in this dataset are long and training is time-consuming; the results show that the loss and accuracy were still improving.
The results are as follows:
# Single task
baseline: microsoft/deberta-base
training...
...
epoch 6, train_loss 0.9447965523255608, train_acc 0.6066639466884265
epoch 7, train_loss 0.934909532415649, train_acc 0.6165374677002584
epoch 8, train_loss 0.9426709712978235, train_acc 0.6087583299333605
epoch 9, train_loss 0.922136064427536, train_acc 0.6262749898000816
epoch 10, train_loss 0.9017313893615317, train_acc 0.6450700394396844
evaling...
discourse_id Ineffective Adequate Effective
0 a261b6e14276 7.290736e-07 0.997184 0.002815
1 5a88900e7dc1 9.837714e-07 0.013967 0.986032
2 9790d835736b 3.147095e-07 0.999499 0.000500
3 75ce6d68b67b 3.179148e-07 0.999486 0.000514
4 93578d946723 3.764216e-07 0.999225 0.000775
5 2e214524dbe3 3.639196e-07 0.999282 0.000717
6 84812fc2ab9f 3.567264e-06 0.959672 0.040324
7 c668ff840720 3.426721e-06 0.143535 0.856462
8 739a6d00f44a 2.438440e-06 0.978529 0.021468
9 bcfae2c9a244 2.324955e-06 0.070954 0.929044
Next, consider this: whether a sentence is "effective", "adequate", or "ineffective" is surely also related to which discourse element it belongs to. So can we train the model to classify each sentence both into the three effectiveness classes and into the seven discourse-element classes at the same time? The answer is yes. After all, BERT itself is pretrained in a multi-task fashion, with the masked language model task and the next sentence prediction task.
Let's try the simplest thing: both tasks are classification, so their loss values should be on a similar scale and converge at a roughly similar rate, and we can just add the two cross-entropy losses together and optimize the sum. On the model side, we set up two MLP heads, one for each classification task.
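One design note before the script: in the code below, each head is computed with its own forward call, so the encoder runs twice per batch. A variant that runs the encoder once and feeds the same [CLS] vector to both heads would roughly halve the forward cost; here is a minimal sketch of that alternative (a hypothetical TwoHeadModel, not the model used below):
import torch
from transformers import AutoModel
class TwoHeadModel(torch.nn.Module):
    # single-pass variant: one encoder forward, two classification heads
    def __init__(self, checkpoint):
        super().__init__()
        self.pretrained = AutoModel.from_pretrained(checkpoint)
        self.fc_a = torch.nn.Linear(768, 3)  # effectiveness head
        self.fc_b = torch.nn.Linear(768, 7)  # discourse-type head
    def forward(self, input_ids, attention_mask, token_type_ids):
        out = self.pretrained(input_ids=input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids)
        cls = out.last_hidden_state[:, 0]
        return self.fc_a(cls), self.fc_b(cls)  # raw logits for both tasks
# the training step would then be: l = loss(logits_a, y_a) + loss(logits_b, y_b)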
Now let's verify the idea with the full script:
from transformers import AutoConfig, AutoModel, AutoTokenizer
import torch
import time
from transformers import get_cosine_schedule_with_warmup
from d2l import torch as d2l
import pandas as pd
# Downstream model with two heads: fc_a for the 3-way effectiveness task, fc_b for the 7-way discourse-type task
class Model(torch.nn.Module):
    def __init__(self, checkpoint, config):
        super().__init__()
        self.pretrained = AutoModel.from_pretrained(checkpoint, config=config)
        self.fc_a = torch.nn.Sequential(torch.nn.Linear(768, 3))
        self.fc_b = torch.nn.Sequential(torch.nn.Linear(768, 7))
    def forward(self, input_ids, attention_mask, token_type_ids, class_num=3):
        out = self.pretrained(input_ids=input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids)
        logits = out.last_hidden_state[:, 0]  # [CLS] representation
        if class_num == 3:
            logits = self.fc_a(logits)
        elif class_num == 7:
            logits = self.fc_b(logits)
        logits = logits.softmax(dim=1)
        return logits
# Training dataset: pre-tokenized tensors plus both label sets
class myDataset(torch.utils.data.Dataset):
    def __init__(self, sentences, attention_mask, token_type_ids, label_effectiveness, label_type):
        super(myDataset, self).__init__()
        self.sentences = torch.tensor(sentences)
        self.attention_mask = torch.tensor(attention_mask)
        self.token_type_ids = torch.tensor(token_type_ids)
        self.label_effectiveness = torch.tensor(label_effectiveness)
        self.label_type = torch.tensor(label_type)
    def __len__(self):
        return self.sentences.shape[0]
    def __getitem__(self, idx):
        return self.sentences[idx], self.attention_mask[idx], self.token_type_ids[idx], self.label_effectiveness[idx], self.label_type[idx]
# Test dataset: same fields, but no labels
class testDataset(torch.utils.data.Dataset):
    def __init__(self, sentences, attention_mask, token_type_ids):
        super(testDataset, self).__init__()
        self.sentences = torch.tensor(sentences)
        self.attention_mask = torch.tensor(attention_mask)
        self.token_type_ids = torch.tensor(token_type_ids)
    def __len__(self):
        return self.sentences.shape[0]
    def __getitem__(self, idx):
        return self.sentences[idx], self.attention_mask[idx], self.token_type_ids[idx]
# Read the csv and tokenize every discourse_text to a fixed length of 512; return both label sets
def load_data(file_path, tokenizer):
    df = pd.read_csv(file_path)
    sentences = df['discourse_text'].tolist()
    label_effectiveness = df['discourse_effectiveness'].replace({'Adequate': 0, 'Effective': 1, 'Ineffective': 2}).tolist()
    label_type = df['discourse_type'].replace({'Lead': 0, 'Position': 1, 'Claim': 2, 'Counterclaim': 3, 'Rebuttal': 4, 'Evidence': 5, 'Concluding Statement': 6}).tolist()
    token_type_ids, attention_mask, input_ids = [], [], []
    for sentence in sentences:
        encode_dict = tokenizer.encode_plus(sentence, max_length=512, padding="max_length", truncation=True)
        input_ids.append(encode_dict["input_ids"])
        token_type_ids.append(encode_dict["token_type_ids"])
        attention_mask.append(encode_dict["attention_mask"])
    return input_ids, label_effectiveness, token_type_ids, attention_mask, label_type
# Multi-task training: the two cross-entropy losses are simply summed and optimized together
def train(net, train_iter, lr, weight_decay, num_epochs, devices):
    total_time = 0
    train_len = len(Inputid_train)
    net = torch.nn.DataParallel(net.to(devices[0]))
    loss = torch.nn.CrossEntropyLoss()
    optimizer = torch.optim.AdamW(net.parameters(), lr=lr, weight_decay=weight_decay)
    schedule = get_cosine_schedule_with_warmup(
        optimizer, num_warmup_steps=len(train_iter), num_training_steps=num_epochs * len(train_iter)
    )
    for epoch in range(num_epochs):
        start_of_epoch = time.time()
        loss_sum = 0
        cor_a = 0
        net.train()
        for idx, (ids, att_mask, type, y_a, y_b) in enumerate(train_iter):
            optimizer.zero_grad()
            ids, att_mask, type, y_a, y_b = ids.to(devices[0]), att_mask.to(devices[0]), type.to(devices[0]), y_a.to(devices[0]), y_b.to(devices[0])
            output_a = net(ids, att_mask, type)               # 3-way effectiveness head
            output_b = net(ids, att_mask, type, class_num=7)  # 7-way discourse-type head
            l_a = loss(output_a, y_a)
            l_b = loss(output_b, y_b)
            l = l_a + l_b
            l.backward()
            optimizer.step()
            schedule.step()
            loss_sum += l.item()
            if (idx + 1) % 20 == 0:
                print("Epoch {:04d} | Step {:06d}/{:06d} | Loss {:.4f} | Time {:.0f}".format(
                    epoch + 1, idx + 1, len(train_iter), loss_sum / (idx + 1), time.time() - start_of_epoch))
            output_a = output_a.argmax(dim=1)
            cor_a += (output_a == y_a).sum()
        acc_a = float(cor_a / train_len)
        if epoch % 1 == 0:
            print(f'epoch {epoch + 1}, train_loss {loss_sum / (len(train_iter))}, train_acc_a {acc_a}')
        end_of_epoch = time.time()
        print("epoch {} duration:".format(epoch + 1), end_of_epoch - start_of_epoch)
        total_time += end_of_epoch - start_of_epoch
    print("total training time: ", total_time)
# Inference on test.csv: returns class probabilities in the original row order
def eval(test_path, net, devices, test_batch_size):
    df = pd.read_csv(test_path)
    sentences = df['discourse_text'].tolist()
    token_type_ids, attention_mask, input_ids = [], [], []
    for sentence in sentences:
        encode_dict = tokenizer.encode_plus(sentence, max_length=512, padding="max_length", truncation=True)
        input_ids.append(encode_dict["input_ids"])
        token_type_ids.append(encode_dict["token_type_ids"])
        attention_mask.append(encode_dict["attention_mask"])
    # shuffle must stay off here, otherwise the predictions no longer line up with the submission rows
    test_iter = torch.utils.data.DataLoader(testDataset(input_ids, attention_mask, token_type_ids), test_batch_size, shuffle=False)
    net.eval()
    outputs = []
    with torch.no_grad():
        for ids, att, tpe in test_iter:
            ids, att, tpe = ids.to(devices[0]), att.to(devices[0]), tpe.to(devices[0])
            outputs.append(net(ids, att, tpe))
    return torch.cat(outputs, dim=0)
checkpoint = 'microsoft/deberta-base'
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
config = AutoConfig.from_pretrained(checkpoint)
train_path = '/home/cjw/kaggle/feedback/train.csv'
test_path = '/home/cjw/kaggle/feedback/test.csv'
Inputid_train, Labelid_train, typeids_train, inputmask_train, label_type = load_data(train_path, tokenizer)
batch_size = 8
# note: unlike the single-task script, this loader does not shuffle and drops the last incomplete batch
train_iter = torch.utils.data.DataLoader(myDataset(Inputid_train, inputmask_train, typeids_train, Labelid_train, label_type), batch_size, drop_last=True)
net = Model(checkpoint, config)
num_epochs, lr, weight_decay, devices = 10, 2e-5, 1e-4, d2l.try_all_gpus()
print("baseline:", checkpoint)
print("training...")
train(net, train_iter, lr, weight_decay, num_epochs, devices)
print("evaling...")
predictions = eval(test_path, net, devices, 10).cpu()
submission = pd.read_csv('/home/cjw/kaggle/feedback/sample_submission.csv')
# the column order follows the label mapping in load_data: Adequate=0, Effective=1, Ineffective=2
submission['Adequate'] = predictions[:, 0]
submission['Effective'] = predictions[:, 1]
submission['Ineffective'] = predictions[:, 2]
print(submission)
submission.to_csv('submission.csv', index=False)
Again I only ran 10 epochs. To keep variables controlled, the two scripts differ only in single-task versus multi-task training. The results again show that the loss and accuracy were still improving, but since we only want to compare single-task against multi-task, we are not chasing the best possible score.
# Multi-task
baseline: microsoft/deberta-base
training...
...
epoch 6, train_loss 2.396573264969835, train_acc_a 0.7231334447860718
epoch 7, train_loss 2.3707293589823393, train_acc_a 0.7420101165771484
epoch 8, train_loss 2.350535287120267, train_acc_a 0.7575955390930176
epoch 9, train_loss 2.3366618679968134, train_acc_a 0.768774688243866
epoch 10, train_loss 2.3300860160063865, train_acc_a 0.7741058468818665
evaling...
discourse_id Ineffective Adequate Effective
0 a261b6e14276 0.000032 0.749933 0.250035
1 5a88900e7dc1 0.000004 0.000244 0.999752
2 9790d835736b 0.000010 0.999960 0.000030
3 75ce6d68b67b 0.000047 0.993598 0.006354
4 93578d946723 0.000030 0.726694 0.273276
5 2e214524dbe3 0.000015 0.000096 0.999889
6 84812fc2ab9f 0.000065 0.988569 0.011366
7 c668ff840720 0.000003 0.000084 0.999913
8 739a6d00f44a 0.000016 0.999982 0.000002
9 bcfae2c9a244 0.000053 0.888959 0.110988
Comparing the accuracy of the two runs, the multi-task result is about 13 percentage points higher than the single-task result (train_acc 0.774 versus 0.645 at epoch 10), which shows the gain that the extra feature (the discourse-type label) brings to the task in this setting.
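A caveat on that comparison: both figures are training accuracy, so part of the gap may reflect fitting rather than generalization. A fairer check would hold out a validation split and compare the two models there; a rough sketch on top of the load_data outputs above (sklearn is an extra dependency here, and the 10% split is arbitrary):
from sklearn.model_selection import train_test_split
indices = list(range(len(Inputid_train)))
idx_tr, idx_val = train_test_split(indices, test_size=0.1, random_state=42)  # hold out 10% of the rows
pick = lambda data, idx: [data[i] for i in idx]
val_set = myDataset(pick(Inputid_train, idx_val), pick(inputmask_train, idx_val), pick(typeids_train, idx_val),
                    pick(Labelid_train, idx_val), pick(label_type, idx_val))
val_iter = torch.utils.data.DataLoader(val_set, batch_size, shuffle=False)
# run the model with net.eval() and torch.no_grad() over val_iter for both scripts
# and compare effectiveness accuracy there instead of on the training set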
To reiterate, the point here is just to learn how to do multi-task learning with a pretrained model. Doing it exactly this way in a competition or in production would be far too wasteful: it is better to do some feature engineering first before feeding the data into the pretrained model, otherwise you will run out of GPU memory or spend a very long time training.
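On the memory and time point, one easy saving (independent of any feature engineering) is to stop padding every sentence to 512 tokens up front: tokenize without padding and let DataCollatorWithPadding pad each batch only to its own longest sequence. A rough sketch, not the loaders used above:
import pandas as pd
import torch
from transformers import AutoTokenizer, DataCollatorWithPadding
tokenizer = AutoTokenizer.from_pretrained('microsoft/deberta-base')
collator = DataCollatorWithPadding(tokenizer=tokenizer)
sentences = pd.read_csv(train_path)['discourse_text'].tolist()  # train_path as defined above
features = [tokenizer(s, truncation=True, max_length=512) for s in sentences]  # no fixed-length padding here
loader = torch.utils.data.DataLoader(features, batch_size=8, shuffle=True, collate_fn=collator)
for batch in loader:
    print(batch['input_ids'].shape)  # padded only to the longest sentence in this batch
    break
For real training the labels still need to ride along with each feature dict (the usual pattern is a scalar 'label' field per example, or a small custom collate_fn).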
Also, if the two tasks differ a lot, hyperparameters such as the learning rate do not have to be set the same for both.
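If the tasks (or the heads and the shared encoder) really do need different treatment, PyTorch optimizers accept per-parameter-group settings, so each part can get its own learning rate without changing anything else; the values below are only placeholders, applied to the two-head model defined above before it is wrapped in DataParallel:
optimizer = torch.optim.AdamW([
    {'params': net.pretrained.parameters(), 'lr': 2e-5},  # shared encoder: small learning rate
    {'params': net.fc_a.parameters(), 'lr': 1e-4},        # 3-way effectiveness head
    {'params': net.fc_b.parameters(), 'lr': 5e-5},        # 7-way discourse-type head
], weight_decay=1e-4)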