Fine-tuning Transformer-based pretrained models for downstream tasks

Environment and dependencies:

Python 3.9
'''
Required packages:
tensorflow==2.11.0
transformers==4.26.0
pandas==1.3.5
scikit-learn==1.0.2
'''

The training code is as follows:

from transformers import BertTokenizer,TFBertForSequenceClassification
import tensorflow as tf
from sklearn.model_selection import train_test_split
import pandas as pd
max_length = 40
tokenizer = BertTokenizer.from_pretrained('bert-base-chinese')


def split_dataset(df):
    train_set, x = train_test_split(df,
        stratify=df['label'],
        test_size=0.1,
        random_state=42)
    val_set, test_set = train_test_split(x,
        stratify=x['label'],
        test_size=0.5,
        random_state=43)

    return train_set,val_set, test_set


df_raw = pd.read_csv("data/originalthuctcdata/THUCTC_subdata.txt",sep="\t",header=None,names=["text","label"])
# map each category name to an integer class id (column "y")
df_label = pd.DataFrame({"label":["财经","房产","股票","教育","科技","社会","时政","体育","游戏","娱乐"],"y":list(range(10))})
df_raw = pd.merge(df_raw,df_label,on="label",how="left")

train_data,val_data, test_data = split_dataset(df_raw)


def convert_example_to_feature(review):
    return tokenizer.encode_plus(review,
                                 add_special_tokens=True,  # add [CLS], [SEP]
                                 padding='max_length',  # pad shorter texts up to max_length
                                 truncation=True,  # cut longer texts down to max_length so every row has the same size
                                 max_length=max_length,  # fixed sequence length fed to BERT
                                 return_attention_mask=True,  # add attention mask to not focus on pad tokens
                                 )


# map to the input format expected by TFBertForSequenceClassification
def map_example_to_dict(input_ids, attention_masks, token_type_ids, label):
    return {
        "input_ids": input_ids,
        "token_type_ids": token_type_ids,
        "attention_mask": attention_masks,
    }, label


def encode_examples(ds, limit=-1):
    # prepare list, so that we can build up final TensorFlow dataset from slices.
    input_ids_list = []
    token_type_ids_list = []
    attention_mask_list = []
    label_list = []
    if limit > 0:
        ds = ds.head(limit)  # ds is a pandas DataFrame, so head() keeps the first `limit` rows

    for index, row in ds.iterrows():
        review = row["text"]
        label = row["y"]
        bert_input = convert_example_to_feature(review)

        input_ids_list.append(bert_input['input_ids'])
        token_type_ids_list.append(bert_input['token_type_ids'])
        attention_mask_list.append(bert_input['attention_mask'])
        label_list.append([label])
    return tf.data.Dataset.from_tensor_slices(
        (input_ids_list, attention_mask_list, token_type_ids_list, label_list)).map(map_example_to_dict)

# train dataset
batch_size=100
ds_train_encoded = encode_examples(train_data).shuffle(10000).batch(batch_size)

# val dataset
ds_val_encoded = encode_examples(val_data).batch(batch_size)
# test dataset
ds_test_encoded = encode_examples(test_data).batch(batch_size)


# recommended learning rate for Adam 5e-5, 3e-5, 2e-5
learning_rate = 2e-5
# we will do just 1 epoch for illustration, though multiple epochs might be better as long as we will not overfit the model
number_of_epochs = 1

# model initialization
model = TFBertForSequenceClassification.from_pretrained('bert-base-chinese', num_labels=10)

# optimizer Adam recommended
optimizer = tf.keras.optimizers.Adam(learning_rate=learning_rate,epsilon=1e-08, clipnorm=1)

# labels are integer class ids rather than one-hot vectors, so use sparse categorical cross-entropy and accuracy
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
metric = tf.keras.metrics.SparseCategoricalAccuracy('accuracy')
model.compile(optimizer=optimizer, loss=loss, metrics=[metric])

# fit model
bert_history = model.fit(ds_train_encoded, epochs=number_of_epochs, validation_data=ds_val_encoded)
# evaluate test set
model.evaluate(ds_test_encoded)
tf.keras.models.save_model(model,filepath="my_model")
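
A quick note on split_dataset in the script above: the first train_test_split call holds out 10% of the rows, and the second call cuts that held-out part in half, so the data ends up split roughly 90/5/5 into training, validation and test sets. The arithmetic can be sketched as below. The helper name split_fractions is hypothetical and not part of the original script, and it only computes fractions; it does not split any data.

```python
def split_fractions(first_test_size=0.1, second_test_size=0.5):
    """Return (train, val, test) fractions for a two-stage hold-out split.

    Hypothetical helper: it mirrors the test_size values passed to
    train_test_split in split_dataset, but performs no actual splitting.
    """
    held_out = first_test_size            # share removed from the training set
    train = 1.0 - held_out
    test = held_out * second_test_size    # "test" side of the second split
    val = held_out - test                 # the rest of the held-out share
    return train, val, test


train_frac, val_frac, test_frac = split_fractions()
print(f"train={train_frac:.2f} val={val_frac:.2f} test={test_frac:.2f}")
```

With the values used in this post, the fractions come out as 0.90, 0.05 and 0.05. The actual row counts also depend on how the splitter rounds, so they may differ by a row or so.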

Adjust the batch size to suit your own hardware when training. A BERT-style Transformer model takes input of the following form:

{"input_ids":[[]],"token_type_ids":[[]],"attention_mask":[[]]}

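  1. input_ids: the token IDs of the input text after tokenization, padded or truncated to the required length

# map tokens to their ids in the WordPiece vocabulary
# (tokenized is assumed to be the list of WordPiece tokens, max_length_test the target length)
input_ids = tokenizer.convert_tokens_to_ids(tokenized)

# precalculation of pad length, so that we can reuse it later on
padding_length = max_length_test - len(input_ids)

# pad with 0 (the [PAD] id) when the text is shorter than the max length
input_ids = input_ids + ([0] * padding_length)

  2. token_type_ids: marks which sentence each token belongs to; it is mainly used to tell the sentences apart when several sentences are fed to the model together

  3. attention_mask: separates the real tokens in input_ids from the padding

# attention should focus just on sequence with non padded tokens
# (input_ids is already padded at this point, so count only the real tokens)
attention_mask = [1] * (len(input_ids) - padding_length)

# do not focus attention on padded tokens
attention_mask = attention_mask + ([0] * padding_length)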
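
Putting the three fields together, here is a minimal pure-Python sketch of what the tokenizer roughly produces for a single sentence, assuming the token IDs are already known. The function name build_bert_inputs and the example token IDs are hypothetical. The IDs 101, 102 and 0 for [CLS], [SEP] and [PAD] are assumed to match the bert-base-chinese vocabulary. In real code tokenizer.encode_plus does all of this for you.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # assumed special-token ids


def build_bert_inputs(token_ids, max_length):
    """Pad or truncate one sentence to max_length and build the three model inputs."""
    if max_length < 2:
        raise ValueError("max_length must leave room for [CLS] and [SEP]")

    # Reserve two positions for [CLS] and [SEP], truncating the text if needed.
    body = token_ids[: max_length - 2]
    input_ids = [CLS_ID] + body + [SEP_ID]

    # Build the mask before padding so it covers only the real tokens.
    attention_mask = [1] * len(input_ids)

    padding_length = max_length - len(input_ids)
    input_ids = input_ids + [PAD_ID] * padding_length
    attention_mask = attention_mask + [0] * padding_length

    # A single-sentence input belongs entirely to segment 0.
    token_type_ids = [0] * max_length

    return {
        "input_ids": input_ids,
        "token_type_ids": token_type_ids,
        "attention_mask": attention_mask,
    }


# Three made-up token ids, padded out to a length of 8.
example = build_bert_inputs([2769, 4263, 800], max_length=8)
for name, values in example.items():
    print(name, values)
```

All three lists always have exactly max_length entries, which is what lets tf.data.Dataset.from_tensor_slices stack them into rectangular tensors.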

Prediction with the trained model:

from transformers import BertTokenizer, TFBertForSequenceClassification
import tensorflow as tf
max_length = 40
tokenizer = BertTokenizer.from_pretrained('bert-base-chinese')

my_model = tf.keras.models.load_model(filepath="my_model")
input_message = "82岁老太为学生做饭扫地44年获授港大荣誉院士"

bert_input = tokenizer.encode_plus(input_message,
                                   add_special_tokens=True,  # add [CLS], [SEP]
                                   padding='max_length',  # pad shorter texts up to max_length
                                   truncation=True,  # cut longer texts down to max_length, as in training
                                   max_length=max_length,  # fixed sequence length fed to BERT
                                   return_attention_mask=True,  # add attention mask to not focus on pad tokens
                                   )
predict_result = my_model({
    "input_ids":[bert_input['input_ids']],
    "token_type_ids":[bert_input['token_type_ids']],
    "attention_mask":[bert_input['attention_mask']],
          })

print(predict_result)

The model can also be deployed to TF Serving.

Note that the model returns logits, i.e. values that have not been passed through softmax. To return probabilities instead, apply softmax manually: tf.math.softmax(predict_result["logits"], axis=1)
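
To make that last step concrete, here is a small sketch of what softmax does to one row of logits and how the top class maps back to a category name. It is written in plain Python so it runs without TensorFlow. The logits are made up for illustration, the helper names are hypothetical, and the label order is assumed to follow the df_label table in the training script.

```python
import math

# Label order assumed to match df_label in the training script (y = 0..9).
LABELS = ["财经", "房产", "股票", "教育", "科技", "社会", "时政", "体育", "游戏", "娱乐"]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)                       # subtract the max to avoid overflow
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_label(logits, labels=LABELS):
    """Return the most likely label and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs[best]


# Made-up logits for a single input, one value per class.
fake_logits = [0.2, -1.0, 0.1, 0.5, -0.3, 4.0, 1.2, -0.8, -1.5, 0.0]
label, prob = top_label(fake_logits)
print(label, round(prob, 3))
```

With real model output you would feed in one row of predict_result["logits"] converted to a Python list, or simply call tf.math.softmax as described above and take the argmax.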

Full code: bert_related_task: further training of BERT-based pretrained models on a range of tasks to obtain task-specific models (gitee.com)

Re-wrapping BERT-based models and adjusting their structure will be covered in the next post.
