The Kaggle competition Google QUEST Q&A Labeling is a text classification problem (a multi-label classification problem).
Size: 6,079 training samples, 476 test samples
Features (10 columns): text features question_title, question_body, answer; non-text features question_user_name, question_user_page, answer_user_name, answer_user_page, url, category, host
Labels (30 columns): 21 question-related labels and 9 answer-related labels
3. Evaluation metric
The metric is the Spearman rank correlation coefficient: the correlation is computed for each label, and the average over all labels is the final score, as sketched below.
The more familiar correlation coefficient is the Pearson correlation coefficient $\rho_{X,Y}=\frac{\mathrm{cov}(X,Y)}{\sigma_X\sigma_Y}$, whereas the Spearman correlation coefficient is $r_s=1-\frac{6\sum d_i^2}{n(n^2-1)}$, where $d_i$ is the difference between the two ranks of observation $i$. The Spearman coefficient measures the correlation between the rankings of X and Y: identical rankings give 1, completely reversed rankings give -1.
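As a minimal sketch of the metric (assuming y_true and y_pred are numpy arrays of shape (n_samples, 30)), the competition score can be reproduced with scipy:
import numpy as np
from scipy.stats import spearmanr
def competition_score(y_true, y_pred):
    # mean column-wise Spearman rank correlation over the 30 labels
    scores = [spearmanr(y_true[:, col], y_pred[:, col]).correlation
              for col in range(y_true.shape[1])]
    return np.mean(scores)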
Except for question_type_spelling, whose maximum is 0.67, all labels have a maximum of 1:
df_train.iloc[:,11:].max()
question_asker_intent_understanding 1.000000
question_body_critical 1.000000
question_conversational 1.000000
question_expect_short_answer 1.000000
question_fact_seeking 1.000000
question_has_commonly_accepted_answer 1.000000
question_interestingness_others 1.000000
question_interestingness_self 1.000000
question_multi_intent 1.000000
question_not_really_a_question 1.000000
question_opinion_seeking 1.000000
question_type_choice 1.000000
question_type_compare 1.000000
question_type_consequence 1.000000
question_type_definition 1.000000
question_type_entity 1.000000
question_type_instructions 1.000000
question_type_procedure 1.000000
question_type_reason_explanation 1.000000
question_type_spelling 0.666667
question_well_written 1.000000
answer_helpful 1.000000
answer_level_of_information 1.000000
answer_plausible 1.000000
answer_relevance 1.000000
answer_satisfaction 1.000000
answer_type_instructions 1.000000
answer_type_procedure 1.000000
answer_type_reason_explanation 1.000000
answer_well_written 1.000000
dtype: float64
Except for a small number of labels whose minimum is 0.33 (and answer_satisfaction, whose minimum is 0.2), the rest have a minimum of 0:
df_train.iloc[:,11:].min()
question_asker_intent_understanding 0.333333
question_body_critical 0.333333
question_conversational 0.000000
question_expect_short_answer 0.000000
question_fact_seeking 0.000000
question_has_commonly_accepted_answer 0.000000
question_interestingness_others 0.333333
question_interestingness_self 0.333333
question_multi_intent 0.000000
question_not_really_a_question 0.000000
question_opinion_seeking 0.000000
question_type_choice 0.000000
question_type_compare 0.000000
question_type_consequence 0.000000
question_type_definition 0.000000
question_type_entity 0.000000
question_type_instructions 0.000000
question_type_procedure 0.000000
question_type_reason_explanation 0.000000
question_type_spelling 0.000000
question_well_written 0.333333
answer_helpful 0.333333
answer_level_of_information 0.333333
answer_plausible 0.333333
answer_relevance 0.333333
answer_satisfaction 0.200000
answer_type_instructions 0.000000
answer_type_procedure 0.000000
answer_type_reason_explanation 0.000000
answer_well_written 0.333333
dtype: float64
All labels take only a limited set of specific values between 0 and 1. Since the labels come from human ratings, the set of possible scores is finite, so the predictions can be post-processed to improve the final result:
df_train.iloc[:,11:].nunique()
question_asker_intent_understanding 9
question_body_critical 9
question_conversational 5
question_expect_short_answer 5
question_fact_seeking 5
question_has_commonly_accepted_answer 5
question_interestingness_others 9
question_interestingness_self 9
question_multi_intent 5
question_not_really_a_question 5
question_opinion_seeking 5
question_type_choice 5
question_type_compare 5
question_type_consequence 5
question_type_definition 5
question_type_entity 5
question_type_instructions 5
question_type_procedure 5
question_type_reason_explanation 5
question_type_spelling 3
question_well_written 9
answer_helpful 9
answer_level_of_information 9
answer_plausible 9
answer_relevance 9
answer_satisfaction 17
answer_type_instructions 5
answer_type_procedure 5
answer_type_reason_explanation 5
answer_well_written 9
dtype: int64
Inspecting the features against the labels, the asker/answerer user names and profile pages (question_user_name, question_user_page, answer_user_name, answer_user_page) and url are too scattered to be useful and are not used for training, while category and host can be fed into the model as features.
Since the label values are limited (every dimension takes specific values between 0 and 1), the predictions are discretized at the end, as sketched below.
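A minimal post-processing sketch (assuming df_train and a prediction array y_pred as above; the column offset 11 matches the iloc slicing used earlier): snap each predicted value to the nearest score actually observed for that label in the training data.
import numpy as np
def discretize(y_pred, df_train):
    y_post = y_pred.copy()
    for i, col in enumerate(df_train.columns[11:]):
        allowed = np.sort(df_train[col].unique())
        # for each prediction, pick the closest allowed score
        idx = np.abs(y_post[:, [i]] - allowed[None, :]).argmin(axis=1)
        y_post[:, i] = allowed[idx]
    return y_post
Because Spearman correlation is sensitive to ties, whether this rounding actually helps should be checked against the out-of-fold validation score.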
BERT's maximum sequence length is 512. Since the question and answer texts are long, Seq Length is set to 512, for which the recommended maximum Batch Size is 6.
1. First prepare the data: process it into the BERT model inputs input_ids, token_type_ids and attention_mask.
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased-vocab.txt')
# The question title and body are concatenated as the first segment and the answer is the second segment.
# encode_text consists of input_ids (ids of the tokens in the text, 0 for padding),
# token_type_ids (0 for the first segment, 1 for the second segment, 0 for padding)
# and attention_mask (1 for real tokens, 0 for padded tokens).
encode_text = tokenizer.encode_plus(question_title + ' ' + question_body, answer, max_length=512, pad_to_max_length=True)
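A minimal sketch (assuming df_train as above; build_inputs is an illustrative helper, not part of the original code) of turning the whole dataframe into the arrays used by model.fit in step 3:
import numpy as np
def build_inputs(df, tokenizer, max_length=512):
    input_ids, token_type_ids, attention_mask = [], [], []
    for _, row in df.iterrows():
        enc = tokenizer.encode_plus(row['question_title'] + ' ' + row['question_body'], row['answer'],
                                    max_length=max_length, pad_to_max_length=True)
        input_ids.append(enc['input_ids'])
        token_type_ids.append(enc['token_type_ids'])
        attention_mask.append(enc['attention_mask'])
    return np.array(input_ids), np.array(attention_mask), np.array(token_type_ids)
train_input_ids, train_attention_mask, train_token_type_ids = build_inputs(df_train, tokenizer)
y_train = df_train.iloc[:, 11:].values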
2. Build the model: input layers (input_ids, attention_mask, token_type_ids), a middle layer (bert_model), and an output layer (a Dense layer producing the 30 label dimensions).
MAX_LENGTH = 512
from transformers import BertConfig, TFBertModel
config = BertConfig.from_pretrained('bert-base-uncased-config.json')
bert_model = TFBertModel.from_pretrained('bert-base-uncased-tf_model.h5', config=config)
from tensorflow import keras
input_ids = keras.layers.Input(shape=(MAX_LENGTH,), dtype='int32')
attention_mask = keras.layers.Input(shape=(MAX_LENGTH,), dtype='int32')
token_type_ids = keras.layers.Input(shape=(MAX_LENGTH,), dtype='int32')
_, x = bert_model(input_ids, attention_mask, token_type_ids)
outputs = keras.layers.Dense(30, activation='sigmoid')(x)
model = keras.models.Model(inputs=[input_ids, attention_mask, token_type_ids], outputs=outputs)
model.compile(loss='binary_crossentropy', optimizer=keras.optimizers.Adam(lr=0.001))
3. Model training and prediction
BATCH_SIZE = 6
early_stopping = keras.callbacks.EarlyStopping(patience=3, restore_best_weights=True)
# Train the model
model.fit([train_input_ids, train_attention_mask, train_token_type_ids], y_train,
validation_data=([valid_input_ids, valid_attention_mask, valid_token_type_ids],y_valid),
batch_size=BATCH_SIZE, epochs=5, callbacks=[early_stopping])
# Predict on the test set
y_pred = model.predict([test_input_ids, test_attention_mask, test_token_type_ids], batch_size=BATCH_SIZE)
The combined length of question_title, question_body and answer averages around 300 words, with a maximum of about 8,000. Moreover, the question dimensions depend mainly on the question, while the answer dimensions depend mainly on the answer. The 30-dimension prediction is therefore split into 2 sub-problems: a model trained on question_title and question_body predicts the question-related dimensions, and a model trained on question_title and answer predicts the answer-related dimensions. For each sub-problem, 2 kinds of models (roberta-base and roberta-large) are trained with 5-fold cross-validation, as sketched below.
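A minimal sketch of the label split and the cross-validation loop (assuming df_train as above; the fold handling is illustrative, the model itself is defined in the next block):
from sklearn.model_selection import KFold
label_cols = df_train.columns[11:]
question_cols = [c for c in label_cols if c.startswith('question_')]  # 21 question labels
answer_cols = [c for c in label_cols if c.startswith('answer_')]      # 9 answer labels
kf = KFold(n_splits=5, shuffle=True, random_state=42)
for fold, (trn_idx, val_idx) in enumerate(kf.split(df_train)):
    y_trn = df_train[question_cols].values[trn_idx]
    y_val = df_train[question_cols].values[val_idx]
    # build a fresh model per fold (bert_model_question below), train on trn_idx,
    # and keep the out-of-fold predictions on val_idx for blending and post-processing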
Roberta-base pre-trained model
import tensorflow as tf
from transformers import RobertaTokenizer, TFRobertaModel

tokenizer = RobertaTokenizer(BERT_PATH + "roberta-base-vocab.json", BERT_PATH + "roberta-base-merges.txt")

def bert_model_question():
    input_word_ids = tf.keras.layers.Input(
        (MAX_SEQUENCE_LENGTH,), dtype=tf.int32, name='input_word_ids')
    input_masks = tf.keras.layers.Input(
        (MAX_SEQUENCE_LENGTH,), dtype=tf.int32, name='input_masks')
    feature_embeddings = tf.keras.layers.Input(
        (feature_shape,), dtype=tf.float32, name='feature_embeddings')
    bert_model = TFRobertaModel.from_pretrained(BERT_PATH, output_hidden_states=True)
    # with output_hidden_states=True, the third output is the tuple of all hidden states
    hidden_output = bert_model([input_word_ids, input_masks])[2]
    # concatenate the last three hidden layers along the feature axis
    last_cat_sequence = tf.concat(
        (hidden_output[-1], hidden_output[-2], hidden_output[-3]),
        2,
    )
    x = tf.keras.layers.GlobalAveragePooling1D()(last_cat_sequence)
    # append the hand-crafted features (USE embeddings, similarity, one-hot category)
    x = tf.concat((x, feature_embeddings), 1)
    out = tf.keras.layers.Dense(21, activation="sigmoid", name="dense_output")(x)
    model = tf.keras.models.Model(
        inputs=[input_word_ids, input_masks, feature_embeddings], outputs=out)
    return model
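For example (a hedged usage sketch: BERT_PATH, MAX_SEQUENCE_LENGTH and feature_shape are assumed to be defined, and q_input_ids, q_attention_mask, q_features are illustrative names for the tokenized question inputs and the extra feature matrix; the learning rate and batch size are assumptions):
model_q = bert_model_question()
model_q.compile(loss='binary_crossentropy', optimizer=tf.keras.optimizers.Adam(lr=3e-5))
model_q.fit([q_input_ids, q_attention_mask, q_features], df_train[question_cols].values,
            batch_size=8, epochs=3)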
On top of the text inputs, an additional set of features is added: text similarity. The Universal Sentence Encoder is used to obtain vector representations of the texts, and the Euclidean distance and cosine distance between the two vectors give the similarity features. The text-embedding features built for the question-related and answer-related dimensions have shape (n_samples, 512*2), and the similarity features have shape (n_samples, 2).
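A minimal sketch of these features (assuming a local copy of the Universal Sentence Encoder loadable with tensorflow_hub; the module path is illustrative):
import numpy as np
import tensorflow_hub as hub
use_model = hub.load('universal-sentence-encoder-large-5')  # 512-dimensional sentence embeddings
q_emb = use_model((df_train['question_title'] + ' ' + df_train['question_body']).tolist()).numpy()
a_emb = use_model(df_train['answer'].tolist()).numpy()
# (n_samples, 512*2): question and answer embeddings side by side
text_features = np.hstack([q_emb, a_emb])
# (n_samples, 2): Euclidean distance and cosine distance between question and answer embeddings
euclid = np.linalg.norm(q_emb - a_emb, axis=1, keepdims=True)
cos_sim = np.sum(q_emb * a_emb, axis=1, keepdims=True) / (
    np.linalg.norm(q_emb, axis=1, keepdims=True) * np.linalg.norm(a_emb, axis=1, keepdims=True))
similarity_features = np.hstack([euclid, 1 - cos_sim])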
The non-text feature category is one-hot encoded, giving a feature of shape (n_samples, 64).
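A minimal sketch with scikit-learn (the exact output width depends on how many distinct categories/hosts are kept; the 64 above is taken from the original write-up):
from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder(sparse=False, handle_unknown='ignore')
# fit on the training data, then transform train and test with the same encoder
category_onehot = encoder.fit_transform(df_train[['category']])
category_onehot_test = encoder.transform(df_test[['category']])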