With the environment set up as described in the previous post, the model is ready to use.
Next, we adapt BERT to different downstream tasks to improve accuracy.
This post covers text classification:
1. Data format
Prepare three files: a training set, a validation (dev) set, and a test set. All three share the same format: one example per line, consisting of a label and a piece of text separated by a tab ("\t"). (If you use a different delimiter, modify the _read_tsv method in run_classifier.py accordingly.)
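For example, the first few lines of a tab-separated train file could look like this (the label names and texts below are made-up illustrations; your own label set goes into get_labels() later):

    toxic	this is the worst garbage i have ever read
    insult	what a stupid thing to say
    threat	stop posting or you will regret it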
2. Modify run_classifier.py
(1) Add a class that processes your data, e.g. class MyProcessor, as shown below. (Note: DataProcessor is the base class for reading data files; the official code already ships with subclasses for four datasets: XNLI, MNLI, MRPC and CoLA.)
class MyProcessor(DataProcessor):
  """Processor for my data set."""
  # This version reads the data with pandas, so `import os` and
  # `import pandas as pd` must be added at the top of run_classifier.py.
  # It assumes the label is in the first column and the text in the second;
  # adjust the column indices (and the separator) to match your files.

  def get_train_examples(self, data_dir):
    examples = []
    file_path = os.path.join(data_dir, 'train.csv')
    df = pd.read_csv(file_path, encoding='utf-8')  # use sep='\t' for tab-separated files
    for i, data in enumerate(df.values):
      guid = 'train-%d' % (i)
      text_a = tokenization.convert_to_unicode(str(data[1]))
      label = str(data[0])  # must be one of the strings returned by get_labels()
      examples.append(
          InputExample(guid=guid, text_a=text_a, text_b=None, label=label))
    return examples

  def get_dev_examples(self, data_dir):
    examples = []
    file_path = os.path.join(data_dir, 'dev.csv')
    df = pd.read_csv(file_path, encoding='utf-8')
    for i, data in enumerate(df.values):
      guid = 'dev-%d' % (i)
      text_a = tokenization.convert_to_unicode(str(data[1]))
      label = str(data[0])
      examples.append(
          InputExample(guid=guid, text_a=text_a, text_b=None, label=label))
    return examples

  def get_test_examples(self, data_dir):
    examples = []
    file_path = os.path.join(data_dir, 'test.csv')
    df = pd.read_csv(file_path, encoding='utf-8')
    for i, data in enumerate(df.values):
      guid = 'test-%d' % (i)
      text_a = tokenization.convert_to_unicode(str(data[1]))
      label = str(data[0])
      examples.append(
          InputExample(guid=guid, text_a=text_a, text_b=None, label=label))
    return examples

  def get_labels(self):
    """See base class."""
    return ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult',
            'identity_hate']
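If you stick with the tab-separated format from step 1, you can instead reuse the base class's _read_tsv helper and share a single _create_examples method, in the same style as the built-in processors. The following is only a sketch under those assumptions (file names train.tsv/dev.tsv/test.tsv, label in the first column, text in the second; the class name MyTsvProcessor is made up):

class MyTsvProcessor(DataProcessor):
  """Alternative processor for tab-separated files (train.tsv/dev.tsv/test.tsv)."""

  def get_train_examples(self, data_dir):
    return self._create_examples(
        self._read_tsv(os.path.join(data_dir, "train.tsv")), "train")

  def get_dev_examples(self, data_dir):
    return self._create_examples(
        self._read_tsv(os.path.join(data_dir, "dev.tsv")), "dev")

  def get_test_examples(self, data_dir):
    return self._create_examples(
        self._read_tsv(os.path.join(data_dir, "test.tsv")), "test")

  def get_labels(self):
    return ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult',
            'identity_hate']

  def _create_examples(self, lines, set_type):
    """Creates examples from rows of [label, text]."""
    examples = []
    for (i, line) in enumerate(lines):
      guid = "%s-%d" % (set_type, i)
      text_a = tokenization.convert_to_unicode(line[1])
      label = tokenization.convert_to_unicode(line[0])
      examples.append(
          InputExample(guid=guid, text_a=text_a, text_b=None, label=label))
    return examples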
(2) Register MyProcessor in the processors dict inside main(), as follows:
def main(_):
  tf.logging.set_verbosity(tf.logging.INFO)

  processors = {
      "cola": ColaProcessor,
      "mnli": MnliProcessor,
      "mrpc": MrpcProcessor,
      "xnli": XnliProcessor,
      "my": MyProcessor,
  }
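For reference, main() then looks up the processor by the lowercased --task_name flag, roughly like this (paraphrased from run_classifier.py):

  task_name = FLAGS.task_name.lower()
  if task_name not in processors:
    raise ValueError("Task not found: %s" % (task_name))
  processor = processors[task_name]()
  label_list = processor.get_labels()

This is why passing --task_name=my on the command line selects MyProcessor.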
(3) Converting text into features; a look at the source code.
file_based_convert_examples_to_features converts the examples into features and writes them to a TFRecord file; the per-example work is done by convert_single_example, shown below:
# Converting a single example into features
def convert_single_example(ex_index, example, label_list, max_seq_length,
                           tokenizer):
  """Converts a single `InputExample` into a single `InputFeatures`."""

  # When running eval/predict on TPU, the number of examples must be padded
  # to fill a batch, so fake (padding) examples are created.
  if isinstance(example, PaddingInputExample):
    return InputFeatures(
        input_ids=[0] * max_seq_length,
        input_mask=[0] * max_seq_length,
        segment_ids=[0] * max_seq_length,
        label_id=0,
        is_real_example=False)

  # Map each label string in label_list to an integer id.
  label_map = {}
  for (i, label) in enumerate(label_list):
    label_map[label] = i

  tokens_a = tokenizer.tokenize(example.text_a)
  tokens_b = None
  if example.text_b:
    tokens_b = tokenizer.tokenize(example.text_b)

  if tokens_b:
    # Modifies `tokens_a` and `tokens_b` in place so that the total
    # length is less than the specified length.
    # Account for [CLS], [SEP], [SEP] with "- 3"
    _truncate_seq_pair(tokens_a, tokens_b, max_seq_length - 3)
  else:
    # Account for [CLS] and [SEP] with "- 2"
    if len(tokens_a) > max_seq_length - 2:
      tokens_a = tokens_a[0:(max_seq_length - 2)]

  # The convention in BERT is:
  # (a) For sequence pairs:
  #  tokens:   [CLS] is this jack ##son ##ville ? [SEP] no it is not . [SEP]
  #  type_ids: 0     0  0    0    0     0       0 0     1  1  1  1   1 1
  # (b) For single sequences:
  #  tokens:   [CLS] the dog is hairy . [SEP]
  #  type_ids: 0     0   0   0  0     0 0
  tokens = []
  segment_ids = []
  tokens.append("[CLS]")
  segment_ids.append(0)
  for token in tokens_a:
    tokens.append(token)
    segment_ids.append(0)
  tokens.append("[SEP]")
  segment_ids.append(0)

  if tokens_b:
    for token in tokens_b:
      tokens.append(token)
      segment_ids.append(1)
    tokens.append("[SEP]")
    segment_ids.append(1)

  input_ids = tokenizer.convert_tokens_to_ids(tokens)

  # The mask has 1 for real tokens and 0 for padding tokens. Only real
  # tokens are attended to.
  input_mask = [1] * len(input_ids)

  # Zero-pad up to the sequence length.
  # input_ids:   token ids from the vocabulary
  # input_mask:  1 for real tokens, 0 for padding
  # segment_ids: 0 for text_a, 1 for text_b
  while len(input_ids) < max_seq_length:
    input_ids.append(0)
    input_mask.append(0)
    segment_ids.append(0)

  assert len(input_ids) == max_seq_length
  assert len(input_mask) == max_seq_length
  assert len(segment_ids) == max_seq_length

  label_id = label_map[example.label]
  if ex_index < 5:
    tf.logging.info("*** Example ***")
    tf.logging.info("guid: %s" % (example.guid))
    tf.logging.info("tokens: %s" % " ".join(
        [tokenization.printable_text(x) for x in tokens]))
    tf.logging.info("input_ids: %s" % " ".join([str(x) for x in input_ids]))
    tf.logging.info("input_mask: %s" % " ".join([str(x) for x in input_mask]))
    tf.logging.info("segment_ids: %s" % " ".join([str(x) for x in segment_ids]))
    tf.logging.info("label: %s (id = %d)" % (example.label, label_id))
  feature = InputFeatures(
      input_ids=input_ids,
      input_mask=input_mask,
      segment_ids=segment_ids,
      label_id=label_id,
      is_real_example=True)
  return feature
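To see concretely what one example turns into, here is a small illustrative sketch; the vocab path, max_seq_length and the example text are placeholders, not values from this post:

  # Illustrative only: build a tokenizer and inspect the features of one example.
  tokenizer = tokenization.FullTokenizer(
      vocab_file="./uncased_L-12_H-768_A-12/vocab.txt", do_lower_case=True)
  example = InputExample(guid="demo-0", text_a="what a stupid thing to say",
                         text_b=None, label="insult")
  feature = convert_single_example(0, example, MyProcessor().get_labels(),
                                   max_seq_length=32, tokenizer=tokenizer)
  print(feature.input_ids)    # token ids, zero-padded to max_seq_length
  print(feature.input_mask)   # 1 for real tokens, 0 for padding
  print(feature.segment_ids)  # all zeros for a single-sentence task
  print(feature.label_id)     # index of "insult" in get_labels()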
(4) Adjust the run parameters:
data_dir: the folder containing the datasets.
bert_config_file: the bert_config.json file in the pretrained BERT model directory.
task_name: the task name registered in processors, here "my".
vocab_file: the vocab.txt file in the pretrained BERT model directory.
output_dir: the folder where the trained classifier checkpoints are saved.
init_checkpoint: the bert_model.ckpt checkpoint prefix in the pretrained BERT model directory (a prefix covering the .index/.meta/.data files, not a single file).
do_train: whether to train; set to "True".
do_eval: whether to evaluate on the dev set; set to "True".
do_predict: whether to run prediction on the test set; set to "False" for now.
max_seq_length: maximum length of the input token sequence, i.e. the longest input processed per example; longer inputs are truncated and shorter ones are padded. The upper limit is 512.
train_batch_size: batch size used when computing gradients during training. Larger values train faster but use more memory.
eval_batch_size: batch size used during evaluation. Same trade-off as above.
predict_batch_size: batch size used during prediction. Same trade-off as above.
learning_rate: the step size used when updating weights during backpropagation. Too large a value makes training unstable; too small a value slows convergence. For transfer learning, a relatively small learning rate is usually used (below 2e-4, e.g. 2e-5).
num_train_epochs: the number of full passes over the training set.
warmup_proportion: the fraction of training steps used for learning-rate warmup.
save_checkpoints_steps: how often (in steps) checkpoints are saved.
(5) Run command:
export BERT_BASE_DIR=./uncased_L-12_H-768_A-12
export MY_DATA=./dataset

python run_classifier.py \
  --task_name=my \
  --do_train=true \
  --do_eval=true \
  --data_dir=$MY_DATA \
  --vocab_file=$BERT_BASE_DIR/vocab.txt \
  --bert_config_file=$BERT_BASE_DIR/bert_config.json \
  --init_checkpoint=$BERT_BASE_DIR/bert_model.ckpt \
  --max_seq_length=128 \
  --train_batch_size=32 \
  --learning_rate=2e-5 \
  --num_train_epochs=3.0 \
  --output_dir=./MY_OUTPUT
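Once training and evaluation finish, prediction on the test set can be run in the same way. The command below is a sketch that assumes the fine-tuned checkpoint lives in ./MY_OUTPUT; run_classifier.py writes the predicted class probabilities to test_results.tsv inside output_dir.

python run_classifier.py \
  --task_name=my \
  --do_predict=true \
  --data_dir=$MY_DATA \
  --vocab_file=$BERT_BASE_DIR/vocab.txt \
  --bert_config_file=$BERT_BASE_DIR/bert_config.json \
  --init_checkpoint=./MY_OUTPUT \
  --max_seq_length=128 \
  --output_dir=./MY_OUTPUT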