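Let's start with what Google's BERT actually did. The main contribution is a pre-trained model: a network trained on a very large corpus that produces contextual representations (embeddings) for the words of a sentence. Pre-training relies mainly on masking (the masked language model objective) and the Transformer architecture.
What we do is fine-tune: starting from the pre-trained weights, we continue training the BERT model on our own dataset for a specific task such as QA or classification, and obtain the results.
The program entry point is run_classifier.py; locate the main function and step through it in a debugger.
Note that the MRPC dataset is used here, which is a binary classification dataset (is the second sentence a paraphrase of the first?). Google already provides the processor for this dataset (MrpcProcessor), covering reading the files, the labels, and so on, so we do not write our own processor here.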
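To make the processor's job concrete, here is a minimal sketch of turning tab-separated rows into examples. It is not Google's MrpcProcessor. It assumes the MRPC column layout (label, two ids, two sentences) and uses a hypothetical `InputExample` namedtuple and made-up rows.

```python
import csv
import io
from collections import namedtuple

# Hypothetical stand-in for the InputExample class used by run_classifier.py.
InputExample = namedtuple("InputExample", ["guid", "text_a", "text_b", "label"])

# Made-up rows in the assumed MRPC layout: label, id 1, id 2, sentence 1, sentence 2.
SAMPLE_TSV = (
    "Quality\t#1 ID\t#2 ID\t#1 String\t#2 String\n"
    "1\tid-a\tid-b\tHe said it was fine.\tIt was fine, he said.\n"
    "0\tid-c\tid-d\tThe cat sat down.\tStocks fell on Monday.\n"
)


def read_examples(tsv_text, set_type="train"):
    """Parse TSV text into a list of InputExample records."""
    # QUOTE_NONE keeps any double quotes inside a sentence as plain characters.
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t",
                        quoting=csv.QUOTE_NONE)
    examples = []
    for i, row in enumerate(reader):
        if i == 0:
            continue  # skip the header row
        examples.append(InputExample(guid="%s-%d" % (set_type, i),
                                     text_a=row[3], text_b=row[4],
                                     label=row[0]))
    return examples


examples = read_examples(SAMPLE_TSV)
label_list = ["0", "1"]  # the two MRPC classes
print(examples[0].guid, examples[0].label)
```

Each example keeps the raw sentences and the label as a string; the mapping from label string to integer id happens later, in convert_single_example.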
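if FLAGS.do_train:
  train_examples = processor.get_train_examples(FLAGS.data_dir)  # load the training data
  # number of training steps = (number of examples / batch_size) * epochs
  num_train_steps = int(
      len(train_examples) / FLAGS.train_batch_size * FLAGS.num_train_epochs)
  # Learning-rate warmup: the rate starts small and is ramped back up to
  # its configured value over the first num_warmup_steps steps.
  num_warmup_steps = int(num_train_steps * FLAGS.warmup_proportion)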
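As a worked example of the two formulas above: the log further down reports 3668 training examples. The batch size (32), epoch count (3.0), warmup proportion (0.1) and base learning rate (2e-5) below are assumed values for illustration, and `warmup_lr` is a hypothetical helper that models only the linear ramp, not whatever decay follows it.

```python
num_examples = 3668          # from the "Writing example 0 of 3668" log line
train_batch_size = 32        # assumed flag value
num_train_epochs = 3.0       # assumed flag value
warmup_proportion = 0.1      # assumed flag value
base_lr = 2e-5               # assumed flag value

num_train_steps = int(num_examples / train_batch_size * num_train_epochs)
num_warmup_steps = int(num_train_steps * warmup_proportion)


def warmup_lr(step, base_lr, num_warmup_steps):
    """Linear ramp from 0 to base_lr during warmup (hypothetical helper)."""
    if step < num_warmup_steps:
        return base_lr * step / num_warmup_steps
    return base_lr  # any decay after warmup is deliberately not modeled here


print(num_train_steps, num_warmup_steps)
print(warmup_lr(0, base_lr, num_warmup_steps))
print(warmup_lr(num_warmup_steps, base_lr, num_warmup_steps))
```

With these assumed values, training runs for 343 steps, of which the first 34 are warmup.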
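Next come the model-building steps (skipped in this section; modeling is covered in detail later).
Further down, the following call extracts the features, taking MRPC as the example. (A basic understanding of how TFRecord works in TensorFlow is needed here.)
file_based_convert_examples_to_features(
    train_examples, label_list, FLAGS.max_seq_length, tokenizer, train_file)
The core of that function is the following function: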
def convert_single_example(ex_index, example, label_list, max_seq_length,
                           tokenizer):
STEP1: WordPiece tokenization: tokenize text_a and text_b, which are later joined into the ***tokens*** structure
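  label_map = {}  # build the label map (list -> dict); MRPC has two labels, 0 and 1
  for (i, label) in enumerate(label_list):
    label_map[label] = i
  # WordPiece tokenization: an English word comes in several forms (third
  # person, present tense, ...), so it is split into sub-word pieces.
  tokens_a = tokenizer.tokenize(example.text_a)  # tokenize the first sentence
  tokens_b = None
  # Is there a second sentence?
  if example.text_b:
    tokens_b = tokenizer.tokenize(example.text_b)  # tokenize the second sentence
  # Truncate input that is too long: reserve three positions for special
  # tokens when there is a text_b, two otherwise.
  if tokens_b:
    # Modifies `tokens_a` and `tokens_b` in place so that the total
    # length is less than the specified length.
    # Account for [CLS], [SEP], [SEP] with "- 3"
    # ([CLS]: position used for the 0/1 class label; [SEP]: separator)
    # Truncation: a simple helper; its original source is not reproduced here.
    _truncate_seq_pair(tokens_a, tokens_b, max_seq_length - 3)
  else:
    # Account for [CLS] and [SEP] with "- 2"
    if len(tokens_a) > max_seq_length - 2:
      tokens_a = tokens_a[0:(max_seq_length - 2)]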
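The pair truncation used above can be sketched as follows. This is a reconstruction of the idea (drop one token at a time from whichever list is currently longer), written under the assumption that this is how `_truncate_seq_pair` behaves; the function name without the leading underscore and the sample lists are mine.

```python
def truncate_seq_pair(tokens_a, tokens_b, max_length):
    """Trim the pair in place until len(a) + len(b) <= max_length."""
    while len(tokens_a) + len(tokens_b) > max_length:
        # Remove from the longer list, so a short sentence keeps more
        # of its content than it would under proportional cutting.
        if len(tokens_a) > len(tokens_b):
            tokens_a.pop()
        else:
            tokens_b.pop()


a = list("abcdefgh")  # 8 placeholder tokens
b = list("xyz")       # 3 placeholder tokens
truncate_seq_pair(a, b, max_length=7)
print(a, b)
```

Here only the longer list is shortened: `a` ends with 4 tokens and `b` keeps all 3.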
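This uses tokenize(), which splits an English sentence into words and then into sub-word pieces from the vocabulary. Inflected forms (third person, progressive tense, and so on) and rare words are thereby broken into shared pieces, although the pieces do not always line up with grammatical suffixes.
def tokenize(self, text):
  split_tokens = []
  for token in self.basic_tokenizer.tokenize(text):
    for sub_token in self.wordpiece_tokenizer.tokenize(token):
      split_tokens.append(sub_token)  # one word may yield several pieces
  return split_tokens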
Here is an example:
text_example: 'Amrozi accused his brother, whom he called "the witness", of deliberately distorting his evidence.'
tokens_a: ['am', '##ro', '##zi', 'accused', 'his', 'brother', ',', 'whom', 'he', 'called', '"', 'the', 'witness', '"', ',', 'of', 'deliberately', 'di', '##stor', '##ting', 'his', 'evidence', '.'] (converted to WordPiece form)
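How a single word ends up as pieces like 'am', '##ro', '##zi' can be illustrated with a greedy longest-match-first split. The sketch below is a simplified stand-in for the WordPiece tokenizer, and `toy_vocab` is a hypothetical vocabulary containing only the pieces needed for this example.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one lowercased word."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        current = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a non-initial piece
            if candidate in vocab:
                current = candidate
                break
            end -= 1
        if current is None:
            return [unk_token]  # nothing matched: the whole word is unknown
        pieces.append(current)
        start = end
    return pieces


# Hypothetical toy vocabulary, not the real vocab.txt.
toy_vocab = {"am", "##ro", "##zi", "di", "##stor", "##ting", "accused"}

print(wordpiece_tokenize("amrozi", toy_vocab))
print(wordpiece_tokenize("distorting", toy_vocab))
print(wordpiece_tokenize("accused", toy_vocab))
```

A word that is in the vocabulary stays whole; a word that is not is covered left to right by the longest pieces available.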
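STEP2: build the type ids: produce ***segment_ids***, which mark whether each token belongs to the first or the second sentence (the values are only 0 and 1; they are separate from the class labels, which for this dataset also happen to be 0 and 1, while other tasks may have tens of thousands of labels)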
  tokens = []
  segment_ids = []
  tokens.append("[CLS]")  # first position
  segment_ids.append(0)  # encoded as 0
  for token in tokens_a:
    tokens.append(token)
    segment_ids.append(0)
  tokens.append("[SEP]")  # add the separator
  segment_ids.append(0)  # the type_id of the whole first segment is 0
  if tokens_b:
    for token in tokens_b:
      tokens.append(token)
      segment_ids.append(1)
    # the closing separator is also 1
    tokens.append("[SEP]")
    segment_ids.append(1)  # the type_id of the whole second segment is 1
  # Map every token to its index in vocab.txt (shipped with the pre-trained
  # model) so that it can be looked up in the embedding table.
  input_ids = tokenizer.convert_tokens_to_ids(tokens)
Here is an example:
tokens: [CLS] is this jack ##son ##ville ? [SEP] no it is not . [SEP]
type_ids: 0 0 0 0 0 0 0 0 1 1 1 1 1 1
(0 marks the first sentence, 1 marks the second)
For the MRPC sentence pair tokenized in STEP1, the indices looked up in the vocabulary are:
input_ids: [101, 2572, 3217, 5831, 5496, 2010, 2567, 1010, 3183, 2002, 2170, 1000, 1996, 7409, 1000, 1010, 1997, 9969, 4487, 23809, 3436, 2010, 3350, 1012, 102, 7727, 2000, 2032, 2004, 2069, 1000, 1996, 7409, 1000, 1010, 2572, 3217, 5831, 5496, 2010, 2567, 1997, 9969, 4487, 23809, 3436, 2010, 3350, 1012, 102]
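The layout from STEP2 can be reproduced in a few lines. This is a compact sketch rather than the code from run_classifier.py; `build_inputs` is a hypothetical helper, and `toy_vocab` assigns arbitrary ids, so the numbers it yields are not real vocab.txt indices.

```python
def build_inputs(tokens_a, tokens_b, vocab):
    """Lay out [CLS] a [SEP] b [SEP] with the matching segment ids."""
    tokens = ["[CLS]"] + list(tokens_a) + ["[SEP]"]
    segment_ids = [0] * len(tokens)  # [CLS], sentence a and its [SEP]
    if tokens_b:
        tokens += list(tokens_b) + ["[SEP]"]
        segment_ids += [1] * (len(tokens_b) + 1)  # sentence b and its [SEP]
    input_ids = [vocab[t] for t in tokens]
    return tokens, segment_ids, input_ids


tokens_a = ["is", "this", "jack", "##son", "##ville", "?"]
tokens_b = ["no", "it", "is", "not", "."]
# Hypothetical vocabulary: ids are simply positions in this list.
toy_vocab = {tok: i for i, tok in
             enumerate(["[CLS]", "[SEP]"] + sorted(set(tokens_a + tokens_b)))}

tokens, segment_ids, input_ids = build_inputs(tokens_a, tokens_b, toy_vocab)
print(" ".join(tokens))
print(" ".join(str(s) for s in segment_ids))
```

The three lists always have the same length, one entry per token position.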
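STEP3: build the mask:
  # This part can be adapted to other NLP tasks to obtain different features.
  # The mask has 1 for real tokens and 0 for padding tokens. Only real
  # tokens are attended to.
  # Mask encoding: if text_a and text_b joined together are shorter than the
  # maximum length, the sequence is padded, and self-attention must be able
  # to tell that the padded positions are unimportant.
  # Here, a mask value of 1 marks a real token.
  input_mask = [1] * len(input_ids)
  # Pad all three features with 0; a 0 in input_mask marks the position as
  # padding, which self-attention relies on later.
  # Zero-pad up to the sequence length.
  while len(input_ids) < max_seq_length:
    input_ids.append(0)
    input_mask.append(0)
    segment_ids.append(0)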
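The padding step can be tried in isolation with a short sequence. In this sketch `pad_features` is a hypothetical helper and the five input ids are placeholders chosen for illustration; only the padding logic mirrors the loop above.

```python
def pad_features(input_ids, segment_ids, max_seq_length):
    """Return zero-padded copies of the ids plus the matching mask."""
    if len(input_ids) > max_seq_length:
        raise ValueError("sequence must be truncated before padding")
    input_ids = list(input_ids)
    segment_ids = list(segment_ids)
    input_mask = [1] * len(input_ids)  # 1 = real token
    pad = max_seq_length - len(input_ids)
    input_ids += [0] * pad
    input_mask += [0] * pad            # 0 = padding, to be ignored
    segment_ids += [0] * pad
    return input_ids, input_mask, segment_ids


ids, mask, segs = pad_features([101, 2010, 102, 3350, 102],
                               [0, 0, 0, 1, 1],
                               max_seq_length=8)
print(ids)
print(mask)
print(segs)
```

Note that padded positions get segment id 0, the same value as the first sentence; only input_mask distinguishes padding from real tokens.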
Next, the first five examples are logged (the code first, then the output it produces):
  label_id = label_map[example.label]
  if ex_index < 5:
    tf.logging.info("*** Example ***")
    tf.logging.info("guid: %s" % (example.guid))
    tf.logging.info("tokens: %s" % " ".join(
        [tokenization.printable_text(x) for x in tokens]))
    tf.logging.info("input_ids: %s" % " ".join([str(x) for x in input_ids]))
    tf.logging.info("input_mask: %s" % " ".join([str(x) for x in input_mask]))
    tf.logging.info("segment_ids: %s" % " ".join([str(x) for x in segment_ids]))
    tf.logging.info("label: %s (id = %d)" % (example.label, label_id))
WARNING:tensorflow:Estimator's model_fn (
) includes params argument, but params are not passed to Estimator.
INFO:tensorflow:Using config: {'_num_worker_replicas': 1, '_task_type': 'worker', '_session_config': None, '_global_id_in_cluster': 0, '_keep_checkpoint_every_n_hours': 10000, '_master': '', '_device_fn': None, '_is_chief': True, '_num_ps_replicas': 0, '_save_checkpoints_secs': None, '_log_step_count_steps': None, '_service': None, '_train_distribute': None, '_evaluation_master': '', '_tf_random_seed': None, '_keep_checkpoint_max': 5, '_cluster_spec':, '_model_dir': './tmp/mrpc_output', '_tpu_config': TPUConfig(iterations_per_loop=1000, num_shards=8, num_cores_per_replica=None, per_host_input_for_training=3, tpu_job_name=None, initial_infeed_sleep_secs=None), '_save_checkpoints_steps': 1000, '_task_id': 0, '_save_summary_steps': 100, '_cluster': None}
INFO:tensorflow:_TPUContext: eval_on_tpu True
WARNING:tensorflow:eval_on_tpu ignored because use_tpu is False.
INFO:tensorflow:Writing example 0 of 3668
INFO:tensorflow:*** Example ***
INFO:tensorflow:guid: train-1
INFO:tensorflow:tokens: [CLS] am ##ro ##zi accused his brother , whom he called " the witness " , of deliberately di ##stor ##ting his evidence . [SEP] referring to him as only " the witness " , am ##ro ##zi accused his brother of deliberately di ##stor ##ting his evidence . [SEP]
INFO:tensorflow:input_ids: 101 2572 3217 5831 5496 2010 2567 1010 3183 2002 2170 1000 1996 7409 1000 1010 1997 9969 4487 23809 3436 2010 3350 1012 102 7727 2000 2032 2004 2069 1000 1996 7409 1000 1010 2572 3217 5831 5496 2010 2567 1997 9969 4487 23809 3436 2010 3350 1012 102 0 0 0 0 0 0 0 0 0 0 0 0 0 0
INFO:tensorflow:input_mask: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
INFO:tensorflow:segment_ids: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
INFO:tensorflow:label: 1 (id = 1)
STEP4: build the feature: create one feature object, return it to file_based_convert_examples_to_features(), and write it to the TFRecord file.
  feature = InputFeatures(
      input_ids=input_ids,
      input_mask=input_mask,
      segment_ids=segment_ids,
      label_id=label_id,  # whether the pair is a paraphrase
      is_real_example=True)
  return feature
Inside file_based_convert_examples_to_features(), each field of the feature is converted to an int64 list:
def create_int_feature(values):
  f = tf.train.Feature(int64_list=tf.train.Int64List(value=list(values)))
  return f
The feature is then written to the TFRecord file. (The comments list the features produced for a classification task; other tasks such as QA need a different conversion.)
features = collections.OrderedDict()
# Convert to int features, following the TFRecord format TensorFlow expects.
# input_ids: vocabulary indices from tokenizer.convert_tokens_to_ids(tokens)
features["input_ids"] = create_int_feature(feature.input_ids)
features["input_mask"] = create_int_feature(feature.input_mask)  # mask
features["segment_ids"] = create_int_feature(feature.segment_ids)  # type_id: which sentence
features["label_ids"] = create_int_feature([feature.label_id])  # label: paraphrase or not
features["is_real_example"] = create_int_feature(
    [int(feature.is_real_example)])
# Serialize the example and write it to the TFRecord file.
tf_example = tf.train.Example(features=tf.train.Features(feature=features))
writer.write(tf_example.SerializeToString())
# Close the writer once all examples have been written.
writer.close()
Main reference: Bilibili video av76791626
URL: https://www.bilibili.com/video/av76791626?p=16