Paper: https://arxiv.org/pdf/1810.04805.pdf
Official code: https://github.com/google-research/bert (TensorFlow code and pre-trained models for BERT)
In run_squad.py, the function read_squad_examples(input_file, is_training) is what processes the train-v1.1.json file.
If you have read 【Bert】(八)简易问答系统--数据介绍及标注_mjiansun的专栏-CSDN博客, you should already have a rough idea of the format of train-v1.1.json; with that in mind, the rest of this post will be easier to follow.
paragraph_text = paragraph["context"]
doc_tokens = []  # words obtained by splitting the paragraph on whitespace
char_to_word_offset = []  # the original answer offsets are character-based; this records the character-to-word mapping
prev_is_whitespace = True  # whether the previous character was whitespace
for c in paragraph_text:
  if is_whitespace(c):
    prev_is_whitespace = True
  else:
    if prev_is_whitespace:
      doc_tokens.append(c)
    else:
      doc_tokens[-1] += c
    prev_is_whitespace = False
  char_to_word_offset.append(len(doc_tokens) - 1)
The code above splits a paragraph into words on whitespace and uses char_to_word_offset to record the character-to-word mapping: its length equals the total number of characters in the paragraph, and each entry is the index of the word that character belongs to.
Note: a whitespace character is attributed to the word that precedes it.
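To make the mapping concrete, here is a minimal standalone sketch (the toy context is my own; is_whitespace follows its definition in run_squad.py) that runs the loop above and prints both structures:

def is_whitespace(c):
  return c == " " or c == "\t" or c == "\r" or c == "\n" or ord(c) == 0x202F

context = "to the store"
doc_tokens = []
char_to_word_offset = []
prev_is_whitespace = True
for c in context:
  if is_whitespace(c):
    prev_is_whitespace = True
  else:
    if prev_is_whitespace:
      doc_tokens.append(c)
    else:
      doc_tokens[-1] += c
    prev_is_whitespace = False
  char_to_word_offset.append(len(doc_tokens) - 1)

print(doc_tokens)           # ['to', 'the', 'store']
print(char_to_word_offset)  # [0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 2]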
for qa in paragraph["qas"]:
  qas_id = qa["id"]
  question_text = qa["question"]
  start_position = None
  end_position = None
  orig_answer_text = None
  is_impossible = False
  if is_training:
    if FLAGS.version_2_with_negative:
      is_impossible = qa["is_impossible"]
    if (len(qa["answers"]) != 1) and (not is_impossible):
      raise ValueError(
          "For training, each question should have exactly 1 answer.")
    if not is_impossible:
      answer = qa["answers"][0]
      orig_answer_text = answer["text"]
      answer_offset = answer["answer_start"]
      answer_length = len(orig_answer_text)
      # map the character-based answer span to word indices
      start_position = char_to_word_offset[answer_offset]
      end_position = char_to_word_offset[answer_offset + answer_length - 1]
      # skip the example if the answer text cannot be recovered from the
      # whitespace-tokenized document
      actual_text = " ".join(
          doc_tokens[start_position:(end_position + 1)])
      cleaned_answer_text = " ".join(
          tokenization.whitespace_tokenize(orig_answer_text))
      if actual_text.find(cleaned_answer_text) == -1:
        tf.logging.warning("Could not find answer: '%s' vs. '%s'",
                           actual_text, cleaned_answer_text)
        continue
    else:
      start_position = -1
      end_position = -1
      orig_answer_text = ""
This block re-expresses the answer at word level: the character-based answer offset is converted into the word-based start_position and end_position, and together with qas_id, question_text, doc_tokens, etc. they are packed into a SquadExample object.
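As a hypothetical continuation of the toy context above, suppose the annotated answer is "the" with answer_start = 3; the character-based offset is converted to word indices like this:

orig_answer_text = "the"
answer_offset = 3                       # character offset of "the" in "to the store"
answer_length = len(orig_answer_text)   # 3
start_position = char_to_word_offset[answer_offset]                    # 1
end_position = char_to_word_offset[answer_offset + answer_length - 1]  # 1
# doc_tokens[1:2] == ['the'], so the answer is now addressed by word index.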
The entry point is:
def convert_examples_to_features(examples, tokenizer, max_seq_length,
                                 doc_stride, max_query_length, is_training,
                                 output_fn):
  """Loads a data file into a list of `InputBatch`s."""
The tokenizer used below is covered in 【Bert】(四)句子关系判断--源码解析(解析数据)_mjiansun的专栏-CSDN博客.
query_tokens = tokenizer.tokenize(example.question_text)

# max_query_length is the maximum length of the question; here it is set to 64
if len(query_tokens) > max_query_length:
  query_tokens = query_tokens[0:max_query_length]
The question is run through the tokenizer (full WordPiece tokenization) and then truncated to max_query_length, set to 64 here.
tok_to_orig_index = []
orig_to_tok_index = []
all_doc_tokens = []
for (i, token) in enumerate(example.doc_tokens):
  orig_to_tok_index.append(len(all_doc_tokens))
  sub_tokens = tokenizer.tokenize(token)
  for sub_token in sub_tokens:
    tok_to_orig_index.append(i)
    all_doc_tokens.append(sub_token)
Some words do not exist in the vocab, so each original word is split into sub-words and symbols that do exist in the vocab; we call these the sub-tokens. See 【Bert】(四)句子关系判断--源码解析(解析数据)_mjiansun的专栏-CSDN博客 for how the tokenizer does this.
The three variables are:
all_doc_tokens: length n; the list of sub-tokens produced by running the tokenizer over every word of the passage.
tok_to_orig_index: length n; for each sub-token, the index of the original word it came from in example.doc_tokens.
orig_to_tok_index: length m, with m <= n and m equal to the number of original words; for each original word, the index of its first sub-token in all_doc_tokens.
For example, for
menzies called a conference
all_doc_tokens is ['men', '##zie', '##s', 'called', 'a', 'conference']
tok_to_orig_index is [0, 0, 0, 1, 2, 3]
orig_to_tok_index is [0, 3, 4, 5]
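A minimal sketch of how these three lists are built (toy_wordpiece below is a hypothetical stand-in for tokenizer.tokenize, hard-coded to reproduce the example above):

# Hypothetical stand-in for tokenizer.tokenize; a real WordPiece tokenizer
# consults the vocab instead of a hard-coded table.
def toy_wordpiece(word):
  table = {"menzies": ["men", "##zie", "##s"]}
  return table.get(word, [word])

doc_tokens = "menzies called a conference".split()
tok_to_orig_index = []
orig_to_tok_index = []
all_doc_tokens = []
for i, token in enumerate(doc_tokens):
  orig_to_tok_index.append(len(all_doc_tokens))   # where this word's sub-tokens start
  for sub_token in toy_wordpiece(token):
    tok_to_orig_index.append(i)                   # which word this sub-token came from
    all_doc_tokens.append(sub_token)

print(all_doc_tokens)     # ['men', '##zie', '##s', 'called', 'a', 'conference']
print(tok_to_orig_index)  # [0, 0, 0, 1, 2, 3]
print(orig_to_tok_index)  # [0, 3, 4, 5]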
tok_start_position = None
tok_end_position = None
if is_training and example.is_impossible:
  tok_start_position = -1
  tok_end_position = -1
if is_training and not example.is_impossible:
  tok_start_position = orig_to_tok_index[example.start_position]
  if example.end_position < len(example.doc_tokens) - 1:
    tok_end_position = orig_to_tok_index[example.end_position + 1] - 1
  else:
    tok_end_position = len(all_doc_tokens) - 1
  (tok_start_position, tok_end_position) = _improve_answer_span(
      all_doc_tokens, tok_start_position, tok_end_position, tokenizer,
      example.orig_answer_text)
This line is worth a closer look:
tok_end_position = orig_to_tok_index[example.end_position + 1] - 1
Following the description above, reading orig_to_tok_index at the answer's end position should already give an index into the new sub-token list all_doc_tokens, so why the +1 and -1?
Because the last word of the answer may itself have been split into several sub-tokens, and orig_to_tok_index only records where each word starts, not where it ends. So we look up the start position of the word after the answer's last word and subtract 1, which yields the true end position in all_doc_tokens.
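A quick check with the menzies example (the answer span here is my own toy choice): suppose the answer ends at word 0, "menzies".

# Reusing orig_to_tok_index = [0, 3, 4, 5] from the example above.
end_position = 0                                            # answer ends at the word "menzies"
wrong_end = orig_to_tok_index[end_position]                 # 0 -> 'men' (only the start of the word)
tok_end_position = orig_to_tok_index[end_position + 1] - 1  # 3 - 1 = 2 -> '##s', the true end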
(tok_start_position, tok_end_position) = _improve_answer_span(
    all_doc_tokens, tok_start_position, tok_end_position, tokenizer,
    example.orig_answer_text)
This call handles the case where the annotated answer is only a fragment of a word, which after WordPiece tokenization would otherwise not match anything in the passage.
For example:
# However, this is not always possible. Consider the following:
#
# Question: What country is the top exporter of electornics?
# Context: The Japanese electronics industry is the lagest in the world.
# Answer: Japan
#
# In this case, the annotator chose "Japan" as a character sub-span of
# the word "Japanese". Since our WordPiece tokenizer does not split
# "Japanese", we just use "Japanese" as the annotation. This is fairly rare
# in SQuAD, but does happen.
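A sketch of what the span-improvement search does (paraphrased, not copied verbatim; see _improve_answer_span in run_squad.py for the actual code): it re-tokenizes the annotated answer text and scans the current span for a sub-token span whose joined text matches it exactly, keeping the original span if nothing matches.

def improve_answer_span_sketch(doc_tokens, input_start, input_end, tokenizer,
                               orig_answer_text):
  # Re-tokenize the annotated answer and search every sub-span of
  # [input_start, input_end] for an exact match.
  tok_answer_text = " ".join(tokenizer.tokenize(orig_answer_text))
  for new_start in range(input_start, input_end + 1):
    for new_end in range(input_end, new_start - 1, -1):
      text_span = " ".join(doc_tokens[new_start:(new_end + 1)])
      if text_span == tok_answer_text:
        return (new_start, new_end)
  # Fall back to the original (word-aligned) span, e.g. "Japanese" for "Japan".
  return (input_start, input_end)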
# The -3 accounts for [CLS], [SEP] and [SEP]
max_tokens_for_doc = max_seq_length - len(query_tokens) - 3
The code above says: with max_seq_length set to 384, the budget left for the passage is 384 minus the question length minus 3, because 3 positions are reserved for the special tokens [CLS], [SEP] and [SEP]: [CLS] aggregates information over the whole sequence, the first [SEP] closes the first segment (the question), and the second [SEP] closes the second segment (the passage).
_DocSpan = collections.namedtuple(  # pylint: disable=invalid-name
    "DocSpan", ["start", "length"])
doc_spans = []
start_offset = 0
while start_offset < len(all_doc_tokens):
  length = len(all_doc_tokens) - start_offset
  if length > max_tokens_for_doc:
    length = max_tokens_for_doc
  doc_spans.append(_DocSpan(start=start_offset, length=length))
  if start_offset + length == len(all_doc_tokens):
    break
  start_offset += min(length, doc_stride)
The code above handles passages that exceed the limit: when the passage is longer than max_tokens_for_doc (384 minus the question length minus 3), it is cut into overlapping spans with a sliding window whose stride is doc_stride, set to 128 here.
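A standalone sketch of the windowing loop with the numbers used further below (430 passage sub-tokens, an 11-token question, doc_stride = 128; all of these numbers are illustrative):

import collections

_DocSpan = collections.namedtuple("DocSpan", ["start", "length"])

max_seq_length = 384
query_len = 11
doc_stride = 128
num_doc_tokens = 430                                  # len(all_doc_tokens) in this toy case
max_tokens_for_doc = max_seq_length - query_len - 3   # 370

doc_spans = []
start_offset = 0
while start_offset < num_doc_tokens:
  length = min(num_doc_tokens - start_offset, max_tokens_for_doc)
  doc_spans.append(_DocSpan(start=start_offset, length=length))
  if start_offset + length == num_doc_tokens:
    break
  start_offset += min(length, doc_stride)

print(doc_spans)  # [DocSpan(start=0, length=370), DocSpan(start=128, length=302)]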
for (doc_span_index, doc_span) in enumerate(doc_spans):
  # Add the question tokens
  tokens = []
  token_to_orig_map = {}
  token_is_max_context = {}
  segment_ids = []
  tokens.append("[CLS]")
  segment_ids.append(0)
  for token in query_tokens:
    tokens.append(token)
    segment_ids.append(0)
  tokens.append("[SEP]")
  segment_ids.append(0)

  # Add the passage tokens of the current doc span
  for i in range(doc_span.length):
    split_token_index = doc_span.start + i
    token_to_orig_map[len(tokens)] = tok_to_orig_index[split_token_index]

    is_max_context = _check_is_max_context(doc_spans, doc_span_index,
                                           split_token_index)
    token_is_max_context[len(tokens)] = is_max_context
    tokens.append(all_doc_tokens[split_token_index])
    segment_ids.append(1)
  tokens.append("[SEP]")
  segment_ids.append(1)
The code above appends the question tokens first and then the passage tokens of the current span.
tokens: list, the final token sequence, not yet padded.
token_to_orig_map: dict; the key is a position in tokens, the value is the index of the corresponding original word in the example.
token_is_max_context: dict; the key is a position in tokens, the value is whether the token at that position should be attributed to the current sequence (i.e. whether this span gives it maximum context).
A few notes on token_is_max_context:
1) Long passages require a sliding window
The input sequence length is 384, but question + passage can easily exceed 384, in which case the passage is cut into overlapping spans with a sliding window. Suppose the passage has 430 sub-tokens, the question has 11 and the stride is 128. The question + passage pair then yields 2 input sequences: the passage part of the first is (384 - 3) - 11 = 370 sub-tokens, and the second span starts 128 sub-tokens later and runs to the end of the passage, i.e. min(370, 430 - 128) = 302 sub-tokens. Since the second span reaches the end of the passage, no further spans are created. The pair is therefore split as:
Sequence 1: 384
Sequence 2: 302 + 11 + 3 + padding
2) Which span does a repeated token belong to?
Because of the stride, some sub-tokens appear in more than one span. Which span should such a repeated token be attributed to?
The token-ownership question is answered by the example below; the comments make it easy to follow.
# Because of the sliding window approach taken to scoring documents, a single
# token can appear in multiple documents. E.g.
#  Doc: the man went to the store and bought a gallon of milk
#  Span A: the man went to the
#  Span B: to the store and bought
#  Span C: and bought a gallon of
#  ...
#
# Now the word 'bought' will have two scores from spans B and C. We only
# want to consider the score with "maximum context", which we define as
# the *minimum* of its left and right context (the *sum* of left and
# right context will always be the same, of course).
#
# In the example the maximum context for 'bought' would be span C since
# it has 1 left context and 3 right context, while span B has 4 left context
# and 0 right context.
In other words: for a target token, take its distances to the left and right ends of the span (its left and right context) and use the smaller of the two as the token's score in that span. Computed this way, the token gets one score for every span it appears in, and it is considered to belong to the span in which its score is highest.
In essence, the closer a token sits to the middle of a span, the higher its score, and the stronger its claim to belong to that span.
3) The code behind the logic
The function below decides whether the current span is the one a given token should be attributed to.
def _check_is_max_context(doc_spans, cur_span_index, position):
  """Check if this is the 'max context' doc span for the token."""
  best_score = None
  best_span_index = None
  for (span_index, doc_span) in enumerate(doc_spans):
    end = doc_span.start + doc_span.length - 1
    if position < doc_span.start:
      continue
    if position > end:
      continue
    num_left_context = position - doc_span.start
    num_right_context = end - position
    score = min(num_left_context, num_right_context) + 0.01 * doc_span.length
    if best_score is None or score > best_score:
      best_score = score
      best_span_index = span_index

  return cur_span_index == best_span_index
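A quick sanity check on the 'bought' example from the comment above (the span boundaries below are my own toy reconstruction, with 'bought' at position 7 of the document):

import collections
_DocSpan = collections.namedtuple("DocSpan", ["start", "length"])

doc_spans = [_DocSpan(start=0, length=5),   # Span A: the man went to the
             _DocSpan(start=3, length=5),   # Span B: to the store and bought
             _DocSpan(start=6, length=5)]   # Span C: and bought a gallon of

print(_check_is_max_context(doc_spans, 1, 7))  # False: in span B, 'bought' scores min(4, 0) + 0.05
print(_check_is_max_context(doc_spans, 2, 7))  # True:  in span C, 'bought' scores min(1, 3) + 0.05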
segment_ids: list, the same length as tokens (still unpadded); 0 marks the question part and 1 marks the passage part.
input_ids = tokenizer.convert_tokens_to_ids(tokens)
convert_tokens_to_ids maps each token to its index in the vocabulary, turning the token list into the id sequence that represents the sentence.
input_mask = [1] * len(input_ids)
input_mask: marks which positions are real tokens (1) and which are padding (0).
# Zero-pad up to the sequence length.
while len(input_ids) < max_seq_length:
  input_ids.append(0)
  input_mask.append(0)
  segment_ids.append(0)
The three lists are zero-padded to the fixed sequence length, 384 here.
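A toy end-to-end view of the three arrays (entirely made-up tokens and vocab ids, with max_seq_length shrunk to 12 for readability):

tokens      = ["[CLS]", "what", "year", "[SEP]", "it", "was", "19", "##82", "[SEP]"]
input_ids   = [101, 2054, 2095, 102, 2009, 2001, 2539, 20842, 102]   # made-up ids
segment_ids = [0, 0, 0, 0, 1, 1, 1, 1, 1]
input_mask  = [1] * len(input_ids)

max_seq_length = 12
while len(input_ids) < max_seq_length:
  input_ids.append(0)
  input_mask.append(0)
  segment_ids.append(0)

# input_ids   -> [101, 2054, 2095, 102, 2009, 2001, 2539, 20842, 102, 0, 0, 0]
# input_mask  -> [1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0]
# segment_ids -> [0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0]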
start_position = None
end_position = None
if is_training and not example.is_impossible:
  # For training, if our document chunk does not contain an annotation
  # we throw it out, since there is nothing to predict.
  doc_start = doc_span.start
  doc_end = doc_span.start + doc_span.length - 1
  out_of_span = False
  if not (tok_start_position >= doc_start and
          tok_end_position <= doc_end):
    out_of_span = True
  if out_of_span:
    start_position = 0
    end_position = 0
  else:
    doc_offset = len(query_tokens) + 2  # +2 for the [CLS] and [SEP] before the passage
    start_position = tok_start_position - doc_start + doc_offset
    end_position = tok_end_position - doc_start + doc_offset
This determines the answer's start and end positions within tokens; if the answer does not fall inside the current doc span, both positions are set to 0.
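Continuing the toy numbers from the sliding-window sketch (11 question tokens, spans starting at sub-token 0 and 128), suppose the answer occupies passage sub-tokens 150..153 (hypothetical positions):

tok_start_position, tok_end_position = 150, 153
doc_offset = 11 + 2                        # [CLS] + 11 question tokens + [SEP]

# Span 1: doc_start = 0, so the answer lands at
start_position = 150 - 0 + doc_offset      # 163
end_position = 153 - 0 + doc_offset        # 166

# Span 2: doc_start = 128, so the same answer lands at
start_position = 150 - 128 + doc_offset    # 35
end_position = 153 - 128 + doc_offset      # 38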
feature = InputFeatures(
    unique_id=unique_id,
    example_index=example_index,
    doc_span_index=doc_span_index,
    tokens=tokens,
    token_to_orig_map=token_to_orig_map,
    token_is_max_context=token_is_max_context,
    input_ids=input_ids,
    input_mask=input_mask,
    segment_ids=segment_ids,
    start_position=start_position,
    end_position=end_position,
    is_impossible=example.is_impossible)

# Run callback
output_fn(feature)
At this point we have the data in the form BERT needs.
Here output_fn writes each feature to a TFRecord file.
def process_feature(self, feature):
  """Write a InputFeature to the TFRecordWriter as a tf.train.Example."""
  self.num_features += 1

  def create_int_feature(values):
    feature = tf.train.Feature(
        int64_list=tf.train.Int64List(value=list(values)))
    return feature

  features = collections.OrderedDict()
  features["unique_ids"] = create_int_feature([feature.unique_id])
  features["input_ids"] = create_int_feature(feature.input_ids)
  features["input_mask"] = create_int_feature(feature.input_mask)
  features["segment_ids"] = create_int_feature(feature.segment_ids)

  if self.is_training:
    features["start_positions"] = create_int_feature([feature.start_position])
    features["end_positions"] = create_int_feature([feature.end_position])
    impossible = 0
    if feature.is_impossible:
      impossible = 1
    features["is_impossible"] = create_int_feature([impossible])

  tf_example = tf.train.Example(features=tf.train.Features(feature=features))
  self._writer.write(tf_example.SerializeToString())
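For completeness, a minimal sketch of how records written this way can be read back (my own reader under TF 1.x assumptions; run_squad.py's input_fn_builder does essentially this with more plumbing, and "train.tf_record" is a hypothetical file name):

import tensorflow as tf

seq_length = 384  # must match max_seq_length used when the features were written
name_to_features = {
    "unique_ids": tf.FixedLenFeature([], tf.int64),
    "input_ids": tf.FixedLenFeature([seq_length], tf.int64),
    "input_mask": tf.FixedLenFeature([seq_length], tf.int64),
    "segment_ids": tf.FixedLenFeature([seq_length], tf.int64),
    # The next three are only present in training records.
    "start_positions": tf.FixedLenFeature([], tf.int64),
    "end_positions": tf.FixedLenFeature([], tf.int64),
    "is_impossible": tf.FixedLenFeature([], tf.int64),
}

def parse_record(serialized):
  # Decode one serialized tf.train.Example back into a dict of tensors.
  return tf.parse_single_example(serialized, name_to_features)

dataset = tf.data.TFRecordDataset("train.tf_record")
dataset = dataset.map(parse_record).batch(32)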