【Bert】(10) Simple Question Answering System -- Data Parsing

Paper: https://arxiv.org/pdf/1810.04805.pdf

Official code: GitHub - google-research/bert: TensorFlow code and pre-trained models for BERT

1. Reading the Data

In run_squad.py, the read_squad_examples(input_file, is_training) function is responsible for processing the train-v1.1.json file.

If you have read the earlier post 【Bert】(8) Simple Question Answering System -- Data Introduction and Annotation (mjiansun's CSDN blog), you should already have a rough idea of the format of train-v1.1.json; with that in mind, the content below will be easier to follow.

      paragraph_text = paragraph["context"]
      doc_tokens = []  # words obtained by splitting the paragraph on whitespace
      char_to_word_offset = []  # the raw answer start is a character offset, so record which word each character belongs to
      prev_is_whitespace = True  # whether the previous character was whitespace
      for c in paragraph_text:
        if is_whitespace(c):
          prev_is_whitespace = True
        else:
          if prev_is_whitespace:
            doc_tokens.append(c)
          else:
            doc_tokens[-1] += c
          prev_is_whitespace = False
        char_to_word_offset.append(len(doc_tokens) - 1)

The code above splits the paragraph into words on whitespace and uses char_to_word_offset to record the character-to-word correspondence: the array is as long as the paragraph has characters, and each entry is the index of the word that character belongs to.

Note: a whitespace character is counted as belonging to the word before it.
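
A minimal illustration of what this loop produces (the string below is a toy example, not from the dataset):

paragraph_text = "to the store"
# doc_tokens          -> ['to', 'the', 'store']
# char_to_word_offset -> [0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 2]
# (the space after "to" maps to word 0 and the space after "the" maps to word 1)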


      for qa in paragraph["qas"]:
        qas_id = qa["id"]
        question_text = qa["question"]
        start_position = None
        end_position = None
        orig_answer_text = None
        is_impossible = False
        if is_training:

          if FLAGS.version_2_with_negative:
            is_impossible = qa["is_impossible"]
          if (len(qa["answers"]) != 1) and (not is_impossible):
            raise ValueError(
                "For training, each question should have exactly 1 answer.")
          if not is_impossible:
            answer = qa["answers"][0]
            orig_answer_text = answer["text"]
            answer_offset = answer["answer_start"]
            answer_length = len(orig_answer_text)
            start_position = char_to_word_offset[answer_offset]
            end_position = char_to_word_offset[answer_offset + answer_length -
                                               1]
            actual_text = " ".join(
                doc_tokens[start_position:(end_position + 1)])
            cleaned_answer_text = " ".join(
                tokenization.whitespace_tokenize(orig_answer_text))
            if actual_text.find(cleaned_answer_text) == -1:
              tf.logging.warning("Could not find answer: '%s' vs. '%s'",
                                 actual_text, cleaned_answer_text)
              continue
          else:
            start_position = -1
            end_position = -1
            orig_answer_text = ""

This block converts the answer's start and end positions from character offsets into word-level positions (indices into doc_tokens). Together with qas_id, question_text, doc_tokens, orig_answer_text and is_impossible, these fields make up the example objects used in the next stage.
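
Continuing the toy paragraph from above (numbers chosen for illustration, not from train-v1.1.json), suppose the answer is "the store" with answer_start = 3:

# answer_offset = 3, answer_length = len("the store") = 9
# start_position = char_to_word_offset[3]          -> 1  (doc_tokens[1] == 'the')
# end_position   = char_to_word_offset[3 + 9 - 1]  -> 2  (doc_tokens[2] == 'store')
# actual_text    = " ".join(doc_tokens[1:3])       -> "the store"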

2. Data Processing

The entry point is the following function:

def convert_examples_to_features(examples, tokenizer, max_seq_length,
                                 doc_stride, max_query_length, is_training,
                                 output_fn):
  """Loads a data file into a list of `InputBatch`s."""

For the tokenizer used below, see 【Bert】(4) Sentence Relationship Judgment -- Source Code Analysis (Parsing Data) (mjiansun's CSDN blog).


2.1 Limiting the Length of the Question

    query_tokens = tokenizer.tokenize(example.question_text)
    # max_query_length is the maximum allowed question length, set to 64 here
    if len(query_tokens) > max_query_length:
      query_tokens = query_tokens[0:max_query_length]

The question is fully tokenized with the tokenizer, and the tokenized question is truncated to max_query_length, which is set to 64 here.

2.2 Splitting the Paragraph into a Token List with the Tokenizer

    tok_to_orig_index = []
    orig_to_tok_index = []
    all_doc_tokens = []
    for (i, token) in enumerate(example.doc_tokens):
      orig_to_tok_index.append(len(all_doc_tokens))
      sub_tokens = tokenizer.tokenize(token)
      for sub_token in sub_tokens:
        tok_to_orig_index.append(i)
        all_doc_tokens.append(sub_token)

Since some words do not exist in the vocabulary, each original word has to be split into sub-words and symbols that do exist in the vocab; these are the sub-tokens referred to below. For details see 【Bert】(4) Sentence Relationship Judgment -- Source Code Analysis (Parsing Data) (mjiansun's CSDN blog).

The three variables deserve a closer look:

all_doc_tokens: length n; the list of sub-tokens (words and word pieces) produced by running the tokenizer over each original word.

tok_to_orig_index: length n; for each sub-token, the index of the original word in the example that it came from.

orig_to_tok_index: length m, with m ≤ n and m equal to the number of original words in the example; for each original word, the position in all_doc_tokens where its first sub-token starts.

For example, given

menzies called a conference

all_doc_tokens is ['men', '##zie', '##s', 'called', 'a', 'conference']

tok_to_orig_index is [0, 0, 0, 1, 2, 3]

orig_to_tok_index is [0, 3, 4, 5]

2.3 Converting the Start and End Positions

    tok_start_position = None
    tok_end_position = None
    if is_training and example.is_impossible:
      tok_start_position = -1
      tok_end_position = -1
    if is_training and not example.is_impossible:
      tok_start_position = orig_to_tok_index[example.start_position]
      if example.end_position < len(example.doc_tokens) - 1:
        tok_end_position = orig_to_tok_index[example.end_position + 1] - 1
      else:
        tok_end_position = len(all_doc_tokens) - 1
      (tok_start_position, tok_end_position) = _improve_answer_span(
          all_doc_tokens, tok_start_position, tok_end_position, tokenizer,
          example.orig_answer_text)

This line is worth a closer look:

tok_end_position = orig_to_tok_index[example.end_position + 1] - 1

Based on the discussion above, it seems we could simply look up orig_to_tok_index at the end position to get the index into the new all_doc_tokens list, so why the +1 and the -1?

This handles the case where the last original word of the answer was split into several sub-tokens. orig_to_tok_index only records where each word starts, not where it ends, so we take the start position of the *next* original word and subtract 1, which gives the index of the last sub-token of the answer's final word.
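
Using the 'menzies called a conference' example from section 2.2 (a walk-through with made-up positions), suppose the answer ends at the original word "menzies", i.e. example.end_position = 0:

# orig_to_tok_index = [0, 3, 4, 5]
# tok_end_position  = orig_to_tok_index[0 + 1] - 1 = 3 - 1 = 2
# all_doc_tokens[2] is '##s', the last sub-token of "menzies";
# using orig_to_tok_index[0] = 0 directly would point at 'men' and drop '##zie' and '##s'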


      (tok_start_position, tok_end_position) = _improve_answer_span(
          all_doc_tokens, tok_start_position, tok_end_position, tokenizer,
          example.orig_answer_text)

This call handles the case where the annotated answer is only a fragment of a whitespace-delimited word: after WordPiece tokenization there may be a tighter sub-token span that exactly matches the answer text, and _improve_answer_span looks for it.

For example (this comment from run_squad.py shows a case where no tighter match is possible):

# However, this is not always possible. Consider the following:
#
#   Question: What country is the top exporter of electornics?
#   Context: The Japanese electronics industry is the lagest in the world.
#   Answer: Japan
#
# In this case, the annotator chose "Japan" as a character sub-span of
# the word "Japanese". Since our WordPiece tokenizer does not split
# "Japanese", we just use "Japanese" as the annotation. This is fairly rare
# in SQuAD, but does happen.
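
For completeness, the body of _improve_answer_span in run_squad.py (its long explanatory comment stripped here) is essentially a nested search over sub-spans of the original span for one whose joined text exactly matches the re-tokenized answer; if none is found, the original span is returned:

def _improve_answer_span(doc_tokens, input_start, input_end, tokenizer,
                         orig_answer_text):
  """Returns tokenized answer spans that better match the annotated answer."""
  tok_answer_text = " ".join(tokenizer.tokenize(orig_answer_text))

  for new_start in range(input_start, input_end + 1):
    for new_end in range(input_end, new_start - 1, -1):
      text_span = " ".join(doc_tokens[new_start:(new_end + 1)])
      if text_span == tok_answer_text:
        return (new_start, new_end)

  return (input_start, input_end)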

2.4 Handling Paragraphs That Exceed the Length Limit

    # The -3 accounts for [CLS], [SEP] and [SEP]
    max_tokens_for_doc = max_seq_length - len(query_tokens) - 3

The code above means: if the maximum sequence length is 384, three positions must be reserved for the special tokens [CLS], [SEP] and [SEP]. [CLS] aggregates information about the whole sequence, the first [SEP] marks the end of the first segment (the question), and the second [SEP] marks the end of the second segment (the paragraph). The question length is subtracted as well, since the question occupies the first segment, so the paragraph can use at most max_seq_length - len(query_tokens) - 3 tokens.
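
Each input sequence built in section 2.5 below therefore has the following layout (q = question token, d = paragraph token):

# tokens:       [CLS]  q_1 ... q_k  [SEP]  d_1 ... d_m  [SEP]  [pad] ... [pad]
# segment_ids:    0     0  ...  0     0     1  ...  1     1      0   ...   0
# with k <= max_query_length (64) and m <= max_tokens_for_doc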


    _DocSpan = collections.namedtuple(  # pylint: disable=invalid-name
        "DocSpan", ["start", "length"])
    doc_spans = []
    start_offset = 0
    while start_offset < len(all_doc_tokens):
      length = len(all_doc_tokens) - start_offset
      if length > max_tokens_for_doc:
        length = max_tokens_for_doc
      doc_spans.append(_DocSpan(start=start_offset, length=length))
      if start_offset + length == len(all_doc_tokens):
        break
      start_offset += min(length, doc_stride)

The block above handles paragraphs that exceed the limit: when the tokenized paragraph is longer than max_tokens_for_doc (384 minus the question length minus 3 in this setup), it is cut into overlapping windows with a sliding window whose stride is doc_stride, set to 128 here.
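
As a concrete walk-through of the loop above (numbers chosen for illustration), let len(all_doc_tokens) = 430, max_tokens_for_doc = 370 and doc_stride = 128:

# 1st pass: start_offset = 0,   length = min(430 - 0, 370)   = 370  -> DocSpan(start=0,   length=370)
# 2nd pass: start_offset = 128, length = min(430 - 128, 370) = 302  -> DocSpan(start=128, length=302)
# 128 + 302 == 430, so the loop stops; tokens 128..369 appear in both windows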

2.5 Building the Input Data for BERT

    for (doc_span_index, doc_span) in enumerate(doc_spans):
      # add the question tokens
      tokens = []
      token_to_orig_map = {}
      token_is_max_context = {}
      segment_ids = []
      tokens.append("[CLS]")
      segment_ids.append(0)
      for token in query_tokens:
        tokens.append(token)
        segment_ids.append(0)
      tokens.append("[SEP]")
      segment_ids.append(0)

      # add the paragraph tokens for this doc span
      for i in range(doc_span.length):
        split_token_index = doc_span.start + i
        token_to_orig_map[len(tokens)] = tok_to_orig_index[split_token_index]

        is_max_context = _check_is_max_context(doc_spans, doc_span_index,
                                               split_token_index)
        token_is_max_context[len(tokens)] = is_max_context
        tokens.append(all_doc_tokens[split_token_index])
        segment_ids.append(1)
      tokens.append("[SEP]")
      segment_ids.append(1)

The block above appends the question tokens and then the paragraph tokens for the current window.

tokens: a list, the complete token sequence for this window, not yet padded.
token_to_orig_map: a dict mapping a position in tokens to the index of the corresponding original word in the example.
token_is_max_context: a dict mapping a position in tokens to whether the token at that position should be attributed to the current window (i.e. whether this window gives it "maximum context").

A few notes on token_is_max_context:

1) Long paragraphs require a sliding window

As described earlier, the input sequence length is 384, but the question plus the paragraph may well exceed 384 tokens; in that case the paragraph is cut with a sliding window. Suppose the paragraph has 430 tokens, the question has 11 tokens and the stride is 128. The paragraph portion of the first sequence holds (384 - 3) - 11 = 370 tokens; the second window starts one stride in, at offset 128, and holds the remaining 430 - 128 = 302 tokens, which already reaches the end of the paragraph, so no further windows are needed. The question + paragraph is therefore split into two sequences:

Sequence 1: 370 + 11 + 3 = 384 tokens

Sequence 2: 302 + 11 + 3 = 316 tokens, padded up to 384

2) Which window does a repeated token belong to?

This raises a question: because of the sliding-window stride, some tokens appear in both windows, so does such a repeated token belong to sequence 1 or to sequence 2?

The attribution rule is easiest to see from the example below, quoted from the comments in run_squad.py.
  # Because of the sliding window approach taken to scoring documents, a single
  # token can appear in multiple documents. E.g.
  #  Doc: the man went to the store and bought a gallon of milk
  #  Span A: the man went to the
  #  Span B: to the store and bought
  #  Span C: and bought a gallon of
  #  ...
  #
  # Now the word 'bought' will have two scores from spans B and C. We only
  # want to consider the score with "maximum context", which we define as
  # the *minimum* of its left and right context (the *sum* of left and
  # right context will always be the same, of course).

  # In the example the maximum context for 'bought' would be span C since
  # it has 1 left context and 3 right context, while span B has 4 left context
  # and 0 right context.

For a target token, take its distance to the left edge of the window (left context) and to the right edge (right context), and use the smaller of the two as the token's score in that window. The token then has one score for every window that contains it, and the window in which its score is largest is the one the token is considered to belong to.

In essence, the closer a token sits to the middle of a window, the higher its score, i.e. the more strongly that window is considered to own the token.

3) The code that implements this logic

The code below decides whether a given token should be attributed to the current window.

def _check_is_max_context(doc_spans, cur_span_index, position):
  """Check if this is the 'max context' doc span for the token."""

  best_score = None
  best_span_index = None
  for (span_index, doc_span) in enumerate(doc_spans):
    end = doc_span.start + doc_span.length - 1
    if position < doc_span.start:
      continue
    if position > end:
      continue
    num_left_context = position - doc_span.start
    num_right_context = end - position
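    # min(left, right) is the context score; the 0.01 * doc_span.length term only nudges the
    # choice toward longer spans when two windows give the token the same amount of context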
    score = min(num_left_context, num_right_context) + 0.01 * doc_span.length
    if best_score is None or score > best_score:
      best_score = score
      best_span_index = span_index

  return cur_span_index == best_span_index
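
Applying the function above to the 'bought' example from the comment (the doc_spans below are my reconstruction of that toy example):

import collections

_DocSpan = collections.namedtuple("DocSpan", ["start", "length"])

# Doc: the man went to the store and bought a gallon of milk   ('bought' is token 7)
doc_spans = [
    _DocSpan(start=0, length=5),  # Span A: the man went to the
    _DocSpan(start=3, length=5),  # Span B: to the store and bought
    _DocSpan(start=6, length=5),  # Span C: and bought a gallon of
]
print(_check_is_max_context(doc_spans, 1, 7))  # False: in span B 'bought' has 4 left / 0 right context
print(_check_is_max_context(doc_spans, 2, 7))  # True:  in span C it has 1 left / 3 right context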

segment_ids: a list with the same length as tokens (not yet padded); 0 marks the question part and 1 marks the paragraph part.


input_ids = tokenizer.convert_tokens_to_ids(tokens)

Each token is mapped to its index in the vocabulary, producing the list of ids that represents the sequence.
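
For example (assuming the released uncased English vocab.txt; the ids differ for other vocabularies):

print(tokenizer.convert_tokens_to_ids(["[CLS]", "[SEP]", "[PAD]"]))  # -> [101, 102, 0]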


input_mask = [1] * len(input_ids)

input_mask: marks which positions are real tokens and which are padding; at this point there is no padding yet, so it is all 1s.


      # Zero-pad up to the sequence length.
      while len(input_ids) < max_seq_length:
        input_ids.append(0)
        input_mask.append(0)
        segment_ids.append(0)

Everything is then zero-padded up to the specified sequence length, 384 here.


      start_position = None
      end_position = None
      if is_training and not example.is_impossible:
        # For training, if our document chunk does not contain an annotation
        # we throw it out, since there is nothing to predict.
        doc_start = doc_span.start
        doc_end = doc_span.start + doc_span.length - 1
        out_of_span = False
        if not (tok_start_position >= doc_start and
                tok_end_position <= doc_end):
          out_of_span = True
        if out_of_span:
          start_position = 0
          end_position = 0
        else:
          doc_offset = len(query_tokens) + 2  # +2 accounts for [CLS] and the first [SEP]
          start_position = tok_start_position - doc_start + doc_offset
          end_position = tok_end_position - doc_start + doc_offset

This determines the answer's start and end positions inside tokens for the current window; if the answer does not fall inside this window, both positions are set to 0.
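
A small worked example (numbers chosen for illustration): with 11 query tokens, doc_offset = 11 + 2 = 13; if the current window starts at doc_start = 128 and the answer covers sub-tokens 150..153 of all_doc_tokens, then:

# start_position = 150 - 128 + 13 = 35
# end_position   = 153 - 128 + 13 = 38
# i.e. the answer sits at positions 35..38 of tokens, where positions 0..12 hold
# [CLS], the 11 question tokens and the first [SEP]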


      feature = InputFeatures(
          unique_id=unique_id,
          example_index=example_index,
          doc_span_index=doc_span_index,
          tokens=tokens,
          token_to_orig_map=token_to_orig_map,
          token_is_max_context=token_is_max_context,
          input_ids=input_ids,
          input_mask=input_mask,
          segment_ids=segment_ids,
          start_position=start_position,
          end_position=end_position,
          is_impossible=example.is_impossible)

      # Run callback
      output_fn(feature)

At this point the data is in exactly the form BERT needs.

Here output_fn writes each feature to a TFRecord file; it is the process_feature method shown below.

  def process_feature(self, feature):
    """Write a InputFeature to the TFRecordWriter as a tf.train.Example."""
    self.num_features += 1

    def create_int_feature(values):
      feature = tf.train.Feature(
          int64_list=tf.train.Int64List(value=list(values)))
      return feature

    features = collections.OrderedDict()
    features["unique_ids"] = create_int_feature([feature.unique_id])
    features["input_ids"] = create_int_feature(feature.input_ids)
    features["input_mask"] = create_int_feature(feature.input_mask)
    features["segment_ids"] = create_int_feature(feature.segment_ids)

    if self.is_training:
      features["start_positions"] = create_int_feature([feature.start_position])
      features["end_positions"] = create_int_feature([feature.end_position])
      impossible = 0
      if feature.is_impossible:
        impossible = 1
      features["is_impossible"] = create_int_feature([impossible])

    tf_example = tf.train.Example(features=tf.train.Features(feature=features))
    self._writer.write(tf_example.SerializeToString())
