得克特

Bert源码学习

文章目录

前言
1. bert模型网络modeling.py

1.1 整体架构 BertModel(object):
1.2 embedding层

1.2.1 embedding_lookup
1.2.2 词向量处理 embedding_postprocessor

1.3 编码层 Encoder Layer

1.3.1 注意力层 Attention Layer
1.3.2 核心：Transformer

1.4 pooler

2. optimization.py 梯度更新

2.1 Adam参数更新
2.2 optimizer返回梯度更新执行单元

3. 模型评估

3.1 metrics_from_confusion_matrix
3.2 precision
3.3 recall
3.4 F1

4. 数据准备

4.1 原文本
4.2 处理逻辑

5. 预训练 pre-training

5.1 主函数 main()
5.2 模型构建 model_fn_builder
5.3 构建训练集 input_fn_builder

前言

网上关于bert的介绍文章有很多，不乏相当优秀的文章，只是大部分偏重理论没有代码，看起来总觉得少点什么，最近正好看相关代码，结合理论记录一下理解的理论和疑问。很多东西也是没办法一次看全的，所以这个笔记会不断的更新。

有以下几个关键点先记住：

1.Bert的编码层采用transformer的decoder部分（多头双向编码器），如果要看代码可以参考 Transformer 代码详解
2.Bert训练的双向模型，其应用的Masked Language Model是随机将15%的单词替换：
80％的时间：用[MASK]标记替换单词，例如，my dog is hairy → my dog is [MASK]
10％的时间：用一个随机的单词替换该单词，例如，my dog is hairy → my dog is apple
10％的时间：保持单词不变，例如，my dog is hairy → my dog is hairy. 这样做的目的是将表示偏向于实际观察到的单词。
3.Bert可以预测下一句，增加了计算下一句的loss，并且其训练输入是基于上下两句，训练的目的是预测是否是连续的两条上下文语句。
4.上述的2和3就是Bert模型的预训练过程，训练出通用模型，然后再根据具体应用，用supervised训练数据。

1. bert模型网络modeling.py

1.1 整体架构 BertModel(object):

首先来看bert模型的模型架构：
在bert变量空间内，包含了embedding层、encoder层和pooler层
embedding层的第一部分通过[vocab_size, embedding_size]维度的embedding_table生成向量

第二部分传递token_type和学习到的位置编码.
encoder层包含创建MASK、多头编码器(包含Attention)
pooler取encoder的输出seq_length的第一行进行全连接输出。

 with tf.variable_scope(scope, default_name="bert"):
      with tf.variable_scope("embeddings"):
        # Perform embedding lookup on the word ids.
        (self.embedding_output, self.embedding_table) = embedding_lookup(
            input_ids=input_ids,
            vocab_size=config.vocab_size,
            embedding_size=config.hidden_size,
            initializer_range=config.initializer_range,
            word_embedding_name="word_embeddings",
            use_one_hot_embeddings=use_one_hot_embeddings)

        # Add positional embeddings and token type embeddings, then layer
        # normalize and perform dropout.
        self.embedding_output = embedding_postprocessor(
            input_tensor=self.embedding_output,
            use_token_type=True,
            token_type_ids=token_type_ids,
            token_type_vocab_size=config.type_vocab_size,
            token_type_embedding_name="token_type_embeddings",
            use_position_embeddings=True,
            position_embedding_name="position_embeddings",
            initializer_range=config.initializer_range,
            max_position_embeddings=config.max_position_embeddings,
            dropout_prob=config.hidden_dropout_prob)

      with tf.variable_scope("encoder"):
        # This converts a 2D mask of shape [batch_size, seq_length] to a 3D
        # mask of shape [batch_size, seq_length, seq_length] which is used
        # for the attention scores.
        attention_mask = create_attention_mask_from_input_mask(
            input_ids, input_mask)

        # Run the stacked transformer.
        # `sequence_output` shape = [batch_size, seq_length, hidden_size].
        self.all_encoder_layers = transformer_model(
            input_tensor=self.embedding_output,
            attention_mask=attention_mask,
            hidden_size=config.hidden_size,
            num_hidden_layers=config.num_hidden_layers,
            num_attention_heads=config.num_attention_heads,
            intermediate_size=config.intermediate_size,
            intermediate_act_fn=get_activation(config.hidden_act),
            hidden_dropout_prob=config.hidden_dropout_prob,
            attention_probs_dropout_prob=config.attention_probs_dropout_prob,
            initializer_range=config.initializer_range,
            do_return_all_layers=True)

      self.sequence_output = self.all_encoder_layers[-1]
      # The "pooler" converts the encoded sequence tensor of shape
      # [batch_size, seq_length, hidden_size] to a tensor of shape
      # [batch_size, hidden_size]. This is necessary for segment-level
      # (or segment-pair-level) classification tasks where we need a fixed
      # dimensional representation of the segment.
      with tf.variable_scope("pooler"):
        # We "pool" the model by simply taking the hidden state corresponding
        # to the first token. We assume that this has been pre-trained
        # 取第一行，然后压缩
        first_token_tensor = tf.squeeze(self.sequence_output[:, 0:1, :], axis=1)
        self.pooled_output = tf.layers.dense(
            first_token_tensor,
            config.hidden_size,
            activation=tf.tanh,
            kernel_initializer=create_initializer(config.initializer_range))

1.2 embedding层

1.2.1 embedding_lookup

embedding_lookup初始化embedding_table来生成向量，这里不复杂。
[batch_size, seq_length]->[batch_size, seq_length, embedding_size]
介绍一下几个重点参数

vocab_size 原本词汇量大小
embedding_size 生成的embedding_table的维度，实际就是hidden_size，bert默认大小768
initializer_range 用于embedding_table的权重初始化

def embedding_lookup(input_ids,
                     vocab_size,
                     embedding_size=128,
                     initializer_range=0.02,
                     word_embedding_name="word_embeddings",
                     use_one_hot_embeddings=False):
  """Looks up words embeddings for id tensor.

  Args:
    input_ids: int32 Tensor of shape [batch_size, seq_length] containing word
      ids.
    vocab_size: int. Size of the embedding vocabulary.
    embedding_size: int. Width of the word embeddings.
    initializer_range: float. Embedding initialization range.
    word_embedding_name: string. Name of the embedding table.
    use_one_hot_embeddings: bool. If True, use one-hot method for word
      embeddings. If False, use `tf.gather()`.

  Returns:
    float Tensor of shape [batch_size, seq_length, embedding_size].
  """
  # This function assumes that the input is of shape [batch_size, seq_length,
  # num_inputs].
  # If the input is a 2D tensor of shape [batch_size, seq_length], we
  # reshape to [batch_size, seq_length, 1].
  if input_ids.shape.ndims == 2:
    input_ids = tf.expand_dims(input_ids, axis=[-1])

  embedding_table = tf.get_variable(
      name=word_embedding_name,
      shape=[vocab_size, embedding_size],
      initializer=create_initializer(initializer_range))

  flat_input_ids = tf.reshape(input_ids, [-1])
  if use_one_hot_embeddings:
    one_hot_input_ids = tf.one_hot(flat_input_ids, depth=vocab_size)
    output = tf.matmul(one_hot_input_ids, embedding_table)
  else:
    output = tf.gather(embedding_table, flat_input_ids)

  input_shape = get_shape_list(input_ids)

  output = tf.reshape(output,
                      input_shape[0:-1] + [input_shape[-1] * embedding_size])
  return (output, embedding_table)

1.2.2 词向量处理 embedding_postprocessor

向量预处理：添加token_type和位置向量。
[batch_size, seq_length, embedding_size]->[batch_size, seq_length,embedding_size] 输入和输出维度不变

这里要注意token_type也是通过token_type_table做的向量化处理，生成的向量与字向量的维度相同（width）
模型训练一个全位置向量，这里的全是指最长的单词长度对应的位置向量，后面会根据实际seq_length（句子长度）截取所需要的长度。位置向量对于不同batch相同seq_length的句子都是相同的，所以这里是在[seq_length,width]维度的加，前面维度置为1。

def embedding_postprocessor(input_tensor,
                            use_token_type=False,
                            token_type_ids=None,
                            token_type_vocab_size=16,
                            token_type_embedding_name="token_type_embeddings",
                            use_position_embeddings=True,
                            position_embedding_name="position_embeddings",
                            initializer_range=0.02,
                            max_position_embeddings=512,
                            dropout_prob=0.1):
  """Performs various post-processing on a word embedding tensor.

  Args:
    input_tensor: float Tensor of shape [batch_size, seq_length,
      embedding_size].
    use_token_type: bool. Whether to add embeddings for `token_type_ids`.
    token_type_ids: (optional) int32 Tensor of shape [batch_size, seq_length].
      Must be specified if `use_token_type` is True.
    token_type_vocab_size: int. The vocabulary size of `token_type_ids`.
    token_type_embedding_name: string. The name of the embedding table variable
      for token type ids.
    use_position_embeddings: bool. Whether to add position embeddings for the
      position of each token in the sequence.
    position_embedding_name: string. The name of the embedding table variable
      for positional embeddings.
    initializer_range: float. Range of the weight initialization.
    max_position_embeddings: int. Maximum sequence length that might ever be
      used with this model. This can be longer than the sequence length of
      input_tensor, but cannot be shorter.
    dropout_prob: float. Dropout probability applied to the final output tensor.

  Returns:
    float tensor with same shape as `input_tensor`.

  Raises:
    ValueError: One of the tensor shapes or input values is invalid.
  """
  input_shape = get_shape_list(input_tensor, expected_rank=3)
  batch_size = input_shape[0]
  seq_length = input_shape[1]
  width = input_shape[2]

  output = input_tensor

  if use_token_type:
    if token_type_ids is None:
      raise ValueError("`token_type_ids` must be specified if"
                       "`use_token_type` is True.")
    token_type_table = tf.get_variable(
        name=token_type_embedding_name,
        shape=[token_type_vocab_size, width],
        initializer=create_initializer(initializer_range))
    # This vocab will be small so we always do one-hot here, since it is always
    # faster for a small vocabulary.
    flat_token_type_ids = tf.reshape(token_type_ids, [-1])
    one_hot_ids = tf.one_hot(flat_token_type_ids, depth=token_type_vocab_size)
    token_type_embeddings = tf.matmul(one_hot_ids, token_type_table)
    token_type_embeddings = tf.reshape(token_type_embeddings,
                                       [batch_size, seq_length, width])
    output += token_type_embeddings

  if use_position_embeddings:
    assert_op = tf.assert_less_equal(seq_length, max_position_embeddings)
    with tf.control_dependencies([assert_op]):
      full_position_embeddings = tf.get_variable(
          name=position_embedding_name,
          shape=[max_position_embeddings, width],
          initializer=create_initializer(initializer_range))
      # Since the position embedding table is a learned variable, we create it
      # using a (long) sequence length `max_position_embeddings`. The actual
      # sequence length might be shorter than this, for faster training of
      # tasks that do not have long sequences.

      # So `full_position_embeddings` is effectively an embedding table
      # for position [0, 1, 2, ..., max_position_embeddings-1], and the current
      # sequence has positions [0, 1, 2, ... seq_length-1], so we can just
      # perform a slice.
      # 切片 [[0,seq_length],[0,-1]]
      position_embeddings = tf.slice(full_position_embeddings, [0, 0],
                                     [seq_length, -1])
      num_dims = len(output.shape.as_list())

      # Only the last two dimensions are relevant (`seq_length` and `width`), so
      # we broadcast among the first dimensions, which is typically just
      # the batch size.因为只有后两个维度与位置相关，batch与位置无关
      position_broadcast_shape = []
      for _ in range(num_dims - 2):
        position_broadcast_shape.append(1)
      position_broadcast_shape.extend([seq_length, width])
      position_embeddings = tf.reshape(position_embeddings,
                                       position_broadcast_shape)
      output += position_embeddings

  output = layer_norm_and_dropout(output, dropout_prob)
  return output

1.3 编码层 Encoder Layer

生成attention_mask用于transformer的生成，编码层主要执行的transformer的encoder

1.3.1 注意力层 Attention Layer

首先通过全连接层构建Q,K,V，这里Q,K,V的维度都是隐藏层的sizehead_size*num_attention_heads，后续会将hidden_size reshape为多头（head_size和num_attention_heads两个维度）来计算每一个head的下图公式的值。这里要注意K,V的维度可能与Q不同，这里取决于输入的from_tensor和to_tensor是否相同，因为Q是通过from_tensor全连接计算而得，K，V是通过to_tensor计算所得。不过就输出而言，肯定是与from_tensor维度相同的。最后将多头reshape为hidden_size.

我们来详细理解下上述公式，主要计算Attention score，获取softmax分数，这个分数表示每个单词对当前单词的贡献。将每个值向量乘以softmax分数（这里是希望关注语义上相关的单词，弱化不相关的单词）加权求和，即得到自注意力层在该位置的输出。

1.3.2 核心：Transformer

前面提到，编码层主要采用的是transformer模型的编码层，关键词：多头，多层
[batch_size, seq_length, hidden_size]->[batch_size, seq_length, hidden_size]
模型有几个关键字段：

head_size=隐藏层大小（hidden_size）/ num_attention_heads
hidden_size=input_width Transformer会将残差连接，所以输入的width需要与hidden_size相同。
隐藏层层数为num_hidden_layer，每一层包含attention层、intermediate层和output层。这里需要注意的是在attention层，多头输出经过连接和线性变换后加上residual（这一层的输入）

def transformer_model(input_tensor,
                      attention_mask=None,
                      hidden_size=768,
                      num_hidden_layers=12,
                      num_attention_heads=12,
                      intermediate_size=3072,
                      intermediate_act_fn=gelu,
                      hidden_dropout_prob=0.1,
                      attention_probs_dropout_prob=0.1,
                      initializer_range=0.02,
                      do_return_all_layers=False):
  """Multi-headed, multi-layer Transformer from "Attention is All You Need".

  This is almost an exact implementation of the original Transformer encoder.

  See the original paper:
  https://arxiv.org/abs/1706.03762

  Also see:
  https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/models/transformer.py

  Args:
    input_tensor: float Tensor of shape [batch_size, seq_length, hidden_size].
    attention_mask: (optional) int32 Tensor of shape [batch_size, seq_length,
      seq_length], with 1 for positions that can be attended to and 0 in
      positions that should not be.
    hidden_size: int. Hidden size of the Transformer.
    num_hidden_layers: int. Number of layers (blocks) in the Transformer.
    num_attention_heads: int. Number of attention heads in the Transformer.
    intermediate_size: int. The size of the "intermediate" (a.k.a., feed
      forward) layer.
    intermediate_act_fn: function. The non-linear activation function to apply
      to the output of the intermediate/feed-forward layer.
    hidden_dropout_prob: float. Dropout probability for the hidden layers.
    attention_probs_dropout_prob: float. Dropout probability of the attention
      probabilities.
    initializer_range: float. Range of the initializer (stddev of truncated
      normal).
    do_return_all_layers: Whether to also return all layers or just the final
      layer.

  Returns:
    float Tensor of shape [batch_size, seq_length, hidden_size], the final
    hidden layer of the Transformer.

  Raises:
    ValueError: A Tensor shape or parameter is invalid.
  """
  if hidden_size % num_attention_heads != 0:
    raise ValueError(
        "The hidden size (%d) is not a multiple of the number of attention "
        "heads (%d)" % (hidden_size, num_attention_heads))

  attention_head_size = int(hidden_size / num_attention_heads)
  input_shape = get_shape_list(input_tensor, expected_rank=3)
  batch_size = input_shape[0]
  seq_length = input_shape[1]
  input_width = input_shape[2]

  # The Transformer performs sum residuals on all layers so the input needs
  # to be the same as the hidden size.
  if input_width != hidden_size:
    raise ValueError("The width of the input tensor (%d) != hidden size (%d)" %
                     (input_width, hidden_size))

  # We keep the representation as a 2D tensor to avoid re-shaping it back and
  # forth from a 3D tensor to a 2D tensor. Re-shapes are normally free on
  # the GPU/CPU but may not be free on the TPU, so we want to minimize them to
  # help the optimizer.
  prev_output = reshape_to_matrix(input_tensor)

  all_layer_outputs = []
  for layer_idx in range(num_hidden_layers):
    with tf.variable_scope("layer_%d" % layer_idx):
      layer_input = prev_output

      with tf.variable_scope("attention"):
        attention_heads = []
        with tf.variable_scope("self"):
          attention_head = attention_layer(
              from_tensor=layer_input,
              to_tensor=layer_input,
              attention_mask=attention_mask,
              num_attention_heads=num_attention_heads,
              size_per_head=attention_head_size,
              attention_probs_dropout_prob=attention_probs_dropout_prob,
              initializer_range=initializer_range,
              do_return_2d_tensor=True,
              batch_size=batch_size,
              from_seq_length=seq_length,
              to_seq_length=seq_length)
          attention_heads.append(attention_head)

        attention_output = None
        if len(attention_heads) == 1:
          attention_output = attention_heads[0]
        else:
          # In the case where we have other sequences, we just concatenate
          # them to the self-attention head before the projection.
          attention_output = tf.concat(attention_heads, axis=-1)

        # Run a linear projection of `hidden_size` then add a residual
        # with `layer_input`.
        with tf.variable_scope("output"):
          attention_output = tf.layers.dense(
              attention_output,
              hidden_size,
              kernel_initializer=create_initializer(initializer_range))
          attention_output = dropout(attention_output, hidden_dropout_prob)
          attention_output = layer_norm(attention_output + layer_input)

      # The activation is only applied to the "intermediate" hidden layer.
      with tf.variable_scope("intermediate"):
        intermediate_output = tf.layers.dense(
            attention_output,
            intermediate_size,
            activation=intermediate_act_fn,
            kernel_initializer=create_initializer(initializer_range))

      # Down-project back to `hidden_size` then add the residual.
      with tf.variable_scope("output"):
        layer_output = tf.layers.dense(
            intermediate_output,
            hidden_size,
            kernel_initializer=create_initializer(initializer_range))
        layer_output = dropout(layer_output, hidden_dropout_prob)
        layer_output = layer_norm(layer_output + attention_output)
        prev_output = layer_output
        all_layer_outputs.append(layer_output)

  if do_return_all_layers:
    final_outputs = []
    for layer_output in all_layer_outputs:
      final_output = reshape_from_matrix(layer_output, input_shape)
      final_outputs.append(final_output)
    return final_outputs
  else:
    final_output = reshape_from_matrix(prev_output, input_shape)
    return final_output

1.4 pooler

pooler层是对编码器的输出进行处理，这里只取了seq_length维度的第一行，然后压缩该维度：

[batch_size, seq_length, hidden_size] => [batch_size, hidden_size]

最后经过一个线性激活层输出。

 with tf.variable_scope("pooler"):
        # We "pool" the model by simply taking the hidden state corresponding
        # to the first token. We assume that this has been pre-trained
        # 取第一行，然后压缩
        first_token_tensor = tf.squeeze(self.sequence_output[:, 0:1, :], axis=1)
        self.pooled_output = tf.layers.dense(
            first_token_tensor,
            config.hidden_size,
            activation=tf.tanh,
            kernel_initializer=create_initializer(config.initializer_range))

2. optimization.py 梯度更新

整体返回反向传播的执行单元train_op，用于tf.session执行

2.1 Adam参数更新

继承自Optimizer类，主要的函数是apply_gradients，通过梯度使用Adam算法更新模型参数，具体怎么使用我们下面来看反向传播的函数。

class AdamWeightDecayOptimizer(tf.train.Optimizer):
  """A basic Adam optimizer that includes "correct" L2 weight decay."""

  def __init__(self,
               learning_rate,
               weight_decay_rate=0.0,
               beta_1=0.9,
               beta_2=0.999,
               epsilon=1e-6,
               exclude_from_weight_decay=None,
               name="AdamWeightDecayOptimizer"):
    """Constructs a AdamWeightDecayOptimizer."""
    super(AdamWeightDecayOptimizer, self).__init__(False, name)

    self.learning_rate = learning_rate
    self.weight_decay_rate = weight_decay_rate
    self.beta_1 = beta_1
    self.beta_2 = beta_2
    self.epsilon = epsilon
    self.exclude_from_weight_decay = exclude_from_weight_decay

  def apply_gradients(self, grads_and_vars, global_step=None, name=None):
    """See base class."""
    assignments = []
    # grads_and_vars 梯度和要更新的变量
    for (grad, param) in grads_and_vars:
      if grad is None or param is None:
        continue

      param_name = self._get_variable_name(param.name)

      m = tf.get_variable(
          name=param_name + "/adam_m",
          shape=param.shape.as_list(),
          dtype=tf.float32,
          trainable=False,
          initializer=tf.zeros_initializer())
      v = tf.get_variable(
          name=param_name + "/adam_v",
          shape=param.shape.as_list(),
          dtype=tf.float32,
          trainable=False,
          initializer=tf.zeros_initializer())

      # Standard Adam update.
      next_m = (
          tf.multiply(self.beta_1, m) + tf.multiply(1.0 - self.beta_1, grad))
      next_v = (
          tf.multiply(self.beta_2, v) + tf.multiply(1.0 - self.beta_2,
                                                    tf.square(grad)))

      update = next_m / (tf.sqrt(next_v) + self.epsilon)

      # Just adding the square of the weights to the loss function is *not*
      # the correct way of using L2 regularization/weight decay with Adam,
      # since that will interact with the m and v parameters in strange ways.
      #
      # Instead we want ot decay the weights in a manner that doesn't interact
      # with the m/v parameters. This is equivalent to adding the square
      # of the weights to the loss with plain (non-momentum) SGD.
      if self._do_use_weight_decay(param_name):
        update += self.weight_decay_rate * param

      update_with_lr = self.learning_rate * update

      next_param = param - update_with_lr

      assignments.extend(
          [param.assign(next_param),
           m.assign(next_m),
           v.assign(next_v)])
    return tf.group(*assignments, name=name)

  def _do_use_weight_decay(self, param_name):
    """Whether to use L2 weight decay for `param_name`."""
    if not self.weight_decay_rate:
      return False
    if self.exclude_from_weight_decay:
      for r in self.exclude_from_weight_decay:
        if re.search(r, param_name) is not None:
          return False
    return True

  def _get_variable_name(self, param_name):
    """Get the variable name from the tensor name."""
    m = re.match("^(.*):\\d+$", param_name)
    if m is not None:
      param_name = m.group(1)
    return param_name

2.2 optimizer返回梯度更新执行单元

参数优化执行需要以下几个参数：

global_step 模型训练执行的全局步数，用于学习率的变化
learning_rate 学习率，通过多项式衰减变化
num_warmup_steps 模型预热，通过全局步数/预热步数*lr来变化学习率

关键变量：

tvars 模型需要训练的参数
grads 根据loss计算参数的梯度，并通过梯度裁剪（clip_norm是裁剪的比例）
train_op 这里我理解为反向传播的整个执行单元
new_global_step 反向传播通常会自动更新global_step，这里AdamWeightDecayOptimizer并没有做，所以需要手动更新global_step

3. 模型评估

3.1 metrics_from_confusion_matrix

根据混淆矩阵（confusion_matrix是记录预测值和真实值的矩阵，可以方便的计算tp fp fn tn）计算准确率、召回率和F1 参考 confusion_matrix

def metrics_from_confusion_matrix(cm, pos_indices=None, average='micro',
                                  beta=1):
    """Precision, Recall and F1 from the confusion matrix
    Parameters
    ----------
    cm : tf.Tensor of type tf.int32, of shape (num_classes, num_classes)
        The streaming confusion matrix.
    pos_indices : list of int, optional
        The indices of the positive classes
    beta : int, optional
        Weight of precision in harmonic mean
    average : str, optional
        'micro', 'macro' or 'weighted'
    """
    num_classes = cm.shape[0]
    if pos_indices is None:
        pos_indices = [i for i in range(num_classes)]

    if average == 'micro':
        return pr_re_fbeta(cm, pos_indices, beta)
    elif average in {'macro', 'weighted'}:
        precisions, recalls, fbetas, n_golds = [], [], [], []
        for idx in pos_indices:
            pr, re, fbeta = pr_re_fbeta(cm, [idx], beta)
            precisions.append(pr)
            recalls.append(re)
            fbetas.append(fbeta)
            cm_mask = np.zeros([num_classes, num_classes])
            cm_mask[idx, :] = 1
            n_golds.append(tf.to_float(tf.reduce_sum(cm * cm_mask)))

        if average == 'macro':
            pr = tf.reduce_mean(precisions)
            re = tf.reduce_mean(recalls)
            fbeta = tf.reduce_mean(fbetas)
            return pr, re, fbeta
        if average == 'weighted':
            n_gold = tf.reduce_sum(n_golds)
            pr_sum = sum(p * n for p, n in zip(precisions, n_golds))
            pr = safe_div(pr_sum, n_gold)
            re_sum = sum(r * n for r, n in zip(recalls, n_golds))
            re = safe_div(re_sum, n_gold)
            fbeta_sum = sum(f * n for f, n in zip(fbetas, n_golds))
            fbeta = safe_div(fbeta_sum, n_gold)
            return pr, re, fbeta

    else:
        raise NotImplementedError()

3.2 precision

构建混淆矩阵，然后根据混淆矩阵计算准确率。这里主要到混淆矩阵其实返回了两个值：

total_cm: confusion matrix.
update_op: An operation that increments the confusion matrix.
关于第二个变量先留个坑，以后来填。（涉及计算图、分布式计算貌似）
根据混淆矩阵计算了两个准确率。关于下面查全率和F1的计算与准确率类似。

def precision(labels, predictions, num_classes, pos_indices=None,
              weights=None, average='micro'):
    """Multi-class precision metric for Tensorflow
    Parameters
    ----------
    labels : Tensor of tf.int32 or tf.int64
        The true labels
    predictions : Tensor of tf.int32 or tf.int64
        The predictions, same shape as labels
    num_classes : int
        The number of classes
    pos_indices : list of int, optional
        The indices of the positive classes, default is all
    weights : Tensor of tf.int32, optional
        Mask, must be of compatible shape with labels
    average : str, optional
        'micro': counts the total number of true positives, false
            positives, and false negatives for the classes in
            `pos_indices` and infer the metric from it.
        'macro': will compute the metric separately for each class in
            `pos_indices` and average. Will not account for class
            imbalance.
        'weighted': will compute the metric separately for each class in
            `pos_indices` and perform a weighted average by the total
            number of true labels for each class.
    Returns
    -------
    tuple of (scalar float Tensor, update_op)
    """
    cm, op = _streaming_confusion_matrix(
        labels, predictions, num_classes, weights)
    pr, _, _ = metrics_from_confusion_matrix(
        cm, pos_indices, average=average)
    op, _, _ = metrics_from_confusion_matrix(
        op, pos_indices, average=average)
    return (pr, op)

3.3 recall

def recall(labels, predictions, num_classes, pos_indices=None, weights=None,
           average='micro'):
    """Multi-class recall metric for Tensorflow
    Parameters
    ----------
    labels : Tensor of tf.int32 or tf.int64
        The true labels
    predictions : Tensor of tf.int32 or tf.int64
        The predictions, same shape as labels
    num_classes : int
        The number of classes
    pos_indices : list of int, optional
        The indices of the positive classes, default is all
    weights : Tensor of tf.int32, optional
        Mask, must be of compatible shape with labels
    average : str, optional
        'micro': counts the total number of true positives, false
            positives, and false negatives for the classes in
            `pos_indices` and infer the metric from it.
        'macro': will compute the metric separately for each class in
            `pos_indices` and average. Will not account for class
            imbalance.
        'weighted': will compute the metric separately for each class in
            `pos_indices` and perform a weighted average by the total
            number of true labels for each class.
    Returns
    -------
    tuple of (scalar float Tensor, update_op)
    """
    cm, op = _streaming_confusion_matrix(
        labels, predictions, num_classes, weights)
    _, re, _ = metrics_from_confusion_matrix(
        cm, pos_indices, average=average)
    _, op, _ = metrics_from_confusion_matrix(
        op, pos_indices, average=average)
    return (re, op)

3.4 F1

def f1(labels, predictions, num_classes, pos_indices=None, weights=None,
       average='micro'):
    return fbeta(labels, predictions, num_classes, pos_indices, weights,
                 average)


def fbeta(labels, predictions, num_classes, pos_indices=None, weights=None,
          average='micro', beta=1):
    """Multi-class fbeta metric for Tensorflow
    Parameters
    ----------
    labels : Tensor of tf.int32 or tf.int64
        The true labels
    predictions : Tensor of tf.int32 or tf.int64
        The predictions, same shape as labels
    num_classes : int
        The number of classes
    pos_indices : list of int, optional
        The indices of the positive classes, default is all
    weights : Tensor of tf.int32, optional
        Mask, must be of compatible shape with labels
    average : str, optional
        'micro': counts the total number of true positives, false
            positives, and false negatives for the classes in
            `pos_indices` and infer the metric from it.
        'macro': will compute the metric separately for each class in
            `pos_indices` and average. Will not account for class
            imbalance.
        'weighted': will compute the metric separately for each class in
            `pos_indices` and perform a weighted average by the total
            number of true labels for each class.
    beta : int, optional
        Weight of precision in harmonic mean
    Returns
    -------
    tuple of (scalar float Tensor, update_op)
    """
    cm, op = _streaming_confusion_matrix(
        labels, predictions, num_classes, weights)
    _, _, fbeta = metrics_from_confusion_matrix(
        cm, pos_indices, average=average, beta=beta)
    _, _, op = metrics_from_confusion_matrix(
        op, pos_indices, average=average, beta=beta)
    return (fbeta, op)

4. 数据准备

4.1 原文本

以下内容部分是参考 BERT解读
模型最初的数据形式如下，源码中对于这个输入文件的要求为：

一行一句话，最好是真实文本中的一句话，而不是一段话或者一句话的部分
文档之间用空行分开，防止NSP任务跨文档。

从数据准备的主函数来看，重点是create_training_instances也就是构建训练集列表，每个元素为类实例。

This text is included to make sure Unicode is handled properly: 力加勝北区ᴵᴺᵀᵃছজটডণত
Text should be one-sentence-per-line, with empty lines between documents.
This sample text is public domain and was randomly selected from Project Guttenberg.

The rain had only ceased with the gray streaks of morning at Blazing Star, and the settlement awoke to a moral sense of cleanliness, and the finding of forgotten knives, tin cups, and smaller camp utensils, where the heavy showers had washed away the debris and dust heaps before the cabin doors.
Indeed, it was recorded in Blazing Star that a fortunate early riser had once picked up on the highway a solid chunk of gold quartz which the rain had freed from its incumbering soil, and washed into immediate and glittering popularity.
Possibly this may have been the reason why early risers in that locality, during the rainy season, adopted a thoughtful habit of body, and seldom lifted their eyes to the rifted or india-ink washed skies above them.
"Cass" Beard had risen early that morning, but not with a view to discovery.
A leak in his cabin roof,--quite consistent with his careless, improvident habits,--had roused him at 4 A. M., with a flooded "bunk" and wet blankets.
The chips from his wood pile refused to kindle a fire to dry his bed-clothes, and he had recourse to a more provident neighbor's to supply the deficiency.
This was nearly opposite.
Mr. Cassius crossed the highway, and stopped suddenly.
Something glittered in the nearest red pool before him.
Gold, surely!
But, wonderful to relate, not an irregular, shapeless fragment of crude ore, fresh from Nature's crucible, but a bit of jeweler's handicraft in the form of a plain gold ring.
Looking at it more attentively, he saw that it bore the inscription, "May to Cass."
Like most of his fellow gold-seekers, Cass was superstitious.

The fountain of classic wisdom, Hypatia herself.
As the ancient sage--the name is unimportant to a monk--pumped water nightly that he might study by day, so I, the guardian of cloaks and parasols, at the sacred doors of her lecture-room, imbibe celestial knowledge.
From my youth I felt in me a soul above the matter-entangled herd.
She revealed to me the glorious fact, that I am a spark of Divinity itself.
A fallen star, I am, sir!' continued he, pensively, stroking his lean stomach--'a fallen star!--fallen, if the dignity of philosophy will allow of the simile, among the hogs of the lower world--indeed, even into the hog-bucket itself. Well, after all, I will show you the way to the Archbishop's.
There is a philosophic pleasure in opening one's treasures to the modest young.
Perhaps you will assist me by carrying this basket of fruit?' And the little man jumped up, put his basket on Philammon's head, and trotted off up a neighbouring street.
Philammon followed, half contemptuous, half wondering at what this philosophy might be, which could feed the self-conceit of anything so abject as his ragged little apish guide;
but the novel roar and whirl of the street, the perpetual stream of busy faces, the line of curricles, palanquins, laden asses, camels, elephants, which met and passed him, and squeezed him up steps and into doorways, as they threaded their way through the great Moon-gate into the ample street beyond, drove everything from his mind but wondering curiosity, and a vague, helpless dread of that great living wilderness, more terrible than any dead wilderness of sand which he had left behind.
Already he longed for the repose, the silence of the Laura--for faces which knew him and smiled upon him; but it was too late to turn back now.
His guide held on for more than a mile up the great main street, crossed in the centre of the city, at right angles, by one equally magnificent, at each end of which, miles away, appeared, dim and distant over the heads of the living stream of passengers, the yellow sand-hills of the desert;
while at the end of the vista in front of them gleamed the blue harbour, through a network of countless masts.
At last they reached the quay at the opposite end of the street;
and there burst on Philammon's astonished eyes a vast semicircle of blue sea, ringed with palaces and towers.
He stopped involuntarily; and his little guide stopped also, and looked askance at the young monk, to watch the effect which that grand panorama should produce on him.

4.2 处理逻辑

对上述文件的处理代码如下：
主函数

def main(_):
  tf.logging.set_verbosity(tf.logging.INFO)
  # 端到端的token处理，对应词汇表
  tokenizer = tokenization.FullTokenizer(
      vocab_file=FLAGS.vocab_file, do_lower_case=FLAGS.do_lower_case)

  input_files = []
  for input_pattern in FLAGS.input_file.split(","):
    input_files.extend(tf.gfile.Glob(input_pattern))

  tf.logging.info("*** Reading from input files ***")
  for input_file in input_files:
    tf.logging.info("  %s", input_file)

  rng = random.Random(FLAGS.random_seed)
  instances = create_training_instances(
      input_files, tokenizer, FLAGS.max_seq_length, FLAGS.dupe_factor,
      FLAGS.short_seq_prob, FLAGS.masked_lm_prob, FLAGS.max_predictions_per_seq,
      rng)

  output_files = FLAGS.output_file.split(",")
  tf.logging.info("*** Writing to output files ***")
  for output_file in output_files:
    tf.logging.info("  %s", output_file)
      
  # 将类列表写入到输出文件中
  write_instance_to_example_files(instances, tokenizer, FLAGS.max_seq_length,
                                  FLAGS.max_predictions_per_seq, output_files)

create_training_instances创建样本列表

rng = random.Random(FLAGS.random_seed)
instances = create_training_instances(
  input_files, tokenizer, FLAGS.max_seq_length, FLAGS.dupe_factor,
  FLAGS.short_seq_prob, FLAGS.masked_lm_prob, FLAGS.max_predictions_per_seq,
  rng)
print(instances[0])

输出的示例如下：

tokens: [CLS] for more than a [MASK] up [MASK] great main street , [MASK] [MASK] [MASK] centre of the city , at right angles , [MASK] one equally magnificent , at each end ##ミ which , miles away , appeared , dim and distant over the heads of the living stream of passengers , the yellow [MASK] - hills of [MASK] [MASK] ; while at the end of the vista in front [MASK] them gleamed the blue harbour , through a network [SEP] possibly this may have been the reason why early rise ##rs in [MASK] locality , during the rainy season , adopted [MASK] [MASK] [MASK] of body , and seldom lifted [MASK] eyes [MASK] the rift [MASK] or india - ink washed skies above them . [SEP]
segment_ids: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
is_random_next: True
masked_lm_positions: 5 7 12 13 14 24 32 55 59 60 71 94 103 104 105 109 112 114 117
masked_lm_labels: mile the crossed in the by of sand the desert of that a thoughtful habit and their to ##ed

可见，其输出有：

tokens：两句话pack在一起，并加上[MASK]、[CLS]和[SEP]之后的token序列
segment_ids：句子标识符，0表示第一句话，1表示第二句话
is_random_next：是否为下一句，NSP任务的label
masked_lm_positions：用于MLM任务，标识哪些位置被mask了
masked_lm_labels：用于MLM任务，表示被mask的位置上的label

下面我们来看create_training_instances的执行流程：

读取文件，生成一个大列表，大列表每个小列表装着一段话，以每句的形式构成小列表
将输入的文本分字后，通过vocab映射为id
这里dupe_factor是为了生成不同的MASK
调用create_instances_from_document生成实际的训练数据

def create_training_instances(input_files, tokenizer, max_seq_length,
                              dupe_factor, short_seq_prob, masked_lm_prob,
                              max_predictions_per_seq, rng):
  """Create `TrainingInstance`s from raw text."""
  all_documents = [[]]
  for input_file in input_files:
    with tf.gfile.GFile(input_file, "r") as reader:
      while True:
        line = tokenization.convert_to_unicode(reader.readline())
        if not line:
          break
        line = line.strip()

        # Empty lines are used as document delimiters
        if not line:
          all_documents.append([])
        tokens = tokenizer.tokenize(line)
        if tokens:
          all_documents[-1].append(tokens)

  # Remove empty documents
  all_documents = [x for x in all_documents if x]
  rng.shuffle(all_documents)

  vocab_words = list(tokenizer.vocab.keys())
  instances = []
  for _ in range(dupe_factor):
    for document_index in range(len(all_documents)):
      instances.extend(
          create_instances_from_document(
              all_documents, document_index, max_seq_length, short_seq_prob,
              masked_lm_prob, max_predictions_per_seq, vocab_words, rng))

  rng.shuffle(instances)
  return instances

下面看一下create_instances_from_document生成训练数据的逻辑：

def create_instances_from_document(
        all_documents, document_index, max_seq_length, short_seq_prob,
        masked_lm_prob, max_predictions_per_seq, vocab_words, rng):
    """Creates `TrainingInstance`s for a single document."""
    document = all_documents[document_index]

    # Account for [CLS], [SEP], [SEP]
    max_num_tokens = max_seq_length - 3

    target_seq_length = max_num_tokens
    if rng.random() < short_seq_prob:
        target_seq_length = rng.randint(2, max_num_tokens)

    instances = []
    current_chunk = []
    current_length = 0
    i = 0
    while i < len(document):
        segment = document[i]
        current_chunk.append(segment)
        current_length += len(segment)
        # 文档结尾 or 当前长度大于目标长度
        if i == len(document) - 1 or current_length >= target_seq_length:
            if current_chunk:
                a_end = 1
                if len(current_chunk) >= 2:
                    # 随机拆分为 两句
                    a_end = rng.randint(1, len(current_chunk) - 1)

                tokens_a = []
                for j in range(a_end):
                    tokens_a.extend(current_chunk[j])

                tokens_b = []
                # Random next
                is_random_next = False
                # 随机下一句或不随机
                if len(current_chunk) == 1 or rng.random() < 0.5:
                    is_random_next = True
                    target_b_length = target_seq_length - len(tokens_a)

                    for _ in range(10):
                        random_document_index = rng.randint(0, len(all_documents) - 1)
                        if random_document_index != document_index:
                            break

                    random_document = all_documents[random_document_index]
                    random_start = rng.randint(0, len(random_document) - 1)
                    for j in range(random_start, len(random_document)):
                        tokens_b.extend(random_document[j])
                        if len(tokens_b) >= target_b_length:
                            break
                    # 更新i：因为随机下一句，导致后半部分没有用到，构建下个句子组使用
                    num_unused_segments = len(current_chunk) - a_end
                    i -= num_unused_segments
                # Actual next
                else:
                    is_random_next = False
                    for j in range(a_end, len(current_chunk)):
                        tokens_b.extend(current_chunk[j])
                truncate_seq_pair(tokens_a, tokens_b, max_num_tokens, rng)

                assert len(tokens_a) >= 1
                assert len(tokens_b) >= 1
                # 构建tokens和segment_ids
                tokens = []
                segment_ids = []
                tokens.append("[CLS]")
                segment_ids.append(0)
                for token in tokens_a:
                    tokens.append(token)
                    segment_ids.append(0)

                tokens.append("[SEP]")
                segment_ids.append(0)

                for token in tokens_b:
                    tokens.append(token)
                    segment_ids.append(1)
                tokens.append("[SEP]")
                segment_ids.append(1)

                (tokens, masked_lm_positions,
                 masked_lm_labels) = create_masked_lm_predictions(
                    tokens, masked_lm_prob, max_predictions_per_seq, vocab_words, rng)
                instance = TrainingInstance(
                    tokens=tokens,
                    segment_ids=segment_ids,
                    is_random_next=is_random_next,
                    masked_lm_positions=masked_lm_positions,
                    masked_lm_labels=masked_lm_labels)
                instances.append(instance)
            current_chunk = []
            current_length = 0
        i += 1
    return instances

到这里，数据准备就剩下创建MASKcreate_masked_lm_predictions和构建最终的训练集的类TrainingInstance.
以下是create_masked_lm_predictions的函数构建，解释很详细了。

# 构建mask 返回token mask位置 label值
def create_masked_lm_predictions(tokens, masked_lm_prob,
                                 max_predictions_per_seq, vocab_words, rng):
  """Creates the predictions for the masked LM objective."""

  cand_indexes = []
  for (i, token) in enumerate(tokens):
    if token == "[CLS]" or token == "[SEP]":
      continue
    # Whole Word Masking means that if we mask all of the wordpieces
    # corresponding to an original word. When a word has been split into
    # WordPieces, the first token does not have any marker and any subsequence
    # tokens are prefixed with ##. So whenever we see the ## token, we
    # append it to the previous set of word indexes.
    #
    # Note that Whole Word Masking does *not* change the training code
    # at all -- we still predict each WordPiece independently, softmaxed
    # over the entire vocabulary.
    if (FLAGS.do_whole_word_mask and len(cand_indexes) >= 1 and
        token.startswith("##")):
      # 将##为前缀的加到前一个字的列表中，如果是做整词预测
      cand_indexes[-1].append(i)
    else:
      cand_indexes.append([i])

  rng.shuffle(cand_indexes)

  output_tokens = list(tokens)
  
  # 确定需要预测的字数量
  num_to_predict = min(max_predictions_per_seq,
                       max(1, int(round(len(tokens) * masked_lm_prob))))

  masked_lms = []
  covered_indexes = set()
  for index_set in cand_indexes:
    if len(masked_lms) >= num_to_predict:
      break
    # If adding a whole-word mask would exceed the maximum number of
    # predictions, then just skip this candidate.
    # 如果加上这整个单词超过最大预测数，则跳过这个候选词
    if len(masked_lms) + len(index_set) > num_to_predict:
      continue
    is_any_index_covered = False
    # 是否是否有字已经包含在非重复集合里面
    for index in index_set:
      if index in covered_indexes:
        is_any_index_covered = True
        break
    if is_any_index_covered:
      continue
    for index in index_set:
      covered_indexes.add(index)

      masked_token = None
      # 80% of the time, replace with [MASK]
      if rng.random() < 0.8:
        masked_token = "[MASK]"
      else:
        # 10% of the time, keep original
        if rng.random() < 0.5:
          masked_token = tokens[index]
        # 10% of the time, replace with random word
        else:
          masked_token = vocab_words[rng.randint(0, len(vocab_words) - 1)]
      # 替换成相应的masked_token
      output_tokens[index] = masked_token

      masked_lms.append(MaskedLmInstance(index=index, label=tokens[index]))
  assert len(masked_lms) <= num_to_predict
  masked_lms = sorted(masked_lms, key=lambda x: x.index)

  masked_lm_positions = []
  masked_lm_labels = []
  for p in masked_lms:
    masked_lm_positions.append(p.index)
    masked_lm_labels.append(p.label)

  return (output_tokens, masked_lm_positions, masked_lm_labels)

下面是构建训练集定制类，添加了打印训练样本信息的函数。

class TrainingInstance(object):
  """A single training instance (sentence pair)."""

  def __init__(self, tokens, segment_ids, masked_lm_positions, masked_lm_labels,
               is_random_next):
    self.tokens = tokens
    self.segment_ids = segment_ids
    self.is_random_next = is_random_next
    self.masked_lm_positions = masked_lm_positions
    self.masked_lm_labels = masked_lm_labels

  def __str__(self):
    s = ""
    s += "tokens: %s\n" % (" ".join(
        [tokenization.printable_text(x) for x in self.tokens]))
    s += "segment_ids: %s\n" % (" ".join([str(x) for x in self.segment_ids]))
    s += "is_random_next: %s\n" % self.is_random_next
    s += "masked_lm_positions: %s\n" % (" ".join(
        [str(x) for x in self.masked_lm_positions]))
    s += "masked_lm_labels: %s\n" % (" ".join(
        [tokenization.printable_text(x) for x in self.masked_lm_labels]))
    s += "\n"
    return s

  def __repr__(self):
    return self.__str__()

5. 预训练 pre-training

5.1 主函数 main()

预训练的主函数比较新的是进一步封装了训练的硬件信息为estimator，便于训练设备的切换(如果没有TPU,使用GPU或CPU)，estimator执行训练和评估过程。
主函数调用了两个函数：模型构建函数和输入构建函数，接下来看一下。

def main(_):
  tf.logging.set_verbosity(tf.logging.INFO)  # 设定日志输出类型

  if not FLAGS.do_train and not FLAGS.do_eval:
    raise ValueError("At least one of `do_train` or `do_eval` must be True.")

  bert_config = modeling.BertConfig.from_json_file(FLAGS.bert_config_file)

  tf.gfile.MakeDirs(FLAGS.output_dir)

  input_files = []
  for input_pattern in FLAGS.input_file.split(","):
    input_files.extend(tf.gfile.Glob(input_pattern))

  tf.logging.info("*** Input Files ***")
  for input_file in input_files:
    tf.logging.info("  %s" % input_file)

  tpu_cluster_resolver = None
  if FLAGS.use_tpu and FLAGS.tpu_name:
    tpu_cluster_resolver = tf.contrib.cluster_resolver.TPUClusterResolver(
        FLAGS.tpu_name, zone=FLAGS.tpu_zone, project=FLAGS.gcp_project)

  is_per_host = tf.contrib.tpu.InputPipelineConfig.PER_HOST_V2
  run_config = tf.contrib.tpu.RunConfig(
      cluster=tpu_cluster_resolver,
      master=FLAGS.master,
      model_dir=FLAGS.output_dir,
      save_checkpoints_steps=FLAGS.save_checkpoints_steps,
      tpu_config=tf.contrib.tpu.TPUConfig(
          iterations_per_loop=FLAGS.iterations_per_loop,
          num_shards=FLAGS.num_tpu_cores,
          per_host_input_for_training=is_per_host))

  model_fn = model_fn_builder(
      bert_config=bert_config,
      init_checkpoint=FLAGS.init_checkpoint,
      learning_rate=FLAGS.learning_rate,
      num_train_steps=FLAGS.num_train_steps,
      num_warmup_steps=FLAGS.num_warmup_steps,
      use_tpu=FLAGS.use_tpu,
      use_one_hot_embeddings=FLAGS.use_tpu)

  # If TPU is not available, this will fall back to normal Estimator on CPU
  # or GPU.
  estimator = tf.contrib.tpu.TPUEstimator(
      use_tpu=FLAGS.use_tpu,
      model_fn=model_fn,
      config=run_config,
      train_batch_size=FLAGS.train_batch_size,
      eval_batch_size=FLAGS.eval_batch_size)

  if FLAGS.do_train:
    tf.logging.info("***** Running training *****")
    tf.logging.info("  Batch size = %d", FLAGS.train_batch_size)
    train_input_fn = input_fn_builder(
        input_files=input_files,
        max_seq_length=FLAGS.max_seq_length,
        max_predictions_per_seq=FLAGS.max_predictions_per_seq,
        is_training=True)
    estimator.train(input_fn=train_input_fn, max_steps=FLAGS.num_train_steps)

  if FLAGS.do_eval:
    tf.logging.info("***** Running evaluation *****")
    tf.logging.info("  Batch size = %d", FLAGS.eval_batch_size)

    eval_input_fn = input_fn_builder(
        input_files=input_files,
        max_seq_length=FLAGS.max_seq_length,
        max_predictions_per_seq=FLAGS.max_predictions_per_seq,
        is_training=False)

    result = estimator.evaluate(
        input_fn=eval_input_fn, steps=FLAGS.max_eval_steps)

    output_eval_file = os.path.join(FLAGS.output_dir, "eval_results.txt")
    with tf.gfile.GFile(output_eval_file, "w") as writer:
      tf.logging.info("***** Eval results *****")
      for key in sorted(result.keys()):
        tf.logging.info("  %s = %s", key, str(result[key]))
        writer.write("%s = %s\n" % (key, str(result[key])))

5.2 模型构建 model_fn_builder

分为两种模式：

训练模式，分别计算MASK和下一句预测的损失，将损失用于计算梯度，更新梯度。
评估模式，主要计算模型效果的各种指标

这里不详细叙述(有一些细节我也没仔细看)。

def model_fn_builder(bert_config, init_checkpoint, learning_rate,
                     num_train_steps, num_warmup_steps, use_tpu,
                     use_one_hot_embeddings):
  """Returns `model_fn` closure for TPUEstimator."""

  def model_fn(features, labels, mode, params):  # pylint: disable=unused-argument
    """The `model_fn` for TPUEstimator."""

    tf.logging.info("*** Features ***")
    for name in sorted(features.keys()):
      tf.logging.info("  name = %s, shape = %s" % (name, features[name].shape))

    input_ids = features["input_ids"]
    input_mask = features["input_mask"]
    segment_ids = features["segment_ids"]
    masked_lm_positions = features["masked_lm_positions"]
    masked_lm_ids = features["masked_lm_ids"]
    masked_lm_weights = features["masked_lm_weights"]
    next_sentence_labels = features["next_sentence_labels"]

    is_training = (mode == tf.estimator.ModeKeys.TRAIN)

    model = modeling.BertModel(
        config=bert_config,
        is_training=is_training,
        input_ids=input_ids,
        input_mask=input_mask,
        token_type_ids=segment_ids,
        use_one_hot_embeddings=use_one_hot_embeddings)

    (masked_lm_loss,
     masked_lm_example_loss, masked_lm_log_probs) = get_masked_lm_output(
         bert_config, model.get_sequence_output(), model.get_embedding_table(),
         masked_lm_positions, masked_lm_ids, masked_lm_weights)

    (next_sentence_loss, next_sentence_example_loss,
     next_sentence_log_probs) = get_next_sentence_output(
         bert_config, model.get_pooled_output(), next_sentence_labels)

    total_loss = masked_lm_loss + next_sentence_loss

    tvars = tf.trainable_variables()

    initialized_variable_names = {}
    scaffold_fn = None
    if init_checkpoint:
      (assignment_map, initialized_variable_names
      ) = modeling.get_assignment_map_from_checkpoint(tvars, init_checkpoint)
      if use_tpu:

        def tpu_scaffold():
          tf.train.init_from_checkpoint(init_checkpoint, assignment_map)
          return tf.train.Scaffold()

        scaffold_fn = tpu_scaffold
      else:
        tf.train.init_from_checkpoint(init_checkpoint, assignment_map)

    tf.logging.info("**** Trainable Variables ****")
    for var in tvars:
      init_string = ""
      if var.name in initialized_variable_names:
        init_string = ", *INIT_FROM_CKPT*"
      tf.logging.info("  name = %s, shape = %s%s", var.name, var.shape,
                      init_string)

    output_spec = None
    if mode == tf.estimator.ModeKeys.TRAIN:
      # 根据损失计算梯度
      train_op = optimization.create_optimizer(
          total_loss, learning_rate, num_train_steps, num_warmup_steps, use_tpu)

      output_spec = tf.contrib.tpu.TPUEstimatorSpec(
          mode=mode,
          loss=total_loss,
          train_op=train_op,
          scaffold_fn=scaffold_fn)
    elif mode == tf.estimator.ModeKeys.EVAL:

      def metric_fn(masked_lm_example_loss, masked_lm_log_probs, masked_lm_ids,
                    masked_lm_weights, next_sentence_example_loss,
                    next_sentence_log_probs, next_sentence_labels):
        """Computes the loss and accuracy of the model."""
        masked_lm_log_probs = tf.reshape(masked_lm_log_probs,
                                         [-1, masked_lm_log_probs.shape[-1]])
        masked_lm_predictions = tf.argmax(
            masked_lm_log_probs, axis=-1, output_type=tf.int32)
        masked_lm_example_loss = tf.reshape(masked_lm_example_loss, [-1])
        masked_lm_ids = tf.reshape(masked_lm_ids, [-1])
        masked_lm_weights = tf.reshape(masked_lm_weights, [-1])
        masked_lm_accuracy = tf.metrics.accuracy(
            labels=masked_lm_ids,
            predictions=masked_lm_predictions,
            weights=masked_lm_weights)
        masked_lm_mean_loss = tf.metrics.mean(
            values=masked_lm_example_loss, weights=masked_lm_weights)

        next_sentence_log_probs = tf.reshape(
            next_sentence_log_probs, [-1, next_sentence_log_probs.shape[-1]])
        next_sentence_predictions = tf.argmax(
            next_sentence_log_probs, axis=-1, output_type=tf.int32)
        next_sentence_labels = tf.reshape(next_sentence_labels, [-1])
        next_sentence_accuracy = tf.metrics.accuracy(
            labels=next_sentence_labels, predictions=next_sentence_predictions)
        next_sentence_mean_loss = tf.metrics.mean(
            values=next_sentence_example_loss)

        return {
            "masked_lm_accuracy": masked_lm_accuracy,
            "masked_lm_loss": masked_lm_mean_loss,
            "next_sentence_accuracy": next_sentence_accuracy,
            "next_sentence_loss": next_sentence_mean_loss,
        }

      eval_metrics = (metric_fn, [
          masked_lm_example_loss, masked_lm_log_probs, masked_lm_ids,
          masked_lm_weights, next_sentence_example_loss,
          next_sentence_log_probs, next_sentence_labels
      ])
      output_spec = tf.contrib.tpu.TPUEstimatorSpec(
          mode=mode,
          loss=total_loss,
          eval_metrics=eval_metrics,
          scaffold_fn=scaffold_fn)
    else:
      raise ValueError("Only TRAIN and EVAL modes are supported: %s" % (mode))

    return output_spec

  return model_fn

5.3 构建训练集 input_fn_builder

采用from_tensor_slices构建数据集，可以参考 Tensorflow tf.data.Dataset
通过parallel_interleave()映射tf.data.TFRecordDataset并行生成嵌套数据集。

def input_fn_builder(input_files,
                     max_seq_length,
                     max_predictions_per_seq,
                     is_training,
                     num_cpu_threads=4):
  """Creates an `input_fn` closure to be passed to TPUEstimator."""

  def input_fn(params):
    """The actual input function."""
    batch_size = params["batch_size"]
    # 定义解析字典
    name_to_features = {
        "input_ids":
            tf.FixedLenFeature([max_seq_length], tf.int64),
        "input_mask":
            tf.FixedLenFeature([max_seq_length], tf.int64),
        "segment_ids":
            tf.FixedLenFeature([max_seq_length], tf.int64),
        "masked_lm_positions":
            tf.FixedLenFeature([max_predictions_per_seq], tf.int64),
        "masked_lm_ids":
            tf.FixedLenFeature([max_predictions_per_seq], tf.int64),
        "masked_lm_weights":
            tf.FixedLenFeature([max_predictions_per_seq], tf.float32),
        "next_sentence_labels":
            tf.FixedLenFeature([1], tf.int64),
    }

    # For training, we want a lot of parallel reading and shuffling.
    # For eval, we want no shuffling and parallel reading doesn't matter.
    if is_training:
      d = tf.data.Dataset.from_tensor_slices(tf.constant(input_files))
      d = d.repeat()
      d = d.shuffle(buffer_size=len(input_files))

      # `cycle_length` is the number of parallel files that get read.
      cycle_length = min(num_cpu_threads, len(input_files))

      # `sloppy` mode means that the interleaving is not exact. This adds
      # even more randomness to the training pipeline.
      d = d.apply(
          tf.contrib.data.parallel_interleave(
              tf.data.TFRecordDataset,
              sloppy=is_training,
              cycle_length=cycle_length))
      d = d.shuffle(buffer_size=100)
    else:
      d = tf.data.TFRecordDataset(input_files)
      # Since we evaluate for a fixed number of steps we don't want to encounter
      # out-of-range exceptions.
      d = d.repeat()

    # We must `drop_remainder` on training because the TPU requires fixed
    # size dimensions. For eval, we assume we are evaluating on the CPU or GPU
    # and we *don't* want to drop the remainder, otherwise we wont cover
    # every sample.
    d = d.apply(
        tf.contrib.data.map_and_batch(
            lambda record: _decode_record(record, name_to_features),
            batch_size=batch_size,
            num_parallel_batches=num_cpu_threads,
            drop_remainder=True))
    return d

  return input_fn

至此，整体的源代码预训练初步看完了，后续结合fine-tuning再补充。

1. [NLP自然语言处理]谷歌BERT模型深度解析
2. tf.clip_by_global_norm理解
3. TensorFlow学习（十一）：保存TFRecord文件

你可能感兴趣的:(NLP)

免费的GPT可在线直接使用（一键收藏） kkai人工智能 gpt
1、LuminAI（https://kk.zlrxjh.top）LuminAI标志着一款融合了星辰大数据模型与文脉深度模型的先进知识增强型语言处理系统，旨在自然语言处理（NLP）的技术开发领域发光发热。此系统展现了卓越的语义把握与内容生成能力，轻松驾驭多样化的自然语言处理任务。VisionAI在NLP界的应用领域广泛，能够胜任从机器翻译、文本概要撰写、情绪分析到问答等众多任务。通过对大量文本数据的
机器学习-聚类算法不良人龍木木机器学习机器学习算法聚类
机器学习-聚类算法1.AHC2.K-means3.SC4.MCL仅个人笔记，感谢点赞关注！1.AHC2.K-means3.SC传统谱聚类：个人对谱聚类算法的理解以及改进4.MCL目前仅专注于NLP的技术学习和分享感谢大家的关注与支持！
轻量级模型解读——轻量transformer系列 lishanlu136 #图像分类轻量级模型 transformer 图像分类
先占坑，持续更新。。。文章目录1、DeiT2、ConViT3、Mobile-Former4、MobileViTTransformer是2017谷歌提出的一篇论文，最早应用于NLP领域的机器翻译工作，Transformer解读，但随着2020年DETR和ViT的出现(DETR解读，ViT解读)，其在视觉领域的应用也如雨后春笋般渐渐出现，其特有的全局注意力机制给图像识别领域带来了重要参考。但是tran
FlagEmbedding 吉小雨 python库 python
FlagEmbedding教程FlagEmbedding是一个用于生成文本嵌入（textembeddings）的库，适合处理自然语言处理（NLP）中的各种任务。嵌入（embeddings）是将文本表示为连续向量，能够捕捉语义上的相似性，常用于文本分类、聚类、信息检索等场景。官方文档链接：FlagEmbedding官方GitHub一、FlagEmbedding库概述1.1什么是FlagEmbeddi
【NumPy】深入解析numpy.zeros()函数二七830 numpy
欢迎莅临我的个人主页这里是我深耕Python编程、机器学习和自然语言处理（NLP）领域，并乐于分享知识与经验的小天地！博主简介：我是二七830，一名对技术充满热情的探索者。多年的Python编程和机器学习实践，使我深入理解了这些技术的核心原理，并能够在实际项目中灵活应用。尤其是在NLP领域，我积累了丰富的经验，能够处理各种复杂的自然语言任务。技术专长：我熟练掌握Python编程语言，并深入研究了机
CV、NLP、数据控掘推荐、量化海的那边- AI算法自然语言处理人工智能
下面是对CV（计算机视觉）、NLP（自然语言处理）、数据挖掘推荐和量化的简要概述及其应用领域的介绍：1.CV（计算机视觉，ComputerVision）定义：计算机视觉是一门让计算机能够从图像或视频中提取有用信息，并做出决策的学科。它通过模拟人类的视觉系统来识别、处理和理解视觉信息。主要任务：图像分类：识别图像中的物体并分类，比如猫、狗、车等。目标检测：在图像或视频中定位并识别多个对象，如人脸检测
深度解析：如何使用输出解析器将大型语言模型（LLM）的响应解析为结构化JSON格式 m0_57781768 语言模型 json 人工智能
深度解析：如何使用输出解析器将大型语言模型（LLM）的响应解析为结构化JSON格式在现代自然语言处理（NLP）的应用中，大型语言模型（LLM）已经成为了重要的工具。这些模型能够生成丰富的自然语言文本，适用于各种应用场景。然而，在某些应用中，开发者不仅仅需要生成文本，还需要将这些生成的文本转换为结构化的数据格式，例如JSON。这种结构化的数据格式在数据传输、存储以及进一步处理时具有显著优势。本文将深
使用LangChain和OpenAI实现高效文本标注 aehrutktrjk langchain python
使用LangChain和OpenAI实现高效文本标注引言在自然语言处理(NLP)领域，文本标注是一项重要且常见的任务。它涉及为文本分配标签，如情感、语言、风格等。本文将介绍如何使用LangChain和OpenAI的API来实现高效的文本标注系统。我们将探讨如何设置环境、定义标注模式，以及如何使用OpenAI的模型来执行标注任务。环境准备首先，我们需要安装必要的库并设置API密钥：%pipinsta
【NLP5-RNN模型、LSTM模型和GRU模型】一蓑烟雨紫洛 nlp rnn lstm gru nlp
RNN模型、LSTM模型和GRU模型1、什么是RNN模型RNN（RecurrentNeuralNetwork)中文称为循环神经网络，它一般以序列数据为输入，通过网络内部的结构设计有效捕捉序列之间的关系特征，一般也是以序列形式进行输出RNN的循环机制使模型隐层上一时间步产生的结果，能够作为当下时间步输入的一部分（当下时间步的输入除了正常的输入外还包括上一步的隐层输出）对当下时间步的输出产生影响2、R
基于深度学习的文本引导的图像编辑 SEU-WYL 深度学习dnn 深度学习人工智能
基于深度学习的文本引导的图像编辑（Text-GuidedImageEditing）是一种通过自然语言文本指令对图像进行编辑或修改的技术。它结合了图像生成和自然语言处理（NLP）的最新进展，使用户能够通过描述性文本对图像内容进行精确的调整和操控。1.文本引导的图像编辑的挑战文本和图像之间的对齐：如何将文本中的语义信息准确地映射到图像中的特定区域或元素是一个关键挑战。这涉及到多模态数据的对齐和理解。编
甘超波：NLP婚姻中如何与老人相处甘超波
哈喽，大家好我是甘超波，是一名NLP爱好者，每天一篇原创文章或视频，分享我的实战经验和案例，希望给你些启发和帮助看一下，在家庭中子女与老人观念不一致时案例1：在教育孩子方面，老人习惯用老一套教育方式教育孙子，子女受不了老人这种习惯，从而发生口舌之争？2：在生活习惯方面，老人喜欢吃剩菜剩饭，子女受不了老人这种习惯，从而发生口舌之争？.....这样的事情，我相信你或多或少都听过和看过，甚至了深有感悟。
transformer架构(Transformer Architecture)原理与代码实战案例讲解 AI架构设计之禅大数据AI人工智能 Python入门实战计算科学神经计算深度学习神经网络大数据人工智能大型语言模型 AI AGI LLM Java Python 架构设计 Agent RPA
transformer架构(TransformerArchitecture)原理与代码实战案例讲解关键词：Transformer,自注意力机制,编码器-解码器,预训练,微调,NLP,机器翻译作者：禅与计算机程序设计艺术/ZenandtheArtofComputerProgramming1.背景介绍1.1问题的由来自然语言处理（NLP）领域的发展经历了从规则驱动到统计驱动再到深度学习驱动的三个阶段。
英伟达（NVIDIA）B200架构解读 weixin_41205263 芯际争霸 GPGPU架构 gpu算力人工智能硬件架构
H100芯片是一款高性能AI芯片，其中的TransformerEngine是专门用于加速Transformer模型计算的核心部件。Transformer模型是一种自然语言处理（NLP）模型，广泛应用于机器翻译、文本生成等任务。TransformerEngine的电路设计原理主要包括以下几个方面：
《昇思 25 天学习打卡营第 25 天 | 基于 MindSpore 实现 BERT 对话情绪识别》 Sam9029 Mindscope模型学习深度学习
《昇思25天学习打卡营第25天|基于MindSpore实现BERT对话情绪识别》活动地址：https://xihe.mindspore.cn/events/mindspore-training-camp签名：Sam9029环境配置确保安装了正确版本的MindSpore和MindNLP库。!pipuninstallmindspore-y!pipinstall-ihttps://pypi.mirror
基于人工智能的智能语音助手人工智能发烧友人工智能
语音助手的自然语言处理模块是语音助手系统的关键组成部分。通过这个模块，系统能够识别用户的意图并做出相应的回应。我们可以使用NLP技术来解析文本输入，并将其转换为系统可以理解的命令或指令。在本项目中，我们将结合语音识别、自然语言处理和语音合成技术，构建一个功能简化的语音助手。一、项目背景与需求分析1.1项目目标本项目旨在创建一个语音助手系统，它可以：1.语音识别：从用户的语音输入中提取文本信息。2.
NLP_jieba中文分词的常用模块 Hiweir · NLP_jieba的使用自然语言处理中文分词人工智能 nlp
1.jieba分词模式（1）精确模式:把句子最精确的切分开,比较适合文本分析.默认精确模式.（2）全模式:把句子中所有可能成词的词都扫描出来,cut_all=True,缺点:速度快,不能解决歧义（3）paddle:利用百度的paddlepaddle深度学习框架.简单来说就是使用百度提供的分词模型.use_paddle=True.（4）搜索引擎模式:在精确模式的基础上,对长词再进行切分,提高召回率,
Linux如何查看端口 lanhuazui10 linux操作系统 linux
方法一：lsof-i:端口号用于查看某一端口的占用情况，比如查看9092端口使用情况，lsof-i:9095可以看到9095端口已经被nginx占用方法二：netstat-tunlp|grep端口号，用于查看指定的端口号的进程情况，如查看5050端口的情况，netstat-tunlp|grep5050-t(tcp)仅显示tcp相关选项-u(udp)仅显示udp相关选项-n拒绝显示别名，能显示数字的
【笔记】自然语言处理NLP---概论 xhanZ NLP相关
（from人文学院开设课程）目录1.自然语言处理概论1.1自然语言处理研究的意义、历史与现状1.1.1自然语言的特点1.1.2自然语言处理研究的意义1.1.3国外研究现状1.2NLP的方法、特点和规律1.2.1理性主义与经验主义1.2.2语料库语言学：经验主义研究方法1.2.3汉语语言处理的方法1.2.4基于知识图谱的深度学习1.自然语言处理概论1.1自然语言处理研究的意义、历史与现状1.1.1自
【笔记与idea】——ACL2017论文报告会胖胖的飞象深度学习人工智能笔记 idea
这篇是2017年我有幸参加了中文信息学会组织的ACL2017论文报告会记的笔记，当时还是研一新生，对NLP感兴趣，偶然通过老师知晓了这次报告会，所以想去现场听听大牛们的idea、和大牛们交流（然而由于当时没有入门，啥也不懂，交流失败。。。）但是总的来说，非常感谢组织这次报告会的老师们，尽管没能和大牛们有效的交流，但是这次报告会相当于在最短的时间内读懂了数十篇精彩论文的核心内容，对我后面的学习起到了
如何利用AI技术来提升用户的个性化体验和社区参与度？ Itfuture03 AI前沿技术人工智能
要利用AI技术提升用户的个性化体验和社区参与度，可以采取以下几种策略：个性化推荐系统：通过AI算法分析用户的行为和偏好，提供定制化的服务和内容推荐，如智能推荐活动、健康管理等，让居民感受到社区的温暖和关怀。智能助手与聊天机器人：引入AI驱动的虚拟助手，提供实时帮助、个性化建议和交互式对话，改善客户体验。自然语言处理（NLP）：实现具有AI能力的NLP，创建对用户友好的应用程序，简化用户体验，如客服
【Python】成功解决IndexError: list index out of range 高斯小哥 BUG解决方案合集 python list 新手入门学习 debug
【Python】成功解决IndexError:listindexoutofrange下滑查看解决方法欢迎莅临我的个人主页这里是我静心耕耘深度学习领域、真诚分享知识与智慧的小天地！博主简介：985高校的普通本硕，曾有幸发表过人工智能领域的中科院顶刊一作论文，熟练掌握PyTorch框架。技术专长：在CV、NLP及多模态等领域有丰富的项目实战经验。已累计一对一为数百位用户提供近千次专业服务，助力他们少走
使用Python和Jieba库进行中文情感分析：从文本预处理到模型训练的完整指南快撑死的鱼 Python算法精解 python 人工智能开发语言
使用Python和Jieba库进行中文情感分析：从文本预处理到模型训练的完整指南情感分析（SentimentAnalysis）是自然语言处理（NLP）领域中的一个重要分支，旨在从文本中识别出情绪、态度或意见等主观信息。在中文文本处理中，由于语言特性不同于英语，如何高效、准确地分词和提取关键词成为情感分析的关键步骤之一。在这篇文章中，我们将深入探讨如何使用Python和Jieba库进行中文情感分析，
论文阅读笔记: DINOv2: Learning Robust Visual Features without Supervision 小夏refresh 论文计算机视觉深度学习论文阅读笔记深度学习计算机视觉人工智能
DINOv2:LearningRobustVisualFeatureswithoutSupervision论文地址:https://arxiv.org/abs/2304.07193代码地址:https://github.com/facebookresearch/dinov2摘要大量数据上的预训练模型在NLP方面取得突破，为计算机视觉中的类似基础模型开辟了道路。这些模型可以通过生成通用视觉特征(即无
第3篇：LangChain的架构总览与设计理念 Gemini技术窝 langchain 架构大数据人工智能 AIGC nlp
LangChain库是一个专为自然语言处理（NLP）设计的强大工具包，致力于简化复杂语言模型链的构建和执行。在本文中，我们将深入解析LangChain库的架构，详细列出其核心组件、设计理念及其在不同场景中的应用，并讨论其优缺点。文章目录1.LangChain库简介2.核心组件2.1数据输入模块作用2.2数据预处理模块作用2.3数据增强模块作用2.4数据加载与批处理模块作用2.5模型训练模块作用2.
读李中莹先生论“阿Q精神" 猫咪06
这阵子重读《重塑心灵》，对“阿Q精神"一段很有感慨，在我们从小的信念里，阿Q的精神胜利法是被贬低的，是对无能力改变自己的境遇时，似手只能采用自我安慰的人的讽刺。李中莹先生在他的书中结合对话者的认可，定义阿Q精神“只求精神胜利，罔顾真实情况"，他就针对这两句话，解析阿Q精神，并进行了肯定‘，。首先“精神胜利"指的是自己内心有成功的感觉，这很符合NLP!如果所有人都认为你成功，而你自己没有成功的喜悦，
书单用户5521
提高思维（13本）：影响力逻辑思维（理查德·尼斯贝特）离经叛道:不按常理出牌的人如何改变世界（只看最后一章总结即可）改变:问题形成和解决的原则语言的魔力:谈笑间转变信念之NLP技巧（意识到语言顺序的重要性）改变心理学的40项研究对伪心理学说不你的误区:如何摆脱负面思维掌控你的生活战胜拖拉你的灯亮着吗?别做正常的傻瓜学会提问:批判性思维指南不确定世界的理性选择小说（5本）：霍乱时期的爱情那些回不去的
【Python】解决AttributeError: ‘NoneType‘ object has no attribute ‘xxxx‘ 云天徽上 Pandas python 开发语言 pandas 机器学习 numpy
【Python】解决AttributeError:'NoneType'objecthasnoattribute'xxxx'报错欢迎莅临我的个人主页这里是我深耕Python编程、机器学习和自然语言处理（NLP）领域，并乐于分享知识与经验的小天地！博主简介：我是云天徽上，一名对技术充满热情的探索者。多年的Python编程和机器学习实践，使我深入理解了这些技术的核心原理，并能够在实际项目中灵活应用。尤其
【自然语言处理】自然语言处理NLP概述及应用 @我们的天空人工智能技术 nlp 人工智能深度学习 python 机器学习自然语言处理 scikit-learn
自然语言处理（NaturalLanguageProcessing，简称NLP）是一门集计算机科学、人工智能以及语言学于一体的交叉学科，致力于让计算机能够理解、解析、生成和处理人类的自然语言。它是人工智能领域的一个关键分支，旨在缩小人与机器之间的交流障碍，使得机器能够更有效地识别并响应人类的自然语言指令或内容。自然语言处理NLP概述基本任务：文本分类：将文本划分为预定义的类别，如情感分析、主题分类等
OPENAI中RAG实现原理以及示例代码用PYTHON来实现 dzend aigc python 开发语言 ai
OPENAI中RAG实现原理以及示例代码用PYTHON来实现1.引言在当今人工智能领域，自然语言处理（NLP）是一个非常重要的研究方向。近年来，OPENAI发布了许多创新的NLP模型，其中之一就是RAG（Retrieval-AugmentedGeneration）模型。RAG模型结合了检索和生成两种方法，可以用于生成与给定问题相关的高质量文本。本文将介绍RAG模型的实现原理，并提供使用Python
开源AI图像识别：支持扫描文件批量识别快速对接数据库存储思通数科x 人工智能计算机视觉图像处理 OCR 文本识别
随着数字化转型的不断深入，图像识别技术在各行各业中的应用越来越广泛。文件封识别作为图像识别技术的一个分支，能够有效地提高文件处理的自动化程度和准确性。本文将探讨文件封识别技术的原理、应用场景以及如何将识别后的内容批量对应数据库字段进行存储。开源项目介绍(可本地部署，支持国产化)思通数科研发了一款多模态AI能力引擎，专注于提供自然语言处理（NLP）、情感分析、实体识别、图像识别与分类、OCR识别和语
ztree异步加载 3213213333332132 JavaScript Ajax json Web ztree
相信新手用ztree的时候,对异步加载会有些困惑，我开始的时候也是看了API花了些时间才搞定了异步加载，在这里分享给大家。我后台代码生成的是json格式的数据，数据大家按各自的需求生成，这里只给出前端的代码。设置setting，这里只关注async属性的配置 var setting = { //异步加载配置
thirft rpc 具体调用流程 BlueSkator 中间件 rpc thrift
Thrift调用过程中，Thrift客户端和服务器之间主要用到传输层类、协议层类和处理类三个主要的核心类，这三个类的相互协作共同完成rpc的整个调用过程。在调用过程中将按照以下顺序进行协同工作：（1）将客户端程序调用的函数名和参数传递给协议层（TProtocol），协议
异或运算推导, 交换数据 dcj3sjt126com PHP 异或 ^
/* * 5 0101 * 9 1010 * * 5 ^ 5 * 0101 * 0101 * ----- * 0000 * 得出第一个规律: 相同的数进行异或, 结果是0 * * 9 ^ 5 ^ 6 * 1010 * 0101 * ---- * 1111 * * 1111 * 0110 * ---- * 1001
事件源对象周华华 JavaScript
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> <html xmlns="http://www.w3.org/1999/xhtml&q
MySql配置及相关命令 g21121 mysql
MySQL安装完毕后我们需要对它进行一些设置及性能优化，主要包括字符集设置，启动设置，连接优化，表优化，分区优化等等。一修改MySQL密码及用户
[简单]poi删除excel 2007超链接 53873039oycg Excel
采用解析sheet.xml方式删除超链接，缺点是要打开文件2次,代码如下: public void removeExcel2007AllHyperLink(String filePath) throws Exception { OPCPackage ocPkg = OPCPac
Struts2添加 open flash chart 云端月影
准备以下开源项目： 1. Struts 2.1.6 2. Open Flash Chart 2 Version 2 Lug Wyrm Charmer (28th, July 2009) 3. jofc2，这东西不知道是没做好还是什么意思，好像和ofc2不怎么匹配，最好下源码，有什么问题直接改。 4. log4j 用eclipse新建动态网站，取名OFC2Demo，将Struts2 l
spring包详解 aijuans spring
下载的spring包中文件及各种包众多，在项目中往往只有部分是我们必须的，如果不清楚什么时候需要什么包的话，看看下面就知道了。 aspectj目录下是在Spring框架下使用aspectj的源代码和测试程序文件。Aspectj是java最早的提供AOP的应用框架。 dist 目录下是Spring 的发布包，关于发布包下面会详细进行说明。 docs&nb
网站推广之seo概念 antonyup_2006 算法 Web 应用服务器搜索引擎 Google
持续开发一年多的b2c网站终于在08年10月23日上线了。作为开发人员的我在修改bug的同时，准备了解下网站的推广分析策略。所谓网站推广，目的在于让尽可能多的潜在用户了解并访问网站，通过网站获得有关产品和服务等信息，为最终形成购买决策提供支持。网站推广策略有很多，seo，email，adv
单例模式,sql注入,序列百合不是茶单例模式序列 sql注入预编译
序列在前面写过有关的博客,也有过总结,但是今天在做一个JDBC操作数据库的相关内容时需要使用序列创建一个自增长的字段居然不会了,所以将序列写在本篇的前面 1,序列是一个保存数据连续的增长的一种方式; 序列的创建; CREATE SEQUENCE seq_pro 2 INCREMENT BY 1 -- 每次加几个 3
Mockito单元测试实例 bijian1013 单元测试 mockito
Mockito单元测试实例： public class SettingServiceTest { private List<PersonDTO> personList = new ArrayList<PersonDTO>(); @InjectMocks private SettingPojoService settin
精通Oracle10编程SQL(9)使用游标 bijian1013 oracle 数据库 plsql
/* *使用游标 */ --显示游标 --在显式游标中使用FETCH...INTO语句 DECLARE CURSOR emp_cursor is select ename,sal from emp where deptno=1; v_ename emp.ename%TYPE; v_sal emp.sal%TYPE; begin ope
【Java语言】动态代理 bit1129 java语言
JDK接口动态代理 JDK自带的动态代理通过动态的根据接口生成字节码(实现接口的一个具体类)的方式，为接口的实现类提供代理。被代理的对象和代理对象通过InvocationHandler建立关联 package com.tom; import com.tom.model.User; import com.tom.service.IUserService;
Java通信之URL通信基础白糖_ java jdk webservice 网络协议 ITeye
java对网络通信以及提供了比较全面的jdk支持，java.net包能让程序员直接在程序中实现网络通信。在技术日新月异的现在，我们能通过很多方式实现数据通信，比如webservice、url通信、socket通信等等，今天简单介绍下URL通信。学习准备：建议首先学习java的IO基础知识 URL是统一资源定位器的简写，URL可以访问Internet和www，可以通过url
博弈Java讲义 - Java线程同步 (1) boyitech java 多线程同步锁
在并发编程中经常会碰到多个执行线程共享资源的问题。例如多个线程同时读写文件，共用数据库连接，全局的计数器等。如果不处理好多线程之间的同步问题很容易引起状态不一致或者其他的错误。同步不仅可以阻止一个线程看到对象处于不一致的状态，它还可以保证进入同步方法或者块的每个线程，都看到由同一锁保护的之前所有的修改结果。处理同步的关键就是要正确的识别临界条件（cri
java-给定字符串，删除开始和结尾处的空格，并将中间的多个连续的空格合并成一个。 bylijinnan java
public class DeleteExtraSpace { /** * 题目：给定字符串，删除开始和结尾处的空格，并将中间的多个连续的空格合并成一个。 * 方法1.用已有的String类的trim和replaceAll方法 * 方法2.全部用正则表达式，这个我不熟 * 方法3.“重新发明轮子”，从头遍历一次 */ public static v
An error has occurred.See the log file错误解决！ Kai_Ge MyEclipse
今天早上打开MyEclipse时，自动关闭！弹出An error has occurred.See the log file错误提示！很郁闷昨天启动和关闭还好着！！！打开几次依然报此错误，确定不是眼花了！打开日志文件！找到当日错误文件内容： --------------------------------------------------------------------------
[矿业与工业]修建一个空间矿床开采站要多少钱? comsci
地球上的钛金属矿藏已经接近枯竭........... 我们在冥王星的一颗卫星上面发现一些具有开采价值的矿床..... 那么,现在要编制一个预算,提交给财政部门..
解析Google Map Routes dai_lm google api
为了获得从A点到B点的路劲，经常会使用Google提供的API，例如 [url] http://maps.googleapis.com/maps/api/directions/json?origin=40.7144,-74.0060&destination=47.6063,-122.3204&sensor=false [/url] 从返回的结果上，大致可以了解应该怎么走，但
SQL还有多少“理所应当”？ datamachine sql
转贴存档，原帖地址：http://blog.chinaunix.net/uid-29242841-id-3968998.html、http://blog.chinaunix.net/uid-29242841-id-3971046.html！ ------------------------------------华丽的分割线--------------------------------
Yii使用Ajax验证时，如何设置某些字段不需要验证 dcj3sjt126com Ajax yii
经常像你注册页面,你可能非常希望只需要Ajax去验证用户名和Email,而不需要使用Ajax再去验证密码,默认如果你使用Yii 内置的ajax验证Form,例如: $form=$this->beginWidget('CActiveForm', array( 'id'=>'usuario-form',&
使用git同步网站代码 dcj3sjt126com crontab git
转自:http://ued.ctrip.com/blog/?p=3646?tn=gongxinjun.com 管理一网站，最开始使用的虚拟空间，采用提供商支持的ftp上传网站文件，后换用vps，vps可以自己搭建ftp的，但是懒得搞，直接使用scp传输文件到服务器，现在需要更新文件到服务器，使用scp真的很烦。发现本人就职的公司，采用的git+rsync的方式来管理、同步代码，遂
sql基本操作蕃薯耀 sql sql基本操作 sql常用操作
sql基本操作 >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 蕃薯耀 2015年6月1日 17:30:33 星期一 &
Spring4+Hibernate4+Atomikos3.3多数据源事务管理 hanqunfeng Hibernate4
Spring3+后不再对JTOM提供支持，所以可以改用Atomikos管理多数据源事务。Spring2.5+Hibernate3+JTOM参考：http://hanqunfeng.iteye.com/blog/1554251Atomikos官网网站：http://www.atomikos.com/ 一.pom.xml <dependency> <
jquery中两个值得注意的方法one()和trigger()方法 jackyrong trigger
在jquery中，有两个值得注意但容易忽视的方法，分别是one()方法和trigger()方法,这是从国内作者<<jquery权威指南》一书中看到不错的介绍 1） one方法 one方法的功能是让所选定的元素绑定一个仅触发一次的处理函数，格式为 one(type,${data},fn) &nb
拿工资不仅仅是让你写代码的 lampcy 工作面试咨询
这是我对团队每个新进员工说的第一件事情。这句话的意思是，我并不关心你是如何快速完成任务的，哪怕代码很差，只要它像救生艇通气门一样管用就行。这句话也是我最喜欢的座右铭之一。这个说法其实很合理：我们的工作是思考客户提出的问题，然后制定解决方案。思考第一，代码第二，公司请我们的最终目的不是写代码，而是想出解决方案。话粗理不粗。付你薪水不是让你来思考的，也不是让你来写代码的，你的目的是交付产品
架构师之对象操作----------对象的效率复制和判断是否全为空 nannan408 架构师
1.前言。如题。 2.代码。 (1)对象的复制，比spring的beanCopier在大并发下效率要高，利用net.sf.cglib.beans.BeanCopier Src src=new Src(); BeanCopier beanCopier = BeanCopier.create(Src.class, Des.class, false);
ajax 被缓存的解决方案 Rainbow702 JavaScript jquery Ajax cache 缓存
使用jquery的ajax来发送请求进行局部刷新画面，各位可能都做过。今天碰到一个奇怪的现象，就是，同一个ajax请求，在chrome中，不论发送多少次，都可以发送至服务器端，而不会被缓存。但是，换成在IE下的时候，发现，同一个ajax请求，会发生被缓存的情况，只有第一次才会被发送至服务器端，之后的不会再被发送。郁闷。解决方法如下： ① 直接使用 JQuery提供的 “cache”参数，
修改date.toLocaleString()的警告 tntxia String
我们在写程序的时候，经常要查看时间，所以我们经常会用到date.toLocaleString()，但是date.toLocaleString()是一个过时的API，代替的方法如下： package com.tntxia.htmlmaker.util; import java.text.SimpleDateFormat; import java.util.
项目完成后的小总结 xiaomiya js 总结项目
项目完成了，突然想做个总结但是有点无从下手了。做之前对于客户端给的接口很模式。然而定义好了格式要求就如此的愉快了。先说说项目主要实现的功能吧 1，按键精灵 2，获取行情数据 3，各种input输入条件判断 4，发送数据（有json格式和string格式） 5，获取预警条件列表和预警结果列表， 6，排序， 7，预警结果分页获取 8，导出文件（excel，text等） 9，修