This article is a record of the Alibaba Tianchi NLP beginner competition. It walks through the whole data processing pipeline and shows how to build a model from scratch, so it is suitable for newcomers.
The competition uses news data as the contest data; the dataset becomes visible and downloadable after registration. The data is news text, anonymized at the character level, and divided into 14 candidate categories: finance, lottery, real estate, stocks, home, education, technology, society, fashion, politics, sports, horoscope, games, and entertainment. In essence, it is a 14-class classification problem.
The data consists of the following parts: a training set of 200,000 samples, test set A with 50,000 samples, and test set B with 50,000 samples.
Competition page: https://tianchi.aliyun.com/competition/entrance/531810/introduction
The data can be downloaded from the link above.
Code: https://github.com/zhangxiann/Tianchi-NLP-Beginner
The series is split into 3 articles.
In the previous article, we covered the data preprocessing pipeline.
This article explains the Bert source code.
The Bert model is the encoder of the Transformer, and the main code lives in modeling.py.
Let's first look at the BertConfig class, which holds the hyperparameters of BertModel and reads them from the bert_config.json file.
The parameters in bert_config.json, with explanations, are as follows:
{
"hidden_size": 256, # length of the hidden-layer output vectors
"hidden_act": "gelu", # activation function
"initializer_range": 0.02, # range used for parameter initialization
"vocab_size": 5981, # vocabulary size
"hidden_dropout_prob": 0.1, # dropout probability for the hidden layers
"num_attention_heads": 4, # number of attention heads in each hidden layer
"type_vocab_size": 2, # number of segment_ids categories [0, 1], i.e. the number of sentences
"max_position_embeddings": 256, # must be at least seq_length; used to generate the position embeddings
"num_hidden_layers": 4, # number of hidden layers
"intermediate_size": 1024, # up-projection size of the feed-forward network
"attention_probs_dropout_prob": 0.1 # dropout probability applied after the softmax when computing attention
}
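As a quick check of these values, the BertConfig class in modeling.py can load them directly. This is a minimal sketch, assuming the from_json_file / to_json_string helpers that ship with the original modeling.py; the file path is simply wherever you saved your bert_config.json:

import modeling

# Note: the real bert_config.json contains no "#" comments; they are shown above
# only as explanations. Point the path at your own copy of the file.
config = modeling.BertConfig.from_json_file("bert_config.json")
print(config.hidden_size)        # 256
print(config.num_hidden_layers)  # 4
print(config.to_json_string())   # dumps the full configuration back out as JSON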
The input parameters of BertModel include a BertConfig object, which configures the hyperparameters (the full argument list is documented in the constructor docstring below). The main flow is:

1. Call embedding_lookup() to convert input_ids into embeddings, returning embedding_output and embedding_table.
2. Call embedding_postprocessor() to add the position encoding and the sentence-id encoding to embedding_output.
3. Call create_attention_mask_from_input_mask() to build attention_mask from input_mask.
4. Call transformer_model to build the Bert model, run the data through it, and obtain sequence_output, the output of the last encoder layer.
5. Take the output of the first token in sequence_output and pass it through a fully connected layer to get pooled_output.

class BertModel(object):
"""BERT model ("Bidirectional Encoder Representations from Transformers").
Example usage:

input_ids = tf.constant([[31, 51, 99], [15, 5, 0]])
input_mask = tf.constant([[1, 1, 1], [1, 1, 0]])
token_type_ids = tf.constant([[0, 0, 1], [0, 2, 0]])

config = modeling.BertConfig(vocab_size=32000, hidden_size=512,
    num_hidden_layers=8, num_attention_heads=6, intermediate_size=1024)

model = modeling.BertModel(config=config, is_training=True,
    input_ids=input_ids, input_mask=input_mask, token_type_ids=token_type_ids)

label_embeddings = tf.get_variable(...)
pooled_output = model.get_pooled_output()
logits = tf.matmul(pooled_output, label_embeddings)
...
"""
def __init__(self,
config,
is_training,
input_ids,
input_mask=None,
token_type_ids=None,
use_one_hot_embeddings=False,
scope=None):
"""Constructor for BertModel.
Args:
config: `BertConfig` instance.
is_training: bool. true for training model, false for eval model. Controls
whether dropout will be applied.
input_ids: int32 Tensor of shape [batch_size, seq_length].
input_mask: (optional) int32 Tensor of shape [batch_size, seq_length].
token_type_ids: (optional) int32 Tensor of shape [batch_size, seq_length].
use_one_hot_embeddings: (optional) bool. Whether to use one-hot word
embeddings or tf.embedding_lookup() for the word embeddings.
scope: (optional) variable scope. Defaults to "bert".
Raises:
ValueError: The config is invalid or one of the input tensor shapes
is invalid.
"""
config = copy.deepcopy(config)
# If not training (i.e., evaluating), disable dropout
if not is_training:
config.hidden_dropout_prob = 0.0
config.attention_probs_dropout_prob = 0.0
# Convert the shape to a list
input_shape = get_shape_list(input_ids, expected_rank=2)
batch_size = input_shape[0]
seq_length = input_shape[1]
if input_mask is None:
input_mask = tf.ones(shape=[batch_size, seq_length], dtype=tf.int32)
if token_type_ids is None:
token_type_ids = tf.zeros(shape=[batch_size, seq_length], dtype=tf.int32)
with tf.variable_scope(scope, default_name="bert"):
with tf.variable_scope("embeddings"):
# Perform embedding lookup on the word ids.
# embedding_output: [batch_size, seq_length, hidden_size]
# embedding_table: [vocab_size, hidden_size]
(self.embedding_output, self.embedding_table) = embedding_lookup(
input_ids=input_ids,
vocab_size=config.vocab_size,
embedding_size=config.hidden_size,
initializer_range=config.initializer_range,
word_embedding_name="word_embeddings",
use_one_hot_embeddings=use_one_hot_embeddings)
# Add positional embeddings and token type embeddings, then layer
# normalize and perform dropout.
# Add the position embeddings (and token-type embeddings)
# embedding_output: [batch_size, seq_length, hidden_size]
self.embedding_output = embedding_postprocessor(
input_tensor=self.embedding_output,
use_token_type=True,
token_type_ids=token_type_ids,
token_type_vocab_size=config.type_vocab_size,
token_type_embedding_name="token_type_embeddings",
use_position_embeddings=True,
position_embedding_name="position_embeddings",
initializer_range=config.initializer_range,
max_position_embeddings=config.max_position_embeddings,
dropout_prob=config.hidden_dropout_prob)
with tf.variable_scope("encoder"):
# This converts a 2D mask of shape [batch_size, seq_length] to a 3D
# mask of shape [batch_size, seq_length, seq_length] which is used
# for the attention scores.
# attention_mask: [batch_size, seq_length, seq_length]
attention_mask = create_attention_mask_from_input_mask(
input_ids, input_mask)
# Run the stacked transformer.
# `sequence_output` shape = [batch_size, seq_length, hidden_size].
self.all_encoder_layers = transformer_model(
input_tensor=self.embedding_output,
attention_mask=attention_mask,
hidden_size=config.hidden_size,
num_hidden_layers=config.num_hidden_layers,
num_attention_heads=config.num_attention_heads,
intermediate_size=config.intermediate_size,
intermediate_act_fn=get_activation(config.hidden_act),
hidden_dropout_prob=config.hidden_dropout_prob,
attention_probs_dropout_prob=config.attention_probs_dropout_prob,
initializer_range=config.initializer_range,
do_return_all_layers=True)
# Take the output of the last encoder layer
self.sequence_output = self.all_encoder_layers[-1]
# The "pooler" converts the encoded sequence tensor of shape
# [batch_size, seq_length, hidden_size] to a tensor of shape
# [batch_size, hidden_size]. This is necessary for segment-level
# (or segment-pair-level) classification tasks where we need a fixed
# dimensional representation of the segment.
with tf.variable_scope("pooler"):
# We "pool" the model by simply taking the hidden state corresponding
# to the first token. We assume that this has been pre-trained
# Take the output of the first token, i.e. the output corresponding to [CLS]
first_token_tensor = tf.squeeze(self.sequence_output[:, 0:1, :], axis=1)
# Pass it through a fully connected layer to get the final output
self.pooled_output = tf.layers.dense(
first_token_tensor,
config.hidden_size,
activation=tf.tanh,
kernel_initializer=create_initializer(config.initializer_range))
# Returns the pooled output of the first token from the last encoder layer
def get_pooled_output(self):
return self.pooled_output
# Returns the output of the last encoder layer
def get_sequence_output(self):
"""Gets final hidden layer of encoder.
Returns:
float Tensor of shape [batch_size, seq_length, hidden_size] corresponding
to the final hidden of the transformer encoder.
"""
return self.sequence_output
# Returns the outputs of all encoder layers
def get_all_encoder_layers(self):
return self.all_encoder_layers
# Returns the embeddings after adding the positional embeddings and token type embeddings, followed by layer norm
# [batch_size, seq_length, hidden_size]
def get_embedding_output(self):
"""Gets output of the embedding lookup (i.e., input to the transformer).
Returns:
float Tensor of shape [batch_size, seq_length, hidden_size] corresponding
to the output of the embedding layer, after summing the word
embeddings with the positional embeddings and the token type embeddings,
then performing layer normalization. This is the input to the transformer.
"""
return self.embedding_output
# Returns the word embedding matrix: [vocab_size, hidden_size]
def get_embedding_table(self):
return self.embedding_table
The code comments include a demo of how to use BertModel, shown below:
input_ids = tf.constant([[31, 51, 99], [15, 5, 0]])
input_mask = tf.constant([[1, 1, 1], [1, 1, 0]])
token_type_ids = tf.constant([[0, 0, 1], [0, 2, 0]])
config = modeling.BertConfig(vocab_size=32000, hidden_size=512, num_hidden_layers=8, num_attention_heads=6, intermediate_size=1024)
model = modeling.BertModel(config=config, is_training=True, input_ids=input_ids, input_mask=input_mask, token_type_ids=token_type_ids)
label_embeddings = tf.get_variable(...)
pooled_output = model.get_pooled_output()
logits = tf.matmul(pooled_output, label_embeddings)
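As a sketch continuing that demo (using only the getter methods defined in the class above), the difference between the token-level output and the pooled output looks like this:

# sequence_output keeps one vector per token; pooled_output is one vector per
# example, i.e. the transformed hidden state of the first ([CLS]) token.
sequence_output = model.get_sequence_output()  # [batch_size, seq_length, hidden_size] = [2, 3, 512]
pooled_output = model.get_pooled_output()      # [batch_size, hidden_size] = [2, 512]
embedding_table = model.get_embedding_table()  # [vocab_size, hidden_size] = [32000, 512]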
embedding_lookup() takes input_ids and returns the corresponding embeddings, embedding_output, together with the embedding matrix of the whole vocabulary, embedding_table.

The main steps are:

1. Create embedding_table.
2. Look up the output corresponding to input_ids.
3. Return output and embedding_table.

def embedding_lookup(input_ids,
vocab_size,
embedding_size=128,
initializer_range=0.02,
word_embedding_name="word_embeddings",
use_one_hot_embeddings=False):
"""Looks up words embeddings for id tensor.
Args:
input_ids: int32 Tensor of shape [batch_size, seq_length] containing word
ids.
vocab_size: int. Size of the embedding vocabulary.
embedding_size: int. Width of the word embeddings.
initializer_range: float. Embedding initialization range.
word_embedding_name: string. Name of the embedding table.
use_one_hot_embeddings: bool. If True, use one-hot method for word
embeddings. If False, use `tf.gather()`.
Returns:
float Tensor of shape [batch_size, seq_length, embedding_size].
"""
# This function assumes that the input is of shape [batch_size, seq_length,
# num_inputs].
#
# If the input is a 2D tensor of shape [batch_size, seq_length], we
# reshape to [batch_size, seq_length, 1].
if input_ids.shape.ndims == 2:
input_ids = tf.expand_dims(input_ids, axis=[-1])
# Get the embedding table: [vocab_size, embedding_size]
embedding_table = tf.get_variable(
name=word_embedding_name,
shape=[vocab_size, embedding_size],
initializer=create_initializer(initializer_range))
# flat_input_ids: [batch_size * seq_length]
flat_input_ids = tf.reshape(input_ids, [-1])
if use_one_hot_embeddings:
# If using one-hot embeddings, multiply the one-hot matrix by the table to get the embeddings
one_hot_input_ids = tf.one_hot(flat_input_ids, depth=vocab_size)
output = tf.matmul(one_hot_input_ids, embedding_table)
else:
# Otherwise, gather the embeddings from the table by index
output = tf.gather(embedding_table, flat_input_ids)
input_shape = get_shape_list(input_ids)
# output: [batch_size * seq_length, embedding_size]
# -> [batch_size, seq_length, embedding_size]
output = tf.reshape(output,
input_shape[0:-1] + [input_shape[-1] * embedding_size])
return (output, embedding_table)
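To see why the two branches above give the same result, here is a small NumPy sketch (not part of modeling.py) comparing the one-hot matmul with a plain index lookup:

import numpy as np

vocab_size, embedding_size = 6, 4
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(vocab_size, embedding_size))

flat_input_ids = np.array([3, 1, 5])                  # [batch_size * seq_length]
one_hot_input_ids = np.eye(vocab_size)[flat_input_ids]
output_one_hot = one_hot_input_ids @ embedding_table  # the use_one_hot_embeddings branch
output_gather = embedding_table[flat_input_ids]       # the tf.gather branch

assert np.allclose(output_one_hot, output_gather)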
embedding_postprocessor() adds the position encoding and the sentence-id encoding to embedding_output.

Here, the Token Embeddings correspond to input_tensor, the Segment Embeddings to token_type_embeddings, and the Position Embeddings to position_embeddings; the three are summed to produce the final output.

The flow is:

1. Assign input_tensor to output.
2. Generate token_type_embeddings (the sentence-id encoding) from token_type_ids and add it to output.
3. Generate position_embeddings (the position encoding) and add it to output.

Note that the position_embeddings here differ from the method in the paper: in this code they are learned, whereas in the paper they are fixed values (a sketch of the paper's encoding is given after the function below):

$$PE_{(pos, 2i)} = \sin\!\left(pos / 10000^{2i/d_{\text{model}}}\right), \qquad PE_{(pos, 2i+1)} = \cos\!\left(pos / 10000^{2i/d_{\text{model}}}\right)$$

def embedding_postprocessor(input_tensor,
use_token_type=False,
token_type_ids=None,
token_type_vocab_size=16,
token_type_embedding_name="token_type_embeddings",
use_position_embeddings=True,
position_embedding_name="position_embeddings",
initializer_range=0.02,
max_position_embeddings=512,
dropout_prob=0.1):
"""Performs various post-processing on a word embedding tensor.
Args:
input_tensor: float Tensor of shape [batch_size, seq_length,
embedding_size].
use_token_type: bool. Whether to add embeddings for `token_type_ids`.
token_type_ids: (optional) int32 Tensor of shape [batch_size, seq_length].
Must be specified if `use_token_type` is True.
token_type_vocab_size: int. The vocabulary size of `token_type_ids`.
token_type_embedding_name: string. The name of the embedding table variable
for token type ids.
use_position_embeddings: bool. Whether to add position embeddings for the
position of each token in the sequence.
position_embedding_name: string. The name of the embedding table variable
for positional embeddings.
initializer_range: float. Range of the weight initialization.
max_position_embeddings: int. Maximum sequence length that might ever be
used with this model. This can be longer than the sequence length of
input_tensor, but cannot be shorter.
dropout_prob: float. Dropout probability applied to the final output tensor.
Returns:
float tensor with same shape as `input_tensor`.
Raises:
ValueError: One of the tensor shapes or input values is invalid.
"""
input_shape = get_shape_list(input_tensor, expected_rank=3)
batch_size = input_shape[0]
seq_length = input_shape[1]
# width is the embedding_size
width = input_shape[2]
# output: [batch_size, seq_length, embedding_size]
output = input_tensor
# Add the sentence-id (segment) encoding
if use_token_type:
if token_type_ids is None:
raise ValueError("`token_type_ids` must be specified if"
"`use_token_type` is True.")
# token_type_table: [token_type_vocab_size, embedding_size]
token_type_table = tf.get_variable(
name=token_type_embedding_name,
shape=[token_type_vocab_size, width],
initializer=create_initializer(initializer_range))
# This vocab will be small so we always do one-hot here, since it is always
# faster for a small vocabulary.
# flat_token_type_ids: [batch_size * seq_length]
flat_token_type_ids = tf.reshape(token_type_ids, [-1])
# one_hot_ids: [batch_size * seq_length, token_type_vocab_size]
one_hot_ids = tf.one_hot(flat_token_type_ids, depth=token_type_vocab_size)
# Multiply the one-hot ids by the table to get token_type_embeddings
token_type_embeddings = tf.matmul(one_hot_ids, token_type_table)
token_type_embeddings = tf.reshape(token_type_embeddings,
[batch_size, seq_length, width])
# Add to the output
# output: [batch_size, seq_length, embedding_size]
output += token_type_embeddings
if use_position_embeddings:
assert_op = tf.assert_less_equal(seq_length, max_position_embeddings)
with tf.control_dependencies([assert_op]):
# full_position_embeddings: [max_position_embeddings, embedding_size]
full_position_embeddings = tf.get_variable(
name=position_embedding_name,
shape=[max_position_embeddings, width],
initializer=create_initializer(initializer_range))
# Since the position embedding table is a learned variable, we create it
# using a (long) sequence length `max_position_embeddings`. The actual
# sequence length might be shorter than this, for faster training of
# tasks that do not have long sequences.
#
# So `full_position_embeddings` is effectively an embedding table
# for position [0, 1, 2, ..., max_position_embeddings-1], and the current
# sequence has positions [0, 1, 2, ... seq_length-1], so we can just
# perform a slice.
# full_position_embeddings has length max_position_embeddings, while the actual
# sentence length is seq_length, so we slice it
# position_embeddings: [seq_length, embedding_size]
position_embeddings = tf.slice(full_position_embeddings, [0, 0],
[seq_length, -1])
num_dims = len(output.shape.as_list())
# Only the last two dimensions are relevant (`seq_length` and `width`), so
# we broadcast among the first dimensions, which is typically just
# the batch size.
position_broadcast_shape = []
for _ in range(num_dims - 2):
position_broadcast_shape.append(1)
# position_broadcast_shape: [1, seq_length, embedding_size]
position_broadcast_shape.extend([seq_length, width])
position_embeddings = tf.reshape(position_embeddings,
position_broadcast_shape)
# Add to the output
# output: [batch_size, seq_length, embedding_size]
output += position_embeddings
# Layer norm and dropout
output = layer_norm_and_dropout(output, dropout_prob)
return output
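For contrast with the learned position_embeddings variable above, here is a small NumPy sketch (not from modeling.py) of the fixed sinusoidal position encoding used in the Transformer paper:

import numpy as np

def sinusoidal_position_encoding(seq_length, d_model):
    # PE(pos, 2i) = sin(pos / 10000^(2i / d_model)), PE(pos, 2i + 1) = cos(pos / 10000^(2i / d_model))
    positions = np.arange(seq_length)[:, None]   # [seq_length, 1]
    dims = np.arange(d_model)[None, :]           # [1, d_model]
    angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates             # [seq_length, d_model]
    pe = np.zeros((seq_length, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])
    pe[:, 1::2] = np.cos(angles[:, 1::2])
    return pe

pe = sinusoidal_position_encoding(seq_length=256, d_model=256)
print(pe.shape)  # (256, 256), the same shape as the sliced position_embeddings above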
create_attention_mask_from_input_mask() builds attention_mask from input_mask; the shape of attention_mask is [batch_size, from_seq_length, to_seq_length].

# Create a 3D attention mask from the 2D from_tensor / to_mask
def create_attention_mask_from_input_mask(from_tensor, to_mask):
"""Create 3D attention mask from a 2D tensor mask.
Args:
from_tensor: 2D or 3D Tensor of shape [batch_size, from_seq_length, ...].
to_mask: int32 Tensor of shape [batch_size, to_seq_length].
Returns:
float Tensor of shape [batch_size, from_seq_length, to_seq_length].
"""
# from_tensor: [batch_size, seq_length]
from_shape = get_shape_list(from_tensor, expected_rank=[2, 3])
batch_size = from_shape[0]
from_seq_length = from_shape[1]
# to_mask: [batch_size, seq_length]
to_shape = get_shape_list(to_mask, expected_rank=2)
# to_seq_length: seq_length
to_seq_length = to_shape[1]
# to_mask: [batch_size, 1, seq_length]
to_mask = tf.cast(
tf.reshape(to_mask, [batch_size, 1, to_seq_length]), tf.float32)
# We don't assume that `from_tensor` is a mask (although it could be). We
# don't actually care if we attend *from* padding tokens (only *to* padding)
# tokens so we create a tensor of all ones.
#
# `broadcast_ones` = [batch_size, from_seq_length, 1]
broadcast_ones = tf.ones(
shape=[batch_size, from_seq_length, 1], dtype=tf.float32)
# Here we broadcast along two dimensions to create the mask.
# mask: [batch_size, from_seq_length, seq_length]
mask = broadcast_ones * to_mask
return mask
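The broadcasting above is easier to see in a tiny NumPy sketch (hypothetical batch_size=2, seq_length=3; not part of modeling.py):

import numpy as np

input_mask = np.array([[1, 1, 1],
                       [1, 1, 0]], dtype=np.float32)  # [batch_size, seq_length]
batch_size, seq_length = input_mask.shape

to_mask = input_mask.reshape(batch_size, 1, seq_length)                  # [B, 1, T]
broadcast_ones = np.ones((batch_size, seq_length, 1), dtype=np.float32)  # [B, F, 1]
attention_mask = broadcast_ones * to_mask                                # [B, F, T]

print(attention_mask[1])
# [[1. 1. 0.]
#  [1. 1. 0.]
#  [1. 1. 0.]]  every query position of the second example ignores its padded last token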
transformer_model is the real implementation of the Bert model; its structure is the encoder, i.e. the left half of the Transformer architecture diagram.

The main flow is:

1. Reshape input_tensor, of shape [batch_size, seq_length, hidden_size], into prev_output, of shape [batch_size * seq_length, hidden_size].
2. In each layer, call attention_layer to get attention_head; attention_layer computes the Self Attention.
3. Pass attention_head through a fully connected layer and a layer_norm layer to get attention_output.
4. Pass attention_output through an up-projection fully connected layer followed by a down-projection fully connected layer to get layer_output.
5. Add attention_output and layer_output as a residual connection.

def transformer_model(input_tensor,
attention_mask=None,
hidden_size=768,
num_hidden_layers=12,
num_attention_heads=12,
intermediate_size=3072,
intermediate_act_fn=gelu,
hidden_dropout_prob=0.1,
attention_probs_dropout_prob=0.1,
initializer_range=0.02,
do_return_all_layers=False):
"""Multi-headed, multi-layer Transformer from "Attention is All You Need".
This is almost an exact implementation of the original Transformer encoder.
See the original paper:
https://arxiv.org/abs/1706.03762
Also see:
https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/models/transformer.py
Args:
input_tensor: float Tensor of shape [batch_size, seq_length, hidden_size].
attention_mask: (optional) int32 Tensor of shape [batch_size, seq_length,
seq_length], with 1 for positions that can be attended to and 0 in
positions that should not be.
hidden_size: int. Hidden size of the Transformer.
num_hidden_layers: int. Number of layers (blocks) in the Transformer.
num_attention_heads: int. Number of attention heads in the Transformer.
intermediate_size: int. The size of the "intermediate" (a.k.a., feed
forward) layer.
intermediate_act_fn: function. The non-linear activation function to apply
to the output of the intermediate/feed-forward layer.
hidden_dropout_prob: float. Dropout probability for the hidden layers.
attention_probs_dropout_prob: float. Dropout probability of the attention
probabilities.
initializer_range: float. Range of the initializer (stddev of truncated
normal).
do_return_all_layers: Whether to also return all layers or just the final
layer.
Returns:
float Tensor of shape [batch_size, seq_length, hidden_size], the final
hidden layer of the Transformer.
Raises:
ValueError: A Tensor shape or parameter is invalid.
"""
# Raise an error if hidden_size is not divisible by num_attention_heads
if hidden_size % num_attention_heads != 0:
raise ValueError(
"The hidden size (%d) is not a multiple of the number of attention "
"heads (%d)" % (hidden_size, num_attention_heads))
# Compute the size of each attention head
attention_head_size = int(hidden_size / num_attention_heads)
input_shape = get_shape_list(input_tensor, expected_rank=3)
batch_size = input_shape[0]
seq_length = input_shape[1]
# input_width is the hidden_size
input_width = input_shape[2]
# The Transformer performs sum residuals on all layers so the input needs
# to be the same as the hidden size.
# Residual connections are added later, so hidden_size must equal input_width
if input_width != hidden_size:
raise ValueError("The width of the input tensor (%d) != hidden size (%d)" %
(input_width, hidden_size))
# We keep the representation as a 2D tensor to avoid re-shaping it back and
# forth from a 3D tensor to a 2D tensor. Re-shapes are normally free on
# the GPU/CPU but may not be free on the TPU, so we want to minimize them to
# help the optimizer.
# prev_output: [batch_size * seq_length, hidden_size]
prev_output = reshape_to_matrix(input_tensor)
all_layer_outputs = []
for layer_idx in range(num_hidden_layers):
# Loop over the encoder layers
with tf.variable_scope("layer_%d" % layer_idx):
# Take the output of the previous layer as this layer's input
layer_input = prev_output
with tf.variable_scope("attention"):
attention_heads = []
with tf.variable_scope("self"):
# Compute Self Attention
# attention_head: [batch_size * seq_length, hidden_size]
attention_head = attention_layer(
from_tensor=layer_input,
to_tensor=layer_input,
attention_mask=attention_mask,
num_attention_heads=num_attention_heads,
size_per_head=attention_head_size,
attention_probs_dropout_prob=attention_probs_dropout_prob,
initializer_range=initializer_range,
do_return_2d_tensor=True,
batch_size=batch_size,
from_seq_length=seq_length,
to_seq_length=seq_length)
attention_heads.append(attention_head)
attention_output = None
if len(attention_heads) == 1:
attention_output = attention_heads[0]
else:
# In the case where we have other sequences, we just concatenate
# them to the self-attention head before the projection.
attention_output = tf.concat(attention_heads, axis=-1)
# Run a linear projection of `hidden_size` then add a residual
# with `layer_input`.
with tf.variable_scope("output"):
# A plain fully connected layer
# attention_output: [batch_size * seq_length, hidden_size]
attention_output = tf.layers.dense(
attention_output,
hidden_size,
kernel_initializer=create_initializer(initializer_range))
attention_output = dropout(attention_output, hidden_dropout_prob)
# Layer norm over the residual sum
attention_output = layer_norm(attention_output + layer_input)
# The activation is only applied to the "intermediate" hidden layer.
with tf.variable_scope("intermediate"):
# Up-projection fully connected layer
# intermediate_output: [batch_size * seq_length, intermediate_size]
intermediate_output = tf.layers.dense(
attention_output,
intermediate_size,
activation=intermediate_act_fn,
kernel_initializer=create_initializer(initializer_range))
# Down-project back to `hidden_size` then add the residual.
with tf.variable_scope("output"):
# Down-projection fully connected layer
# layer_output: [batch_size * seq_length, hidden_size]
layer_output = tf.layers.dense(
intermediate_output,
hidden_size,
kernel_initializer=create_initializer(initializer_range))
layer_output = dropout(layer_output, hidden_dropout_prob)
# Add as a residual connection
layer_output = layer_norm(layer_output + attention_output)
# This layer's output becomes the next layer's input
prev_output = layer_output
# Save this layer's output to the list of results
all_layer_outputs.append(layer_output)
if do_return_all_layers:
# Return the outputs of all layers
final_outputs = []
for layer_output in all_layer_outputs:
final_output = reshape_from_matrix(layer_output, input_shape)
final_outputs.append(final_output)
return final_outputs
else:
# Return the output of the last layer
final_output = reshape_from_matrix(prev_output, input_shape)
return final_output
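A small NumPy sketch (using the sizes from the bert_config.json above and a hypothetical batch_size of 8) of the 2D/3D reshaping that transformer_model relies on via reshape_to_matrix and reshape_from_matrix, plus the per-head size check at the top of the function:

import numpy as np

batch_size, seq_length = 8, 256
hidden_size, num_attention_heads = 256, 4
attention_head_size = hidden_size // num_attention_heads  # 64; hidden_size must divide evenly

input_tensor = np.zeros((batch_size, seq_length, hidden_size))

# reshape_to_matrix: [batch_size, seq_length, hidden_size] -> [batch_size * seq_length, hidden_size]
prev_output = input_tensor.reshape(-1, hidden_size)
assert prev_output.shape == (batch_size * seq_length, hidden_size)

# reshape_from_matrix: back to [batch_size, seq_length, hidden_size] after the last layer
final_output = prev_output.reshape(batch_size, seq_length, hidden_size)
assert final_output.shape == input_tensor.shape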
attention_layer is where the Self Attention is actually computed.

attention_layer() has two main inputs: from_tensor and to_tensor. If from_tensor and to_tensor are the same, this is self-attention (as in the encoder); if they differ, it is cross-attention (as used when a decoder attends to the encoder output).

The main flow is:

1. Reshape from_tensor into from_tensor_2d, of shape [batch_size * from_seq_length, from_width], and to_tensor into to_tensor_2d, of shape [batch_size * to_seq_length, to_width].
2. From from_tensor, compute query_layer (the query matrix), of shape [batch_size * from_seq_length, num_attention_heads * size_per_head]. From to_tensor, compute key_layer (the key matrix) and value_layer (the value matrix), of shape [batch_size * to_seq_length, num_attention_heads * size_per_head].
3. Reshape query_layer to [batch_size, num_attention_heads, from_seq_length, size_per_head], and key_layer and value_layer to [batch_size, num_attention_heads, to_seq_length, size_per_head].
4. Multiply query_layer by key_layer to get attention_scores, and scale by dividing by $\sqrt{size\_per\_head}$; the shape is [batch_size, num_attention_heads, from_seq_length, to_seq_length].
5. Apply attention_mask to attention_scores. A neat trick is used here: attention_mask is turned into an additive term adder, which is 0 where the mask is 1 and -10000 where the mask is 0, and adder is added to attention_scores. Attended positions are left unchanged, while masked positions become very negative (a small numeric sketch of this trick follows this list).
6. Pass attention_scores through a softmax: the attention for masked positions becomes close to 0, which achieves the masking. The result is attention_probs, which then goes through dropout.
7. Multiply attention_probs by value_layer to get context_layer, of shape [batch_size, num_attention_heads, from_seq_length, size_per_head].
8. Finally, reshape context_layer to [batch_size, from_seq_length, num_attention_heads * size_per_head] (or [batch_size * from_seq_length, num_attention_heads * size_per_head] when do_return_2d_tensor is True, as in transformer_model) and return it.
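Here is a small NumPy sketch (illustrative values only, not from modeling.py) of the mask-to-adder trick described in step 5:

import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

attention_scores = np.array([[2.0, 1.0, 0.5]])  # raw scores for one query position
attention_mask = np.array([[1.0, 1.0, 0.0]])    # the last key position is padding

adder = (1.0 - attention_mask) * -10000.0       # 0 where mask == 1, -10000 where mask == 0
attention_probs = softmax(attention_scores + adder)

print(attention_probs)  # approximately [[0.73, 0.27, 0.00]]: the masked position gets ~0 weight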
def attention_layer(from_tensor,
to_tensor,
attention_mask=None,
num_attention_heads=1,
size_per_head=512,
query_act=None,
key_act=None,
value_act=None,
attention_probs_dropout_prob=0.0,
initializer_range=0.02,
do_return_2d_tensor=False,
batch_size=None,
from_seq_length=None,
to_seq_length=None):
"""Performs multi-headed attention from `from_tensor` to `to_tensor`.
This is an implementation of multi-headed attention based on "Attention
is all you Need". If `from_tensor` and `to_tensor` are the same, then
this is self-attention. Each timestep in `from_tensor` attends to the
corresponding sequence in `to_tensor`, and returns a fixed-width vector.
This function first projects `from_tensor` into a "query" tensor and
`to_tensor` into "key" and "value" tensors. These are (effectively) a list
of tensors of length `num_attention_heads`, where each tensor is of shape
[batch_size, seq_length, size_per_head].
Then, the query and key tensors are dot-producted and scaled. These are
softmaxed to obtain attention probabilities. The value tensors are then
interpolated by these probabilities, then concatenated back to a single
tensor and returned.
In practice, the multi-headed attention are done with transposes and
reshapes rather than actual separate tensors.
Args:
from_tensor: float Tensor of shape [batch_size, from_seq_length,
from_width].
to_tensor: float Tensor of shape [batch_size, to_seq_length, to_width].
attention_mask: (optional) int32 Tensor of shape [batch_size,
from_seq_length, to_seq_length]. The values should be 1 or 0. The
attention scores will effectively be set to -infinity for any positions in
the mask that are 0, and will be unchanged for positions that are 1.
num_attention_heads: int. Number of attention heads.
size_per_head: int. Size of each attention head.
query_act: (optional) Activation function for the query transform.
key_act: (optional) Activation function for the key transform.
value_act: (optional) Activation function for the value transform.
attention_probs_dropout_prob: (optional) float. Dropout probability of the
attention probabilities.
initializer_range: float. Range of the weight initializer.
do_return_2d_tensor: bool. If True, the output will be of shape [batch_size
* from_seq_length, num_attention_heads * size_per_head]. If False, the
output will be of shape [batch_size, from_seq_length, num_attention_heads
* size_per_head].
batch_size: (Optional) int. If the input is 2D, this might be the batch size
of the 3D version of the `from_tensor` and `to_tensor`.
from_seq_length: (Optional) If the input is 2D, this might be the seq length
of the 3D version of the `from_tensor`.
to_seq_length: (Optional) If the input is 2D, this might be the seq length
of the 3D version of the `to_tensor`.
Returns:
float Tensor of shape [batch_size, from_seq_length,
num_attention_heads * size_per_head]. (If `do_return_2d_tensor` is
true, this will be of shape [batch_size * from_seq_length,
num_attention_heads * size_per_head]).
Raises:
ValueError: Any of the arguments or tensor shapes are invalid.
"""
def transpose_for_scores(input_tensor, batch_size, num_attention_heads,
seq_length, width):
output_tensor = tf.reshape(
input_tensor, [batch_size, seq_length, num_attention_heads, width])
output_tensor = tf.transpose(output_tensor, [0, 2, 1, 3])
return output_tensor
from_shape = get_shape_list(from_tensor, expected_rank=[2, 3])
to_shape = get_shape_list(to_tensor, expected_rank=[2, 3])
if len(from_shape) != len(to_shape):
raise ValueError(
"The rank of `from_tensor` must match the rank of `to_tensor`.")
if len(from_shape) == 3:
batch_size = from_shape[0]
from_seq_length = from_shape[1]
to_seq_length = to_shape[1]
elif len(from_shape) == 2:
if (batch_size is None or from_seq_length is None or to_seq_length is None):
raise ValueError(
"When passing in rank 2 tensors to attention_layer, the values "
"for `batch_size`, `from_seq_length`, and `to_seq_length` "
"must all be specified.")
# Scalar dimensions referenced here:
# B = batch size (number of sequences)
# F = `from_tensor` sequence length
# T = `to_tensor` sequence length
# N = `num_attention_heads`
# H = `size_per_head`
# Reshape from_tensor to [batch_size * from_seq_length, from_width]
from_tensor_2d = reshape_to_matrix(from_tensor)
# Reshape to_tensor to [batch_size * to_seq_length, to_width]
to_tensor_2d = reshape_to_matrix(to_tensor)
# query_layer: [batch_size * from_seq_length, num_attention_heads * size_per_head]
query_layer = tf.layers.dense(
from_tensor_2d,
num_attention_heads * size_per_head,
activation=query_act,
name="query",
kernel_initializer=create_initializer(initializer_range))
# key_layer: [batch_size * to_seq_length, num_attention_heads * size_per_head]
key_layer = tf.layers.dense(
to_tensor_2d,
num_attention_heads * size_per_head,
activation=key_act,
name="key",
kernel_initializer=create_initializer(initializer_range))
# value_layer: [batch_size * to_seq_length, num_attention_heads * size_per_head]
value_layer = tf.layers.dense(
to_tensor_2d,
num_attention_heads * size_per_head,
activation=value_act,
name="value",
kernel_initializer=create_initializer(initializer_range))
# `query_layer` = [B, N, F, H]
query_layer = transpose_for_scores(query_layer, batch_size,
num_attention_heads, from_seq_length,
size_per_head)
# `key_layer` = [B, N, T, H]
key_layer = transpose_for_scores(key_layer, batch_size, num_attention_heads,
to_seq_length, size_per_head)
# Take the dot product between "query" and "key" to get the raw
# attention scores.
# attention_scores: [batch_size, num_attention_heads, from_seq_length, to_seq_length]
attention_scores = tf.matmul(query_layer, key_layer, transpose_b=True)
# Scale by dividing by sqrt(size_per_head)
attention_scores = tf.multiply(attention_scores,
1.0 / math.sqrt(float(size_per_head)))
# Apply the mask to the attention scores
if attention_mask is not None:
# `attention_mask` = [B, 1, F, T]
attention_mask = tf.expand_dims(attention_mask, axis=[1])
# Since attention_mask is 1.0 for positions we want to attend and 0.0 for
# masked positions, this operation will create a tensor which is 0.0 for
# positions we want to attend and -10000.0 for masked positions.
adder = (1.0 - tf.cast(attention_mask, tf.float32)) * -10000.0
# Since we are adding it to the raw scores before the softmax, this is
# effectively the same as removing these entirely.
attention_scores += adder
# Normalize the attention scores to probabilities.
# `attention_probs` = [B, N, F, T]
# Apply softmax to the attention scores
attention_probs = tf.nn.softmax(attention_scores)
# This is actually dropping out entire tokens to attend to, which might
# seem a bit unusual, but is taken from the original Transformer paper.
# Apply dropout to the attention probabilities
attention_probs = dropout(attention_probs, attention_probs_dropout_prob)
# `value_layer` = [B, T, N, H]
value_layer = tf.reshape(
value_layer,
[batch_size, to_seq_length, num_attention_heads, size_per_head])
# `value_layer` = [B, N, T, H]
value_layer = tf.transpose(value_layer, [0, 2, 1, 3])
# `context_layer` = [B, N, F, H]
# Multiply the attention probabilities by value_layer to get context_layer
context_layer = tf.matmul(attention_probs, value_layer)
# `context_layer` = [B, F, N, H]
context_layer = tf.transpose(context_layer, [0, 2, 1, 3])
if do_return_2d_tensor:
# `context_layer` = [B*F, N*H]
context_layer = tf.reshape(
context_layer,
[batch_size * from_seq_length, num_attention_heads * size_per_head])
else:
# `context_layer` = [batch_size, from_seq_length, num_attention_heads * size_per_head]
context_layer = tf.reshape(
context_layer,
[batch_size, from_seq_length, num_attention_heads * size_per_head])
return context_layer
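Finally, a NumPy sketch (not from modeling.py) of what transpose_for_scores does: splitting the projected 2D tensor into heads and moving the head axis forward:

import numpy as np

B, F, N, H = 2, 3, 4, 16                 # batch_size, from_seq_length, num_attention_heads, size_per_head
query_layer = np.zeros((B * F, N * H))   # output of the "query" dense layer

x = query_layer.reshape(B, F, N, H)      # [B, F, N, H]
x = x.transpose(0, 2, 1, 3)              # [B, N, F, H], ready for the batched matmul with key_layer
assert x.shape == (B, N, F, H)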
That concludes the walkthrough of the Bert source code. In the next article, we will cover how to pre-train Bert and how to use the trained Bert model for classification.
If you have any questions, feel free to leave a comment.
If you found this article helpful, consider giving it a like; it motivates me to keep writing good articles.