氵文大师

[Repost] Transformer code for debugging

Original source of the full code:
https://github.com/PaddlePaddle/Paddle/blob/develop/python/paddle/nn/layer/transformer.py

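After reading a blog post or paper about the Transformer, some of you will want to see how the code is actually written. The code in this post is taken from the main PaddlePaddle framework, and every `if __name__ == "__main__":` block can be run under a debugger to inspect what happens at each step.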

There are six main classes:
MultiHeadAttention, TransformerEncoderLayer, TransformerEncoder
TransformerDecoderLayer, TransformerDecoder, Transformer

TransformerEncoderLayer and TransformerDecoderLayer both build on MultiHeadAttention.
TransformerEncoder and TransformerDecoder stack TransformerEncoderLayer and TransformerDecoderLayer, respectively, several times.
Transformer is built from TransformerEncoder and TransformerDecoder.
That is the overall network structure.
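
The nesting described above can be sketched with plain Python stand-ins. The classes below are hypothetical placeholders, not the Paddle layers, and they do no computation; they only record which attention blocks each level creates. The sketch assumes that a decoder layer holds two attention blocks (self-attention and decoder-encoder cross-attention), as in the original Transformer design; the decoder code itself is not part of the excerpt shown here.

```python
# Hypothetical stand-ins that mirror the nesting only; no real computation.
class Attention:
    def __init__(self, role):
        self.role = role


class EncoderLayer:
    def __init__(self):
        self.self_attn = Attention("encoder self-attention")


class DecoderLayer:
    def __init__(self):
        self.self_attn = Attention("decoder self-attention")
        self.cross_attn = Attention("decoder-encoder cross-attention")


class Stack:
    """Repeats one kind of layer num_layers times (encoder or decoder)."""

    def __init__(self, layer_cls, num_layers):
        self.layers = [layer_cls() for _ in range(num_layers)]


class TransformerSketch:
    def __init__(self, num_encoder_layers, num_decoder_layers):
        self.encoder = Stack(EncoderLayer, num_encoder_layers)
        self.decoder = Stack(DecoderLayer, num_decoder_layers)

    def count_attention_blocks(self):
        enc = len(self.encoder.layers)        # one block per encoder layer
        dec = 2 * len(self.decoder.layers)    # two blocks per decoder layer
        return enc + dec


model = TransformerSketch(num_encoder_layers=6, num_decoder_layers=6)
print(model.count_attention_blocks())
```

With six layers on each side, the sketch counts 6 + 2 * 6 attention blocks, which is a quick way to see why MultiHeadAttention is the class worth stepping through first.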

One more detail worth noting:

if attn_mask is not None and attn_mask.dtype != dtype:
    attn_mask_dtype = convert_dtype(attn_mask.dtype) # paddle.dtype => string
    if attn_mask_dtype == 'bool' or 'int' in attn_mask_dtype:
        attn_mask = (paddle.cast(attn_mask, dtype) - 1.0) * 1e9  # 1 -> 0, 0 -> -1e9, because a softmax follows
    else:
        attn_mask = paddle.cast(attn_mask, dtype)
return attn_mask

if attn_mask is not None:
    # Support bool or int mask
    attn_mask = _convert_attention_mask(attn_mask, product.dtype) # attn_mask in [-1e9, 0]
    product = product + attn_mask
weights = F.softmax(product)

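Because a softmax follows, the more negative the value added to a score, the closer the resulting attention weight is to 0. That is why a bool or int mask is converted with this step: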

attn_mask = (paddle.cast(attn_mask, dtype) - 1.0) * 1e9
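
To make the effect concrete, here is a minimal pure-Python sketch of the same arithmetic on a single row of scores. The helper names `to_additive_mask` and `softmax` are made up for this illustration and are not part of Paddle; the scores and mask are arbitrary example values.

```python
import math


def to_additive_mask(mask_row):
    """Map a 0/1 (or bool) mask row to additive form: 1 -> 0.0, 0 -> -1e9."""
    return [(float(m) - 1.0) * 1e9 for m in mask_row]


def softmax(row):
    peak = max(row)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


scores = [2.0, 1.0, 0.5, 3.0]       # raw attention scores for one query
mask = [True, True, False, False]   # False marks positions to ignore

additive = to_additive_mask(mask)
masked_scores = [s + a for s, a in zip(scores, additive)]
weights = softmax(masked_scores)
print(additive)
print(weights)
```

The masked positions end up with weight 0 even though one of them had the largest raw score, and the remaining weights still sum to 1.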

#   Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# TODO: define the classes of Transformer neural network

import copy
import collections
import numpy as np

import paddle
from paddle.nn import LayerNorm, Linear, Dropout, Layer, LayerList
import paddle.nn.functional as F
import paddle.tensor as tensor
import paddle.fluid.layers as layers
from paddle.fluid.data_feeder import convert_dtype


def _convert_param_attr_to_list(param_attr, n):
    """
    If `param_attr` is a list or tuple, convert every element in it to a
    ParamAttr instance. Otherwise, repeat `param_attr` `n` times to
    construct a list, and rename every one by appending an increasing index
    suffix to avoid having same names when `param_attr` contains a name.

    Parameters:
        param_attr (list|tuple|ParamAttr): A list, tuple or something can be
            converted to a ParamAttr instance by `ParamAttr._to_attr`.
        n (int): The times to repeat to construct a list when `param_attr`
            is not a list or tuple.

    Returns:
        list: A list composed of each including cell's `param_attr`.
    """
    if isinstance(param_attr, (list, tuple)):
        assert len(param_attr) == n, (
            "length of param_attr should be %d when it is a list/tuple" % n)
        param_attrs = []
        for attr in param_attr:
            if isinstance(attr, bool):
                if attr:
                    param_attrs.append(paddle.ParamAttr._to_attr(None))
                else:
                    param_attrs.append(False)
            else:
                param_attrs.append(paddle.ParamAttr._to_attr(attr))
        # param_attrs = [paddle.ParamAttr._to_attr(attr) for attr in param_attr]
    elif isinstance(param_attr, bool):
        param_attrs = []
        if param_attr:
            param_attrs = [paddle.ParamAttr._to_attr(None) for i in range(n)]
        else:
            param_attrs = [False] * n
    else:
        param_attrs = []
        attr = paddle.ParamAttr._to_attr(param_attr)
        for i in range(n):
            attr_i = copy.deepcopy(attr)
            if attr.name:
                attr_i.name = attr_i.name + "_" + str(i)
            param_attrs.append(attr_i)
    return param_attrs


def _convert_attention_mask(attn_mask, dtype):
    """
    Convert the attention mask to the target dtype we expect.

    Parameters:
        attn_mask (Tensor, optional): A tensor used in multi-head attention
                to prevent attention to some unwanted positions, usually the
                paddings or the subsequent positions. It is a tensor with shape
                broadcasted to `[batch_size, n_head, sequence_length, sequence_length]`.
                When the data type is bool, the unwanted positions have `False` 
                values and the others have `True` values. When the data type is 
                int, the unwanted positions have 0 values and the others have 1 
                values. When the data type is float, the unwanted positions have 
                `-INF` values and the others have 0 values. It can be None when 
                nothing wanted or needed to be prevented attention to. Default None.
        dtype (VarType): The target type of `attn_mask` we expect.

    Returns:
        Tensor: A Tensor with shape same as input `attn_mask`, with data type `dtype`.
    """
    if attn_mask is not None and attn_mask.dtype != dtype:
        attn_mask_dtype = convert_dtype(attn_mask.dtype) # paddle.dtype => string
        if attn_mask_dtype == 'bool' or 'int' in attn_mask_dtype:
            attn_mask = (paddle.cast(attn_mask, dtype) - 1.0) * 1e9  # 1 -> 0, 0 -> -1e9, because a softmax follows
        else:
            attn_mask = paddle.cast(attn_mask, dtype)
    return attn_mask


class MultiHeadAttention(Layer):
    """
    Attention maps queries and a set of key-value pairs to outputs, and
    Multi-Head Attention performs multiple parallel attention to jointly attend
    to information from different representation subspaces.

    Please refer to the paper "Attention Is All You Need"
    for more details.

    Parameters:
        embed_dim (int): The expected feature size in the input and output.
        num_heads (int): The number of heads in multi-head attention.
        dropout (float, optional): The dropout probability used on attention
            weights to drop some attention targets. 0 for no dropout. Default 0
        kdim (int, optional): The feature size in key. If None, assumed equal to
            `embed_dim`. Default None.
        vdim (int, optional): The feature size in value. If None, assumed equal to
            `embed_dim`. Default None.
        need_weights (bool, optional): Indicate whether to return the attention
            weights. Default False.
        weight_attr(ParamAttr, optional):  To specify the weight parameter property.
            Default: None, which means the default weight parameter property is used.
            See usage for details in :code:`ParamAttr` .
        bias_attr (ParamAttr|bool, optional): To specify the bias parameter property.
            Default: None, which means the default bias parameter property is used.
            If it is set to False, this layer will not have trainable bias parameter.
            See usage for details in :code:`ParamAttr` .
         
    Examples:

        .. code-block:: python

            import paddle

            # encoder input: [batch_size, sequence_length, d_model]
            query = paddle.rand((2, 4, 128))
            # self attention mask: [batch_size, num_heads, query_len, query_len]
            attn_mask = paddle.rand((2, 2, 4, 4))
            multi_head_attn = paddle.nn.MultiHeadAttention(128, 2)
            output = multi_head_attn(query, None, None, attn_mask=attn_mask)  # [2, 4, 128]
    """

    Cache = collections.namedtuple("Cache", ["k", "v"])
    StaticCache = collections.namedtuple("StaticCache", ["k", "v"])

    def __init__(self,
                 embed_dim,
                 num_heads,
                 dropout=0.,
                 kdim=None,
                 vdim=None,
                 need_weights=False,
                 weight_attr=None,
                 bias_attr=None):
        super(MultiHeadAttention, self).__init__()

        assert embed_dim > 0, ("Expected embed_dim to be greater than 0, "
                               "but received {}".format(embed_dim))
        assert num_heads > 0, ("Expected num_heads to be greater than 0, "
                               "but received {}".format(num_heads))

        self.embed_dim = embed_dim
        self.kdim = kdim if kdim is not None else embed_dim
        self.vdim = vdim if vdim is not None else embed_dim
        self.num_heads = num_heads
        self.dropout = dropout
        self.need_weights = need_weights

        self.head_dim = embed_dim // num_heads # each head attends over head_dim features; the head outputs are concatenated at the end
        assert self.head_dim * num_heads == self.embed_dim, "embed_dim must be divisible by num_heads"

        self.q_proj = Linear(embed_dim,
                             embed_dim,
                             weight_attr,
                             bias_attr=bias_attr)
        
        # k and v pass through their own linear projections first, so kdim and vdim may differ from embed_dim
        self.k_proj = Linear(self.kdim,
                             embed_dim,
                             weight_attr,
                             bias_attr=bias_attr)
        self.v_proj = Linear(self.vdim,
                             embed_dim,
                             weight_attr,
                             bias_attr=bias_attr)
        self.out_proj = Linear(embed_dim,
                               embed_dim,
                               weight_attr,
                               bias_attr=bias_attr)

    def _prepare_qkv(self, query, key, value, cache=None):
        r"""
        Prepares linear projected queries, keys and values for usage of subsequent
        multiple parallel attention. If `cache` is not None, using cached results
        to reduce redundant calculations.

        Parameters:
            query (Tensor): The queries for multi-head attention. It is a
                tensor with shape `[batch_size, query_length, embed_dim]`. The
                data type should be float32 or float64.
            key (Tensor): The keys for multi-head attention. It is
                a tensor with shape `[batch_size, key_length, kdim]`. The
                data type should be float32 or float64. If None, use `query` as
                `key`.
            value (Tensor): The values for multi-head attention. It
                is a tensor with shape `[batch_size, value_length, vdim]`.
                The data type should be float32 or float64. If None, use `query` as
                `value`.
            cache (MultiHeadAttention.Cache|MultiHeadAttention.StaticCache, optional):
                It is a namedtuple with `k` and `v` as fields, and stores tensors
                shaped `[batch_size, num_heads, length, embed_dim]` which are results
                of linear projection, reshape and transpose calculations in
                MultiHeadAttention. If it is an instance of `Cache`, `k` and `v`
                fields reserve intermediate results of previous positions, which
                mostly used for decoder self attention. If it is an instance of
                `StaticCache`, `key` and `value` args would be ignored, `k` and
                `v` fields would be used as calculated results on `key` and
                `value`, which mostly used for decoder-encoder cross attention.
                It is only used for inference and should be None for training.
                Default None.

        Returns:
            tuple: A tuple including linear projected queries, keys and values, \
                plus the updated cache when `cache` is not None. The projected \
                tensors have shape `[batch_size, n_head, sequence_length, head_dim]`, \
                and their data types are same as inputs.
        """
        q = self.q_proj(query)
        q = tensor.reshape(x=q, shape=[0, 0, self.num_heads, self.head_dim]) # 0 keeps the input dim: [bs, seq_len, num_heads, head_dim]
        q = tensor.transpose(x=q, perm=[0, 2, 1, 3]) # [bs, num_heads, seq_len, head_dim], where head_dim * num_heads == embed_dim

        if isinstance(cache, self.StaticCache): # StaticCache is a class created by namedtuple
            # for encoder-decoder attention in inference and has cached
            k, v = cache.k, cache.v
        else:
            k, v = self.compute_kv(key, value) # [bs, num_heads, seq_len, head_dim]

        if isinstance(cache, self.Cache):
            # for decoder self-attention in inference
            k = tensor.concat([cache.k, k], axis=2)
            v = tensor.concat([cache.v, v], axis=2)
            cache = self.Cache(k, v)

        return (q, k, v) if cache is None else (q, k, v, cache)

    def compute_kv(self, key, value):
        r"""
        Applies linear projection on input keys and values, then splits heads
        (reshape and transpose) to get keys and values from different representation
        subspaces. The results are used as key-values pairs for subsequent multiple
        parallel attention.
        
        It is part of calculations in multi-head attention, and is provided as
        a method to pre-compute and prefetch these results, thus we can use them
        to construct cache for inference.

        Parameters:
            key (Tensor): The keys for multi-head attention. It is a tensor
                with shape `[batch_size, sequence_length, kdim]`. The data type
                should be float32 or float64.
            value (Tensor): The values for multi-head attention. It is a tensor
                with shape `[batch_size, sequence_length, vdim]`. The data type
                should be float32 or float64.

        Returns:
            tuple: A tuple including transformed keys and values. Their shapes \
                both are `[batch_size, num_heads, sequence_length, embed_dim // num_heads]`, \
                and their data types are same as inputs.
        """
        k = self.k_proj(key)
        v = self.v_proj(value)
        k = tensor.reshape(x=k, shape=[0, 0, self.num_heads, self.head_dim]) # [bs, seq_len, num_heads, head_dim]
        k = tensor.transpose(x=k, perm=[0, 2, 1, 3])
        v = tensor.reshape(x=v, shape=[0, 0, self.num_heads, self.head_dim])
        v = tensor.transpose(x=v, perm=[0, 2, 1, 3])
        return k, v

    def gen_cache(self, key, value=None, type=Cache):
        """
        Generates cache for `forward` usage in inference according to arguments.
        The generated cache is an instance of `MultiHeadAttention.Cache` or an
        instance of `MultiHeadAttention.StaticCache`.

        `Cache` or `StaticCache` is namedtuple with `k` and `v` as fields,
        and it stores tensors shaped `[batch_size, num_heads, length, embed_dim]`
        which are results of linear projection, reshape and transpose calculations
        in MultiHeadAttention.
        
        If the generated cache is an instance of `Cache`, `k` and `v` fields
        reserve intermediate result tensors of previous positions, and the tensors
        are incremental among decoding steps, which mostly are used for
        decoder self attention.
        
        If the generated cache is an instance of `StaticCache`, `k` and `v` fields
        would be used as calculated result tensors on keys and values in `forward`,
        and the tensors keep unchanged among decoding steps, which are mostly used
        for decoder-encoder cross attention.

        The cache is generated as follows:

        1. If `type` is `StaticCache`, apply `compute_kv(key, value)` and use the
        results to create an instance of `StaticCache`.
        
        2. If `type` is `Cache` and `value` is None, generate empty tensors shaped
        `[batch_size, num_heads, 0, embed_dim // num_heads]` and use the results
        to create an instance of `Cache`, where `batch_size` is from the first
        dimension of `key`.

        3. If `type` is `Cache` and `value` is not None, use `key`, `value` to create
        an instance of `Cache`.

        Parameters:
            key (Tensor): The keys for multi-head attention. It is
                a tensor with shape `[batch_size, key_length, kdim]`. The
                data type should be float32 or float64. If `value` is None,
                it is only for batch size and data type reference.
            value (Tensor, optional): The values for multi-head attention. It
                is a tensor with shape `[batch_size, value_length, vdim]`.
                The data type should be float32 or float64. If None, `key` is only
                for batch size reference. Default None.
            type (type): It should be `MultiHeadAttention.StaticCache` or
                `MultiHeadAttention.Cache` to indicate the cache type to generate.
        
        Returns:
            namedtuple: an instance of `Cache` or `StaticCache` accordingly.
        """
        if type == MultiHeadAttention.StaticCache:  # static_kv
            k, v = self.compute_kv(key, value)
            return self.StaticCache(k, v)
        elif value is None:  # incremental_state
            k = layers.fill_constant_batch_size_like(
                input=key,
                shape=[-1, self.num_heads, 0, self.head_dim],
                dtype=key.dtype,
                value=0)
            v = layers.fill_constant_batch_size_like(
                input=key,
                shape=[-1, self.num_heads, 0, self.head_dim],
                dtype=key.dtype,
                value=0)
            return self.Cache(k, v)
        else:
            # incremental_state with initial value, mainly for usage like UniLM
            return self.Cache(key, value)

    def forward(self, query, key=None, value=None, attn_mask=None, cache=None):
        r"""
        Applies multi-head attention to map queries and a set of key-value pairs
        to outputs.

        Parameters:
            query (Tensor): The queries for multi-head attention. It is a
                tensor with shape `[batch_size, query_length, embed_dim]`. The
                data type should be float32 or float64.
            key (Tensor, optional): The keys for multi-head attention. It is
                a tensor with shape `[batch_size, key_length, kdim]`. The
                data type should be float32 or float64. If None, use `query` as
                `key`. Default None.
            value (Tensor, optional): The values for multi-head attention. It
                is a tensor with shape `[batch_size, value_length, vdim]`.
                The data type should be float32 or float64. If None, use `query` as
                `value`. Default None.
            attn_mask (Tensor, optional): A tensor used in multi-head attention
                to prevent attention to some unwanted positions, usually the
                paddings or the subsequent positions. It is a tensor with shape
                broadcasted to `[batch_size, n_head, sequence_length, sequence_length]`.
                When the data type is bool, the unwanted positions have `False` 
                values and the others have `True` values. When the data type is 
                int, the unwanted positions have 0 values and the others have 1 
                values. When the data type is float, the unwanted positions have 
                `-INF` values and the others have 0 values. It can be None when 
                nothing wanted or needed to be prevented attention to. Default None.
            cache (MultiHeadAttention.Cache|MultiHeadAttention.StaticCache, optional):
                It is a namedtuple with `k` and `v` as fields, and stores tensors
                shaped `[batch_size, num_heads, length, embed_dim]` which are results
                of linear projection, reshape and transpose calculations in
                MultiHeadAttention. If it is an instance of `Cache`, `k` and `v`
                fields reserve intermediate results of previous positions, which
                mostly used for decoder self attention. If it is an instance of
                `StaticCache`, `key` and `value` args would be ignored, `k` and
                `v` fields would be used as calculated results on `key` and
                `value`, which mostly used for decoder-encoder cross attention.
                It is only used for inference and should be None for training.
                Default None.

        Returns:
            Tensor|tuple: It is a tensor that has the same shape and data type \
                as `query`, representing attention output. Or a tuple if \
                `need_weights` is True or `cache` is not None. If `need_weights` \
                is True, except for attention output, the tuple also includes \
                the attention weights tensor shaped `[batch_size, num_heads, query_length, key_length]`. \
                If `cache` is not None, the tuple then includes the new cache \
                having the same type as `cache`, and if it is `StaticCache`, it \
                is same as the input `cache`, if it is `Cache`, the new cache \
                reserves tensors concatenating raw tensors with intermediate \
                results of current query.
        """
        # when key or value is None, fall back to query (self-attention)
        key = query if key is None else key
        value = query if value is None else value
        # compute q ,k ,v
        if cache is None:
            q, k, v = self._prepare_qkv(query, key, value, cache)
        else:
            q, k, v, cache = self._prepare_qkv(query, key, value, cache)

        # --------------- q, k, v above are all [bs, num_heads, seq_len, head_dim] ---------------
        # scale dot product attention
        product = paddle.matmul(x=q * (self.head_dim**-0.5), # [bs, h, q_len, d] @ [bs, h, d, k_len] => [bs, h, q_len, k_len] (y is transposed via transpose_y)
                                y=k,
                                transpose_y=True)
        if attn_mask is not None:
            # Support bool or int mask
            attn_mask = _convert_attention_mask(attn_mask, product.dtype) # attn_mask in [-1e9, 0]
            product = product + attn_mask
        weights = F.softmax(product)
        if self.dropout:
            weights = F.dropout(weights,
                                self.dropout,
                                training=self.training,
                                mode="upscale_in_train")

        out = tensor.matmul(weights, v)

        # combine heads
        out = tensor.transpose(out, perm=[0, 2, 1, 3]) # [bs, num_head, seq_len, head_dim] => [bs, seq_len, num_head, head_dim]
        out = tensor.reshape(x=out, shape=[0, 0, out.shape[2] * out.shape[3]]) # [bs, seq_len, embed_dim] embed_dim = num_head * head_dim

        # project to output
        out = self.out_proj(out) # final linear projection

        outs = [out]
        if self.need_weights:
            outs.append(weights)
        if cache is not None:
            outs.append(cache)
        return out if len(outs) == 1 else tuple(outs)


if __name__ == "__main__":
    
    # encoder input: [batch_size, sequence_length, d_model] 
    query = paddle.rand((2, 4, 128)) # d_model is the embedding size
    
    # self attention mask: [batch_size, num_heads, query_len, query_len]
    attn_mask = paddle.rand((2, 2, 4, 4))
    multi_head_attn = MultiHeadAttention(128, 2, dropout=0.1107)
    output = multi_head_attn(query, None, None, attn_mask=attn_mask)  # [2, 4, 128]
    
    print("MultiHeadAttention", output.shape)
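
The reshape and transpose steps in `_prepare_qkv` and `compute_kv`, and the "combine heads" step in `forward`, can be illustrated without Paddle. The sketch below works on nested Python lists for a single sample; `split_heads` and `merge_heads` are hypothetical helper names used only here, and the tiny input values are arbitrary.

```python
def split_heads(x, num_heads):
    """[seq_len][embed_dim] -> [num_heads][seq_len][head_dim] for one sample."""
    embed_dim = len(x[0])
    assert embed_dim % num_heads == 0, "embed_dim must be divisible by num_heads"
    head_dim = embed_dim // num_heads
    return [
        [token[h * head_dim:(h + 1) * head_dim] for token in x]
        for h in range(num_heads)
    ]


def merge_heads(heads):
    """[num_heads][seq_len][head_dim] -> [seq_len][embed_dim]."""
    seq_len = len(heads[0])
    return [
        [value for head in heads for value in head[t]]
        for t in range(seq_len)
    ]


# One sample with seq_len = 2 and embed_dim = 4, split across 2 heads.
x = [[1, 2, 3, 4],
     [5, 6, 7, 8]]
heads = split_heads(x, num_heads=2)
print(heads)               # head 0 sees features 0-1, head 1 sees features 2-3
print(merge_heads(heads))  # concatenating the heads restores the input layout
```

Each head only ever sees its own contiguous slice of the projected features, and merging is the exact inverse of splitting, which is why the output of attention can go straight into `out_proj` with shape `[bs, seq_len, embed_dim]`.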




class TransformerEncoderLayer(Layer): # matches the encoder block in the figure of "Attention Is All You Need"
    """
    TransformerEncoderLayer is composed of two sub-layers which are self (multi-head)
    attention and feedforward network. Before and after each sub-layer, pre-process
    and post-process would be applied on the input and output accordingly. If
    `normalize_before` is True, pre-process is layer normalization and post-process
    includes dropout, residual connection. Otherwise, no pre-process and post-process
    includes dropout, residual connection, layer normalization.

    Parameters:
        d_model (int): The expected feature size in the input and output.
        nhead (int): The number of heads in multi-head attention(MHA).
        dim_feedforward (int): The hidden layer size in the feedforward network(FFN).
        dropout (float, optional): The dropout probability used in pre-process
            and post-process of MHA and FFN sub-layer. Default 0.1
        activation (str, optional): The activation function in the feedforward
            network. Default relu.
        attn_dropout (float, optional): The dropout probability used
            in MHA to drop some attention target. If None, use the value of
            `dropout`. Default None
        act_dropout (float, optional): The dropout probability used after FFN
            activation.  If None, use the value of `dropout`. Default None
        normalize_before (bool, optional): Indicate whether to put layer normalization
            into preprocessing of MHA and FFN sub-layers. If True, pre-process is layer
            normalization and post-process includes dropout, residual connection.
            Otherwise, no pre-process and post-process includes dropout, residual
            connection, layer normalization. Default False
        weight_attr(ParamAttr|list|tuple, optional): To specify the weight parameter property.
            If it is a list/tuple, `weight_attr[0]` would be used as `weight_attr` for
            MHA, and `weight_attr[1]` would be used as `weight_attr` for linear in FFN.
            Otherwise, MHA and FFN both use it as `weight_attr` to create parameters.
            Default: None, which means the default weight parameter property is used.
            See usage for details in :code:`ParamAttr` . 
        bias_attr (ParamAttr|list|tuple|bool, optional): To specify the bias parameter property.
            If it is a list/tuple, `bias_attr[0]` would be used as `bias_attr` for
            MHA, and `bias_attr[1]` would be used as `bias_attr` for linear in FFN.
            Otherwise, MHA and FFN both use it as `bias_attr` to create parameters.
            The `False` value means the corresponding layer would not have trainable
            bias parameter. See usage for details in :code:`ParamAttr` . Default: None,
            which means the default bias parameter property is used.
            

    Examples:

        .. code-block:: python

            import paddle
            from paddle.nn import TransformerEncoderLayer

            # encoder input: [batch_size, src_len, d_model]
            enc_input = paddle.rand((2, 4, 128))
            # self attention mask: [batch_size, n_head, src_len, src_len]
            attn_mask = paddle.rand((2, 2, 4, 4))
            encoder_layer = TransformerEncoderLayer(128, 2, 512)
            enc_output = encoder_layer(enc_input, attn_mask)  # [2, 4, 128]
    """

    def __init__(self,
                 d_model,
                 nhead,
                 dim_feedforward,
                 dropout=0.1,
                 activation="relu",
                 attn_dropout=None,
                 act_dropout=None,
                 normalize_before=False,
                 weight_attr=None,
                 bias_attr=None):
        self._config = locals()
        self._config.pop("self")
        self._config.pop("__class__", None)  # py3: locals() also holds the implicit __class__ cell because this method references super

        super(TransformerEncoderLayer, self).__init__()

        assert d_model > 0, ("Expected d_model to be greater than 0, "
                             "but received {}".format(d_model))
        assert nhead > 0, ("Expected nhead to be greater than 0, "
                           "but received {}".format(nhead))
        assert dim_feedforward > 0, (
            "Expected dim_feedforward to be greater than 0, "
            "but received {}".format(dim_feedforward))

        attn_dropout = dropout if attn_dropout is None else attn_dropout
        act_dropout = dropout if act_dropout is None else act_dropout
        self.normalize_before = normalize_before

        weight_attrs = _convert_param_attr_to_list(weight_attr, 2)
        bias_attrs = _convert_param_attr_to_list(bias_attr, 2)

        self.self_attn = MultiHeadAttention(d_model,
                                            nhead,
                                            dropout=attn_dropout,
                                            weight_attr=weight_attrs[0],
                                            bias_attr=bias_attrs[0])
        self.linear1 = Linear(d_model,
                              dim_feedforward,
                              weight_attrs[1],
                              bias_attr=bias_attrs[1])
        self.dropout = Dropout(act_dropout, mode="upscale_in_train")
        self.linear2 = Linear(dim_feedforward,
                              d_model,
                              weight_attrs[1],
                              bias_attr=bias_attrs[1])
        self.norm1 = LayerNorm(d_model)
        self.norm2 = LayerNorm(d_model)
        self.dropout1 = Dropout(dropout, mode="upscale_in_train")
        self.dropout2 = Dropout(dropout, mode="upscale_in_train")
        self.activation = getattr(F, activation)

    def forward(self, src, src_mask=None, cache=None):
        r"""
        Applies a Transformer encoder layer on the input.

        Parameters:
            src (Tensor): The input of Transformer encoder layer. It is
                a tensor with shape `[batch_size, sequence_length, d_model]`.
                The data type should be float32 or float64.
            src_mask (Tensor, optional): A tensor used in multi-head attention
                to prevent attention to some unwanted positions, usually the
                paddings or the subsequent positions. It is a tensor with shape
                broadcasted to `[batch_size, n_head, sequence_length, sequence_length]`.
                When the data type is bool, the unwanted positions have `False` 
                values and the others have `True` values. When the data type is 
                int, the unwanted positions have 0 values and the others have 1 
                values. When the data type is float, the unwanted positions have 
                `-INF` values and the others have 0 values. It can be None when 
                nothing wanted or needed to be prevented attention to. Default None.
            cache (Tensor, optional): It is an instance of `MultiHeadAttention.Cache`.
                See `TransformerEncoderLayer.gen_cache` for more details. It is
                only used for inference and should be None for training. Default
                None.

        Returns:
            Tensor|tuple: It is a tensor that has the same shape and data type \
                as `src`, representing the output of Transformer encoder \
                layer. Or a tuple if `cache` is not None, except for encoder \
                layer output, the tuple includes the new cache which is same \
                as input `cache` argument but `incremental_cache` has an \
                incremental length. See `MultiHeadAttention.gen_cache` and \
                `MultiHeadAttention.forward` for more details.
        """
        src_mask = _convert_attention_mask(src_mask, src.dtype) # src_mask marks the positions that must not be attended to

        residual = src # [bs, seq_len, d_model (embedding size)]
        if self.normalize_before:
            src = self.norm1(src)
        # Add cache for encoder for the usage like UniLM
        if cache is None:
            src = self.self_attn(src, src, src, src_mask) # self-attention first: q, k and v are all the same tensor, as in the paper
        else:
            src, incremental_cache = self.self_attn(src, src, src, src_mask,
                                                    cache)

        src = residual + self.dropout1(src)
        if not self.normalize_before:
            src = self.norm1(src)

        residual = src
        if self.normalize_before:
            src = self.norm2(src)
        src = self.linear2(self.dropout(self.activation(self.linear1(src))))
        src = residual + self.dropout2(src)
        if not self.normalize_before:
            src = self.norm2(src)
        return src if cache is None else (src, incremental_cache)

    def gen_cache(self, src):
        r"""
        Generates cache for `forward` usage. The generated cache is an 
        instance of `MultiHeadAttention.Cache`.

        Parameters:
            src (Tensor): The input of Transformer encoder. It is a tensor
                with shape `[batch_size, source_length, d_model]`. The data 
                type should be float32 or float64.

        Returns:
            incremental_cache: It is an instance of `MultiHeadAttention.Cache` \
                produced by `self_attn.gen_cache`, it reserves two tensors 
                shaped `[batch_size, nhead, 0, d_model // nhead]`. See \
                `MultiHeadAttention.gen_cache` and `MultiHeadAttention.forward` \
                for more details.
        """
        incremental_cache = self.self_attn.gen_cache(src,
                                                     type=self.self_attn.Cache)
        return incremental_cache


if __name__ == "__main__":
    
    # encoder input: [batch_size, seq_len, d_model]
    enc_input = paddle.rand((16, 4, 128))
    # self attention mask: [batch_size, n_head, seq_len, seq_len]
    attn_mask = paddle.rand((16, 2, 4, 4))
    encoder_layer = TransformerEncoderLayer(128, 2, 512)  # (d_model, nhead, dim_feedforward); d_model is the embedding size
    enc_output = encoder_layer(enc_input, attn_mask)  # [16, 4, 128] # [bs, seq_len, d_model]
    
    print("TransformerEncoderLayer", enc_output.shape)
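
# ---- Debug sketch (not part of the Paddle source) ----
# A framework-free sketch of what the bool/int mask convention in the
# docstring above amounts to: the mask is turned into an additive term,
# (mask - 1.0) * 1e9, and added to the attention scores before softmax, so
# masked positions end up with zero weight. The helper names
# `_demo_additive_mask` and `_demo_softmax` are hypothetical and exist only
# for this illustration; plain lists stand in for tensors.
import math


def _demo_additive_mask(keep):
    # 1/True (may attend) -> 0.0, 0/False (masked) -> -1e9
    return [(float(k) - 1.0) * 1e9 for k in keep]


def _demo_softmax(logits):
    # Numerically stable softmax over a plain list of floats.
    top = max(logits)
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    demo_scores = [2.0, 1.0, 3.0, 0.5]
    demo_keep = [True, True, False, False]
    demo_masked = [s + m for s, m in zip(demo_scores,
                                         _demo_additive_mask(demo_keep))]
    # The last two positions are masked and should receive zero weight.
    print("masked attention weights", _demo_softmax(demo_masked))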



class TransformerEncoder(Layer):  # a stack of encoder layers
    """
    TransformerEncoder is a stack of N encoder layers. 

    Parameters:
        encoder_layer (Layer): an instance of the `TransformerEncoderLayer`. It
            would be used as the first layer, and the other layers would be created
            according to the configurations of it.
        num_layers (int): The number of encoder layers to be stacked.
        norm (LayerNorm, optional): the layer normalization component. If provided,
            apply layer normalization on the output of last encoder layer.

    Examples:

        .. code-block:: python

            import paddle
            from paddle.nn import TransformerEncoderLayer, TransformerEncoder

            # encoder input: [batch_size, src_len, d_model]
            enc_input = paddle.rand((2, 4, 128))
            # self attention mask: [batch_size, n_head, src_len, src_len]
            attn_mask = paddle.rand((2, 2, 4, 4))
            encoder_layer = TransformerEncoderLayer(128, 2, 512)
            encoder = TransformerEncoder(encoder_layer, 2)
            enc_output = encoder(enc_input, attn_mask)  # [2, 4, 128]
    """

    def __init__(self, encoder_layer, num_layers, norm=None):
        super(TransformerEncoder, self).__init__()
        # The first layer is reused as-is; the remaining layers are re-created
        # from `._config` (other implementations use deepcopy for this).
        self.layers = LayerList([
            (encoder_layer if i == 0 else type(encoder_layer)(
                **encoder_layer._config)) for i in range(num_layers)
        ])
        self.num_layers = num_layers
        self.norm = norm

    def forward(self, src, src_mask=None, cache=None):
        r"""
        Applies a stack of N Transformer encoder layers on inputs. If `norm` is
        provided, also applies layer normalization on the output of last encoder
        layer.

        Parameters:
            src (Tensor): The input of Transformer encoder. It is a tensor
                with shape `[batch_size, sequence_length, d_model]`. The data
                type should be float32 or float64.
            src_mask (Tensor, optional): A tensor used in multi-head attention
                to prevent attention to some unwanted positions, usually the
                paddings or the subsequent positions. It is a tensor with shape
                broadcasted to `[batch_size, n_head, sequence_length, sequence_length]`.
                When the data type is bool, the unwanted positions have `False`
                values and the others have `True` values. When the data type is
                int, the unwanted positions have 0 values and the others have 1
                values. When the data type is float, the unwanted positions have
                `-INF` values and the others have 0 values. It can be None when
                no position needs to be masked. Default None.
            cache (list, optional): It is a list, and each element in the list
                is `incremental_cache` produced by `TransformerEncoderLayer.gen_cache`. 
                See `TransformerEncoder.gen_cache` for more details. It is only
                used for inference and should be None for training. Default None.

        Returns:
            Tensor|tuple: It is a tensor that has the same shape and data type \
                as `src`, representing the output of Transformer encoder. \
                Or a tuple if `cache` is not None, except for encoder output, \
                the tuple includes the new cache which is same as input `cache` \
                argument but `incremental_cache` in it has an incremental length. \
                See `MultiHeadAttention.gen_cache` and `MultiHeadAttention.forward` \
                for more details.
        """
        src_mask = _convert_attention_mask(src_mask, src.dtype)

        output = src
        new_caches = []
        for i, mod in enumerate(self.layers):
            if cache is None:
                output = mod(output, src_mask=src_mask)
            else:
                output, new_cache = mod(output,
                                        src_mask=src_mask,
                                        cache=cache[i])
                new_caches.append(new_cache)

        if self.norm is not None:
            output = self.norm(output)

        return output if cache is None else (output, new_caches)

    def gen_cache(self, src):
        r"""
        Generates cache for `forward` usage. The generated cache is a list, and
        each element in it is `incremental_cache` produced by 
        `TransformerEncoderLayer.gen_cache`. See `TransformerEncoderLayer.gen_cache`
        for more details.

        Parameters:
            src (Tensor): The input of Transformer encoder. It is a tensor
                with shape `[batch_size, source_length, d_model]`. The data type
                should be float32 or float64.

        Returns:
            list: It is a list, and each element in the list is `incremental_cache` 
            produced by `TransformerEncoderLayer.gen_cache`. See 
            `TransformerEncoderLayer.gen_cache` for more details.
        """
        cache = [layer.gen_cache(src) for layer in self.layers]
        return cache


if __name__ == "__main__":
    # encoder input: [batch_size, src_len, d_model]
    enc_input = paddle.rand((32, 4, 128))
    # self attention mask: [batch_size, n_head, src_len, src_len]
    attn_mask = paddle.rand((32, 2, 4, 4))
    encoder_layer = TransformerEncoderLayer(128, 2, 512)  # (d_model, nhead, dim_feedforward)
    encoder = TransformerEncoder(encoder_layer, 8)        # (encoder_layer, num_layers): the layer is repeated num_layers times
    enc_output = encoder(enc_input, attn_mask)            # [32, 4, 128], i.e. [bs, seq_len, d_model (embed_size)]
    
    print("TransformerEncoder", enc_output.shape)
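
# ---- Debug sketch (not part of the Paddle source) ----
# How `TransformerEncoder.__init__` stacks its layers, sketched without
# Paddle: the layer passed in becomes layer 0, and the remaining layers are
# re-created from the constructor arguments saved in `_config`, so each one
# is a separate object with the same configuration. `_DemoLayer` and
# `_demo_stack` are hypothetical stand-ins used only for this illustration.
class _DemoLayer:
    def __init__(self, d_model, nhead, dropout=0.1):
        # Same trick as the real layers: remember the constructor arguments.
        self._config = locals()
        self._config.pop("self")
        self._config.pop("__class__", None)  # py3


def _demo_stack(layer, num_layers):
    # Reuse the given instance first, then build fresh ones from its config.
    return [
        layer if i == 0 else type(layer)(**layer._config)
        for i in range(num_layers)
    ]


if __name__ == "__main__":
    demo_layers = _demo_stack(_DemoLayer(128, 2), 3)
    print("stacked configs", [layer._config for layer in demo_layers])
    print("distinct objects", len({id(layer) for layer in demo_layers}))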


class TransformerDecoderLayer(Layer):
    """
    TransformerDecoderLayer is composed of three sub-layers which are decoder
    self (multi-head) attention, decoder-encoder cross attention and feedforward
    network. Before and after each sub-layer, pre-process and post-process would
    be applied on the input and output accordingly. If `normalize_before` is True,
    pre-process is layer normalization and post-process includes dropout and residual
    connection. Otherwise, there is no pre-process, and post-process includes dropout,
    residual connection and layer normalization.

    Parameters:
        d_model (int): The expected feature size in the input and output.
        nhead (int): The number of heads in multi-head attention(MHA).
        dim_feedforward (int): The hidden layer size in the feedforward network(FFN).
        dropout (float, optional): The dropout probability used in pre-process
            and post-process of MHA and FFN sub-layer. Default 0.1
        activation (str, optional): The activation function in the feedforward
            network. Default relu.
        attn_dropout (float, optional): The dropout probability used
            in MHA to drop some attention target. If None, use the value of
            `dropout`. Default None
        act_dropout (float, optional): The dropout probability used after FFN
            activation. If None, use the value of `dropout`. Default None
        normalize_before (bool, optional): Indicate whether to put layer normalization
            into preprocessing of MHA and FFN sub-layers. If True, pre-process is layer
            normalization and post-process includes dropout and residual connection.
            Otherwise, there is no pre-process, and post-process includes dropout,
            residual connection and layer normalization. Default False
        weight_attr (ParamAttr|list|tuple, optional): To specify the weight parameter property.
            If it is a list/tuple, `weight_attr[0]` would be used as `weight_attr` for
            self attention, `weight_attr[1]` would be used as `weight_attr` for
            cross attention, and `weight_attr[2]` would be used as `weight_attr`
            for linear in FFN. Otherwise, the three sub-layers all use it as
            `weight_attr` to create parameters. Default: None, which means the
            default weight parameter property is used. See usage for details
            in :ref:`api_paddle_fluid_param_attr_ParamAttr` .
        bias_attr (ParamAttr|list|tuple|bool, optional): To specify the bias parameter property.
            If it is a list/tuple, `bias_attr[0]` would be used as `bias_attr` for
            self attention, `bias_attr[1]` would be used as `bias_attr` for
            cross attention, and `bias_attr[2]` would be used as `bias_attr`
            for linear in FFN. Otherwise, the three sub-layers all use it as
            `bias_attr` to create parameters. The `False` value means the
            corresponding layer would not have trainable bias parameter. See
            usage for details in :code:`ParamAttr` . Default: None, which means
            the default bias parameter property is used.

    Examples:

        .. code-block:: python

            import paddle
            from paddle.nn import TransformerDecoderLayer

            # decoder input: [batch_size, tgt_len, d_model]
            dec_input = paddle.rand((2, 4, 128))
            # encoder output: [batch_size, src_len, d_model]
            enc_output = paddle.rand((2, 6, 128))
            # self attention mask: [batch_size, n_head, tgt_len, tgt_len]
            self_attn_mask = paddle.rand((2, 2, 4, 4))
            # cross attention mask: [batch_size, n_head, tgt_len, src_len]
            cross_attn_mask = paddle.rand((2, 2, 4, 6))
            decoder_layer = TransformerDecoderLayer(128, 2, 512)
            output = decoder_layer(dec_input,
                                   enc_output,
                                   self_attn_mask,
                                   cross_attn_mask)  # [2, 4, 128]
    """

    def __init__(self,
                 d_model,
                 nhead,
                 dim_feedforward,
                 dropout=0.1,
                 activation="relu",
                 attn_dropout=None,
                 act_dropout=None,
                 normalize_before=False,
                 weight_attr=None,
                 bias_attr=None):
        self._config = locals()
        self._config.pop("self")
        self._config.pop("__class__", None)  # py3

        super(TransformerDecoderLayer, self).__init__()

        assert d_model > 0, ("Expected d_model to be greater than 0, "
                             "but received {}".format(d_model))
        assert nhead > 0, ("Expected nhead to be greater than 0, "
                           "but received {}".format(nhead))
        assert dim_feedforward > 0, (
            "Expected dim_feedforward to be greater than 0, "
            "but received {}".format(dim_feedforward))

        attn_dropout = dropout if attn_dropout is None else attn_dropout
        act_dropout = dropout if act_dropout is None else act_dropout
        self.normalize_before = normalize_before

        weight_attrs = _convert_param_attr_to_list(weight_attr, 3)
        bias_attrs = _convert_param_attr_to_list(bias_attr, 3)

        self.self_attn = MultiHeadAttention(d_model,
                                            nhead,
                                            dropout=attn_dropout,
                                            weight_attr=weight_attrs[0],
                                            bias_attr=bias_attrs[0])
        self.cross_attn = MultiHeadAttention(d_model,
                                             nhead,
                                             dropout=attn_dropout,
                                             weight_attr=weight_attrs[1],
                                             bias_attr=bias_attrs[1])
        self.linear1 = Linear(d_model,
                              dim_feedforward,
                              weight_attrs[2],
                              bias_attr=bias_attrs[2])
        self.dropout = Dropout(act_dropout, mode="upscale_in_train")
        self.linear2 = Linear(dim_feedforward,
                              d_model,
                              weight_attrs[2],
                              bias_attr=bias_attrs[2])
        self.norm1 = LayerNorm(d_model)
        self.norm2 = LayerNorm(d_model)
        self.norm3 = LayerNorm(d_model)
        self.dropout1 = Dropout(dropout, mode="upscale_in_train")
        self.dropout2 = Dropout(dropout, mode="upscale_in_train")
        self.dropout3 = Dropout(dropout, mode="upscale_in_train")
        self.activation = getattr(F, activation)

    def forward(self, tgt, memory, tgt_mask=None, memory_mask=None, cache=None):
        r"""
        Applies a Transformer decoder layer on the input.

        Parameters:
            tgt (Tensor): The input of Transformer decoder layer. It is a tensor
                with shape `[batch_size, target_length, d_model]`. The data type
                should be float32 or float64.
            memory (Tensor): The output of Transformer encoder. It is a tensor
                with shape `[batch_size, source_length, d_model]`. The data type
                should be float32 or float64.
            tgt_mask (Tensor, optional): A tensor used in self attention
                to prevent attention to some unwanted positions, usually
                the subsequent positions. It is a tensor with shape broadcasted
                to `[batch_size, n_head, target_length, target_length]`.
                When the data type is bool, the unwanted positions have `False`
                values and the others have `True` values. When the data type is
                int, the unwanted positions have 0 values and the others have 1
                values. When the data type is float, the unwanted positions have
                `-INF` values and the others have 0 values. It can be None when
                no position needs to be masked. Default None.
            memory_mask (Tensor, optional): A tensor used in decoder-encoder
                cross attention to prevent attention to some unwanted positions,
                usually the paddings. It is a tensor with shape broadcasted to
                `[batch_size, n_head, target_length, source_length]`. When the
                data type is bool, the unwanted positions have `False` values
                and the others have `True` values. When the data type is int,
                the unwanted positions have 0 values and the others have 1
                values. When the data type is float, the unwanted positions have
                `-INF` values and the others have 0 values. It can be None when
                no position needs to be masked. Default None.
            cache (tuple, optional): It is a tuple( :code:`(incremental_cache, static_cache)` ),
                `incremental_cache` is an instance of `MultiHeadAttention.Cache`,
                `static_cache` is an instance of `MultiHeadAttention.StaticCache`.
                See `TransformerDecoderLayer.gen_cache` for more details. It is
                only used for inference and should be None for training. Default
                None.

        Returns:
            Tensor|tuple: It is a tensor that has the same shape and data type \
                as `tgt`, representing the output of Transformer decoder layer. \
                Or a tuple if `cache` is not None, except for decoder layer output, \
                the tuple includes the new cache which is same as input `cache` \
                argument but `incremental_cache` in it has an incremental length. \
                See `MultiHeadAttention.gen_cache` and `MultiHeadAttention.forward` \
                for more details.
        """
        tgt_mask = _convert_attention_mask(tgt_mask, tgt.dtype)
        memory_mask = _convert_attention_mask(memory_mask, memory.dtype)

        # ----------------- Masked Multi-Head Attention (decoder self-attention) -----------------
        residual = tgt
        if self.normalize_before:
            tgt = self.norm1(tgt)
        if cache is None:
            tgt = self.self_attn(tgt, tgt, tgt, tgt_mask, None)
        else:
            tgt, incremental_cache = self.self_attn(tgt, tgt, tgt, tgt_mask,
                                                    cache[0])
        tgt = residual + self.dropout1(tgt)
        if not self.normalize_before:
            tgt = self.norm1(tgt)

        
        # ----------------- Multi-Head Attention (decoder-encoder cross attention) -----------------
        residual = tgt
        if self.normalize_before:
            tgt = self.norm2(tgt)
        if cache is None:
            tgt = self.cross_attn(tgt, memory, memory, memory_mask, None)
        else:
            tgt, static_cache = self.cross_attn(tgt, memory, memory,
                                                memory_mask, cache[1])
        tgt = residual + self.dropout2(tgt)
        if not self.normalize_before:
            tgt = self.norm2(tgt)

        residual = tgt
        if self.normalize_before:
            tgt = self.norm3(tgt)
        tgt = self.linear2(self.dropout(self.activation(self.linear1(tgt))))
        tgt = residual + self.dropout3(tgt)
        if not self.normalize_before:
            tgt = self.norm3(tgt)
        return tgt if cache is None else (tgt, (incremental_cache,
                                                static_cache))

    def gen_cache(self, memory):
        r"""
        Generates cache for `forward` usage. The generated cache is a tuple
        composed of an instance of `MultiHeadAttention.Cache` and an instance
        of `MultiHeadAttention.StaticCache`.

        Parameters:
            memory (Tensor): The output of Transformer encoder. It is a tensor
                with shape `[batch_size, source_length, d_model]`. The data type
                should be float32 or float64.

        Returns:
            tuple: It is a tuple( :code:`(incremental_cache, static_cache)` ). \
                `incremental_cache` is an instance of `MultiHeadAttention.Cache` \
                produced by `self_attn.gen_cache(memory, MultiHeadAttention.Cache)`, \
                it reserves two tensors shaped `[batch_size, nhead, 0, d_model // nhead]`. \
                `static_cache` is an instance of `MultiHeadAttention.StaticCache` \
                produced by `cross_attn.gen_cache(memory, MultiHeadAttention.StaticCache)`, \
                it reserves two tensors shaped `[batch_size, nhead, source_length, d_model // nhead]`.
                See `MultiHeadAttention.gen_cache` and `MultiHeadAttention.forward` \
                for more details.
        """
        incremental_cache = self.self_attn.gen_cache(memory,
                                                     type=self.self_attn.Cache)
        static_cache = self.cross_attn.gen_cache(
            memory, memory, type=self.cross_attn.StaticCache)
        return incremental_cache, static_cache
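
# ---- Debug sketch (not part of the Paddle source) ----
# Every sub-layer in the encoder and decoder layers above follows one
# pattern: optional pre-norm, the sub-layer itself, residual add, optional
# post-norm, switched by `normalize_before`. A Paddle-free sketch on plain
# floats, with dropout left out (it is the identity at inference time).
# `_demo_sublayer` and the toy callables are hypothetical names used only
# for this illustration.
def _demo_sublayer(x, sublayer, norm, normalize_before):
    residual = x
    if normalize_before:      # pre-norm: normalize the sub-layer input
        x = norm(x)
    x = residual + sublayer(x)
    if not normalize_before:  # post-norm: normalize after the residual add
        x = norm(x)
    return x


if __name__ == "__main__":
    demo_norm = lambda v: v / 2.0       # toy stand-in for LayerNorm
    demo_sublayer = lambda v: v + 1.0   # toy stand-in for attention / FFN
    print("pre-norm ", _demo_sublayer(4.0, demo_sublayer, demo_norm, True))
    print("post-norm", _demo_sublayer(4.0, demo_sublayer, demo_norm, False))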



if __name__ == "__main__":
    # decoder input:  [batch_size, tgt_len, d_model]
    dec_input  = paddle.rand((2, 4, 128))
    # encoder output: [batch_size, src_len, d_model]
    enc_output = paddle.rand((2, 6, 128))
    # self attention mask: [batch_size, n_head, tgt_len, tgt_len]
    self_attn_mask = paddle.rand((2, 2, 4, 4))
    # cross attention mask: [batch_size, n_head, tgt_len, src_len]
    cross_attn_mask = paddle.rand((2, 2, 4, 6))
    decoder_layer = TransformerDecoderLayer(128, 2, 512)  # d_model, nhead, dim_feedforward
    output = decoder_layer(dec_input,
                           enc_output,  # memory
                           self_attn_mask,
                           cross_attn_mask)  # [2, 4, 128] # [bs, seq_len, d_model]
    
    print("TransformerDecoderLayer", output.shape) # output.shape == dec_input.shape
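
# ---- Debug sketch (not part of the Paddle source) ----
# The debug block above feeds random float tensors as masks, which is enough
# to check shapes. For `tgt_mask`, the docstring says the unwanted positions
# are usually the subsequent ones; in the bool convention (True = may
# attend, False = masked) such a causal mask can be sketched with nested
# lists. `_demo_causal_mask` is a hypothetical helper, and converting the
# result into a Paddle tensor is left out here.
def _demo_causal_mask(length):
    # Row i (query position) may attend to columns j <= i only.
    return [[j <= i for j in range(length)] for i in range(length)]


if __name__ == "__main__":
    for demo_row in _demo_causal_mask(4):
        print(" ".join("x" if keep else "." for keep in demo_row))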


class TransformerDecoder(Layer):
    """
    TransformerDecoder is a stack of N decoder layers. 

    Parameters:
        decoder_layer (Layer): an instance of the `TransformerDecoderLayer`. It
            would be used as the first layer, and the other layers would be created
            according to the configurations of it.
        num_layers (int): The number of decoder layers to be stacked.
        norm (LayerNorm, optional): the layer normalization component. If provided,
            apply layer normalization on the output of last decoder layer.

    Examples:

        .. code-block:: python

            import paddle
            from paddle.nn import TransformerDecoderLayer, TransformerDecoder

            # decoder input: [batch_size, tgt_len, d_model]
            dec_input = paddle.rand((2, 4, 128))
            # encoder output: [batch_size, src_len, d_model]
            enc_output = paddle.rand((2, 6, 128))
            # self attention mask: [batch_size, n_head, tgt_len, tgt_len]
            self_attn_mask = paddle.rand((2, 2, 4, 4))
            # cross attention mask: [batch_size, n_head, tgt_len, src_len]
            cross_attn_mask = paddle.rand((2, 2, 4, 6))
            decoder_layer = TransformerDecoderLayer(128, 2, 512)
            decoder = TransformerDecoder(decoder_layer, 2)
            output = decoder(dec_input,
                             enc_output,
                             self_attn_mask,
                             cross_attn_mask)  # [2, 4, 128]
    """

    def __init__(self, decoder_layer, num_layers, norm=None):
        super(TransformerDecoder, self).__init__()
        self.layers = LayerList([
            (decoder_layer if i == 0 else type(decoder_layer)(
                **decoder_layer._config)) for i in range(num_layers)
        ])
        self.num_layers = num_layers
        self.norm = norm

    def forward(self, tgt, memory, tgt_mask=None, memory_mask=None, cache=None):
        r"""
        Applies a stack of N Transformer decoder layers on inputs. If `norm` is
        provided, also applies layer normalization on the output of last decoder
        layer.

        Parameters:
            tgt (Tensor): The input of Transformer decoder. It is a tensor
                with shape `[batch_size, target_length, d_model]`. The data type
                should be float32 or float64.
            memory (Tensor): The output of Transformer encoder. It is a tensor
                with shape `[batch_size, source_length, d_model]`. The data type
                should be float32 or float64.
            tgt_mask (Tensor, optional): A tensor used in self attention
                to prevent attention to some unwanted positions, usually
                the subsequent positions. It is a tensor with shape broadcasted
                to `[batch_size, n_head, target_length, target_length]`. When
                the data type is bool, the unwanted positions have `False`
                values and the others have `True` values. When the data type is
                int, the unwanted positions have 0 values and the others have 1
                values. When the data type is float, the unwanted positions have
                `-INF` values and the others have 0 values. It can be None when
                no position needs to be masked. Default None.
            memory_mask (Tensor, optional): A tensor used in decoder-encoder
                cross attention to prevent attention to some unwanted positions,
                usually the paddings. It is a tensor with shape broadcasted to
                `[batch_size, n_head, target_length, source_length]`. When the
                data type is bool, the unwanted positions have `False` values
                and the others have `True` values. When the data type is int,
                the unwanted positions have 0 values and the others have 1
                values. When the data type is float, the unwanted positions have
                `-INF` values and the others have 0 values. It can be None when
                no position needs to be masked. Default None.
            cache (list, optional): It is a list, and each element in the list
                is a tuple( :code:`(incremental_cache, static_cache)` ). See
                `TransformerDecoder.gen_cache` for more details. It is only
                used for inference and should be None for training. Default None.

        Returns:
            Tensor|tuple: It is a tensor that has the same shape and data type \
                as `tgt`, representing the output of Transformer decoder. \
                Or a tuple if `cache` is not None, except for decoder output, \
                the tuple includes the new cache which is same as input `cache` \
                argument but `incremental_cache` in it has an incremental length. \
                See `MultiHeadAttention.gen_cache` and `MultiHeadAttention.forward` \
                for more details.
        """
        tgt_mask = _convert_attention_mask(tgt_mask, tgt.dtype)
        memory_mask = _convert_attention_mask(memory_mask, memory.dtype)

        output = tgt
        new_caches = []
        for i, mod in enumerate(self.layers):
            if cache is None:
                output = mod(output,
                             memory,
                             tgt_mask=tgt_mask,
                             memory_mask=memory_mask,
                             cache=None)
            else:
                output, new_cache = mod(output,
                                        memory,
                                        tgt_mask=tgt_mask,
                                        memory_mask=memory_mask,
                                        cache=cache[i])
                new_caches.append(new_cache)

        if self.norm is not None:
            output = self.norm(output)

        return output if cache is None else (output, new_caches)

    def gen_cache(self, memory, do_zip=False):
        r"""
        Generates cache for `forward` usage. The generated cache is a list, and
        each element in it is a tuple( :code:`(incremental_cache, static_cache)` )
        produced by `TransformerDecoderLayer.gen_cache`. See `TransformerDecoderLayer.gen_cache`
        for more details. If `do_zip` is True, apply `zip` on these tuples to get
        a list with two elements.


        Parameters:
            memory (Tensor): The output of Transformer encoder. It is a tensor
                with shape `[batch_size, source_length, d_model]`. The data type
                should be float32 or float64.
            do_zip (bool, optional): Indicate whether to apply `zip` on the tuples.
                If True, return a list with two elements. Default False

        Returns:
            list: It is a list, and each element in the list is a tuple produced \
                by `TransformerDecoderLayer.gen_cache(memory)`. See `TransformerDecoderLayer.gen_cache` \
                for more details. If `do_zip` is True, apply `zip` on these tuples \
                and return a list with two elements.
        """
        cache = [layer.gen_cache(memory) for layer in self.layers]
        if do_zip:
            cache = list(zip(*cache))
        return cache


if __name__ == "__main__":
    # decoder input: [batch_size, tgt_len, d_model]
    dec_input = paddle.rand((2, 4, 128))
    # encoder output: [batch_size, src_len, d_model]
    enc_output = paddle.rand((2, 6, 128))
    # self attention mask: [batch_size, n_head, tgt_len, tgt_len]
    self_attn_mask = paddle.rand((2, 2, 4, 4))
    # cross attention mask: [batch_size, n_head, tgt_len, src_len]
    cross_attn_mask = paddle.rand((2, 2, 4, 6))
    decoder_layer = TransformerDecoderLayer(128, 2, 512) # d_model, nhead, dim_feedforward
    decoder = TransformerDecoder(decoder_layer, 2)
    output = decoder(dec_input,
                     enc_output,
                     self_attn_mask,
                     cross_attn_mask)  # [2, 4, 128]
    
    print("TransformerDecoder", output.shape)
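
# ---- Debug sketch (not part of the Paddle source) ----
# What `do_zip=True` does in `TransformerDecoder.gen_cache`: the per-layer
# list of (incremental_cache, static_cache) tuples is transposed into two
# tuples, one per cache kind. Strings stand in for the real cache objects,
# and `_demo_zip_cache` is a hypothetical helper for this illustration.
def _demo_zip_cache(per_layer_cache):
    # [(inc_0, static_0), (inc_1, static_1)] -> [(inc_0, inc_1), (static_0, static_1)]
    return list(zip(*per_layer_cache))


if __name__ == "__main__":
    demo_cache = [("inc_0", "static_0"), ("inc_1", "static_1")]
    print("do_zip=False", demo_cache)
    print("do_zip=True ", _demo_zip_cache(demo_cache))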


class Transformer(Layer):
    """
    A Transformer model composed of an instance of `TransformerEncoder` and an
    instance of `TransformerDecoder`. While the embedding layer and output layer
    are not included.

    Please refer to `Attention is all you need`,
    and see `TransformerEncoder` and `TransformerDecoder` for more details.

    Users can configure the model architecture with corresponding parameters.
    Note the usage of `normalize_before` representing where to apply layer
    normalization (in pre-process or post-process of multi-head attention or FFN),
    and some transformer-like models are different on this, such as
    `BERT` and `GPT2`.
    The default architecture here places layer normalization in post-process and
    applies another layer normalization on the output of last encoder/decoder layer.

    Parameters:
        d_model (int, optional): The expected feature size in the encoder/decoder input
            and output. Default 512
        nhead (int, optional): The number of heads in multi-head attention(MHA). Default 8
        num_encoder_layers (int, optional): The number of layers in encoder. Default 6
        num_decoder_layers (int, optional): The number of layers in decoder. Default 6
        dim_feedforward (int, optional): The hidden layer size in the feedforward network(FFN). Default 2048
        dropout (float, optional): The dropout probability used in pre-process
            and post-process of MHA and FFN sub-layer. Default 0.1
        activation (str, optional): The activation function in the feedforward
            network. Default relu.
        attn_dropout (float, optional): The dropout probability used
            in MHA to drop some attention target. If None, use the value of
            `dropout`. Default None
        act_dropout (float, optional): The dropout probability used after FFN
            activation. If None, use the value of `dropout`. Default None
        normalize_before (bool, optional): Indicate whether to put layer normalization
            into preprocessing of MHA and FFN sub-layers. If True, pre-process is layer
            normalization and post-process includes dropout and residual connection.
            Otherwise, there is no pre-process, and post-process includes dropout,
            residual connection and layer normalization. Default False
        weight_attr (ParamAttr|list|tuple, optional): To specify the weight parameter property.
            If it is a list/tuple, the length of `weight_attr` could be 1, 2 or 3. If it is 3,
            `weight_attr[0]` would be used as `weight_attr` for self attention, `weight_attr[1]`
            would be used as `weight_attr` for cross attention of `TransformerDecoder`,
            and `weight_attr[2]` would be used as `weight_attr` for linear in FFN.
            If it is 2, `weight_attr[0]` would be used as `weight_attr` both for self attention
            and cross attention and `weight_attr[1]` would be used as `weight_attr` for
            linear in FFN. If it is 1, `weight_attr[0]` would be used as `weight_attr`
            for self attention, cross attention and linear in FFN. If it is not a
            list/tuple, the three sub-layers all use it as `weight_attr` to create parameters.
            Default: None, which means the default weight parameter property is used.
            See usage for details in :code:`ParamAttr` .
        bias_attr (ParamAttr|list|tuple|bool, optional): To specify the bias parameter property.
            If it is a list/tuple, the length of `bias_attr` could be 1, 2 or 3. If it is 3,
            `bias_attr[0]` would be used as `bias_attr` for self attention, `bias_attr[1]`
            would be used as `bias_attr` for cross attention of `TransformerDecoder`,
            and `bias_attr[2]` would be used as `bias_attr` for linear in FFN.
            If it is 2, `bias_attr[0]` would be used as `bias_attr` both for self attention
            and cross attention and `bias_attr[1]` would be used as `bias_attr` for
            linear in FFN. If it is 1, `bias_attr[0]` would be used as `bias_attr`
            for self attention, cross attention and linear in FFN. If it is not a
            list/tuple, the three sub-layers all use it as `bias_attr` to create parameters.
            The `False` value means the corresponding layer would not have trainable
            bias parameter. See usage for details in :code:`ParamAttr` .
            Default: None, which means the default bias parameter property is used.
        custom_encoder (Layer, optional): If custom encoder is provided, use it as the encoder.
            Default None
        custom_decoder (Layer, optional): If custom decoder is provided, use it as the decoder.
            Default None

    Examples:

        .. code-block:: python

            import paddle
            from paddle.nn import Transformer

            # src: [batch_size, src_len, d_model]
            enc_input = paddle.rand((2, 4, 128))
            # tgt: [batch_size, tgt_len, d_model]
            dec_input = paddle.rand((2, 6, 128))
            # src_mask: [batch_size, n_head, src_len, src_len]
            enc_self_attn_mask = paddle.rand((2, 2, 4, 4))
            # tgt_mask: [batch_size, n_head, tgt_len, tgt_len]
            dec_self_attn_mask = paddle.rand((2, 2, 6, 6))
            # memory_mask: [batch_size, n_head, tgt_len, src_len]
            cross_attn_mask = paddle.rand((2, 2, 6, 4))
            transformer = Transformer(128, 2, 4, 4, 512)
            output = transformer(enc_input,
                                 dec_input,
                                 enc_self_attn_mask,
                                 dec_self_attn_mask,
                                 cross_attn_mask)  # [2, 6, 128]
    """

    def __init__(self,
                 d_model=512,
                 nhead=8,
                 num_encoder_layers=6,
                 num_decoder_layers=6,
                 dim_feedforward=2048,
                 dropout=0.1,
                 activation="relu",
                 attn_dropout=None,
                 act_dropout=None,
                 normalize_before=False,
                 weight_attr=None,
                 bias_attr=None,
                 custom_encoder=None,
                 custom_decoder=None):
        super(Transformer, self).__init__()

        assert d_model > 0, ("Expected d_model to be greater than 0, "
                             "but received {}".format(d_model))
        assert nhead > 0, ("Expected nhead to be greater than 0, "
                           "but received {}".format(nhead))
        assert dim_feedforward > 0, (
            "Expected dim_feedforward to be greater than 0, "
            "but received {}".format(dim_feedforward))

        if isinstance(bias_attr, (list, tuple)):
            if len(bias_attr) == 1:
                encoder_bias_attr = [bias_attr[0]] * 2
                decoder_bias_attr = [bias_attr[0]] * 3
            elif len(bias_attr) == 2:
                encoder_bias_attr = bias_attr
                decoder_bias_attr = [bias_attr[0], bias_attr[0], bias_attr[-1]]
            elif len(bias_attr) == 3:
                encoder_bias_attr = [bias_attr[0], bias_attr[-1]]
                decoder_bias_attr = bias_attr
            else:
                assert False, (
                    "length of bias_attr should be 1 or 2 or 3 when it is a list/tuple"
                )
        else:
            encoder_bias_attr = bias_attr
            decoder_bias_attr = bias_attr

        if isinstance(weight_attr, (list, tuple)):
            if len(weight_attr) == 1:
                encoder_weight_attr = [weight_attr[0]] * 2
                decoder_weight_attr = [weight_attr[0]] * 3
            elif len(weight_attr) == 2:
                encoder_weight_attr = weight_attr
                decoder_weight_attr = [
                    weight_attr[0], weight_attr[0], weight_attr[-1]
                ]
            elif len(weight_attr) == 3:
                encoder_weight_attr = [weight_attr[0], weight_attr[-1]]
                decoder_weight_attr = weight_attr
            else:
                assert False, (
                    "length of weight_attr should be 1 or 2 or 3 when it is a list/tuple"
                )
        else:
            encoder_weight_attr = weight_attr
            decoder_weight_attr = weight_attr

        if custom_encoder is not None:
            self.encoder = custom_encoder
        else:
            encoder_layer = TransformerEncoderLayer(
                d_model, nhead, dim_feedforward, dropout, activation,
                attn_dropout, act_dropout, normalize_before,
                encoder_weight_attr, encoder_bias_attr)
            encoder_norm = LayerNorm(d_model)
            self.encoder = TransformerEncoder(encoder_layer, num_encoder_layers,
                                              encoder_norm)

        if custom_decoder is not None:
            self.decoder = custom_decoder
        else:
            decoder_layer = TransformerDecoderLayer(
                d_model, nhead, dim_feedforward, dropout, activation,
                attn_dropout, act_dropout, normalize_before,
                decoder_weight_attr, decoder_bias_attr)
            decoder_norm = LayerNorm(d_model)
            self.decoder = TransformerDecoder(decoder_layer, num_decoder_layers,
                                              decoder_norm)

        self.d_model = d_model
        self.nhead = nhead

    def forward(self, src, tgt, src_mask=None, tgt_mask=None, memory_mask=None):
        r"""
        Applies a Transformer model on the inputs.

        Parameters:
            src (Tensor): The input of Transformer encoder. It is a tensor
                with shape `[batch_size, source_length, d_model]`. The data type
                should be float32 or float64.
            tgt (Tensor): The input of Transformer decoder. It is a tensor
                with shape `[batch_size, target_length, d_model]`. The data type
                should be float32 or float64.
            src_mask (Tensor, optional): A tensor used in multi-head attention
                to prevent attention to some unwanted positions, usually the
                paddings or the subsequent positions. It is a tensor whose shape
                can be broadcast to `[batch_size, n_head, sequence_length, sequence_length]`.
                When the data type is bool, the unwanted positions have `False`
                values and the others have `True` values. When the data type is
                int, the unwanted positions have 0 values and the others have 1
                values. When the data type is float, the unwanted positions have
                `-INF` values and the others have 0 values. It can be None when
                no position needs to be masked. Default None.
            tgt_mask (Tensor, optional): A tensor used in self attention
                to prevent attention to some unwanted positions, usually
                the subsequent positions. It is a tensor whose shape can be
                broadcast to `[batch_size, n_head, target_length, target_length]`.
                When the data type is bool, the unwanted positions have `False`
                values and the others have `True` values. When the data type is
                int, the unwanted positions have 0 values and the others have 1
                values. When the data type is float, the unwanted positions have
                `-INF` values and the others have 0 values. It can be None when
                no position needs to be masked. Default None.
            memory_mask (Tensor, optional): A tensor used in decoder-encoder
                cross attention to prevent attention to some unwanted positions,
                usually the paddings. It is a tensor whose shape can be broadcast to
                `[batch_size, n_head, target_length, source_length]`. When the
                data type is bool, the unwanted positions have `False` values
                and the others have `True` values. When the data type is int,
                the unwanted positions have 0 values and the others have 1
                values. When the data type is float, the unwanted positions have
                `-INF` values and the others have 0 values. It can be None when
                no position needs to be masked. Default None.

        Returns:
            Tensor: It is a tensor that has the same shape and data type \
                as `tgt`, representing the output of Transformer decoder.
        """
        src_mask = _convert_attention_mask(src_mask, src.dtype)
        memory = self.encoder(src, src_mask=src_mask)

        tgt_mask = _convert_attention_mask(tgt_mask, tgt.dtype)
        memory_mask = _convert_attention_mask(memory_mask, memory.dtype)
        output = self.decoder(tgt,
                              memory,
                              tgt_mask=tgt_mask,
                              memory_mask=memory_mask)
        return output


    def generate_square_subsequent_mask(self, length):
        """
        Generate a square mask for the sequence. The mask ensures that the
        predictions for position i can depend only on the known outputs at
        positions less than i.

        Parameters:
            length (int|Tensor): The length of sequence.

        Returns:
            Tensor: Generated square mask according to the given length.

        Examples:
            .. code-block:: python

                import paddle
                from paddle.nn.layer.transformer import Transformer
                length = 5
                d_model, n_head, dim_feedforward = 8, 4, 64
                transformer_paddle = Transformer(
                    d_model, n_head, dim_feedforward=dim_feedforward)
                mask = transformer_paddle.generate_square_subsequent_mask(length)
                print(mask)

                # [[  0. -inf -inf -inf -inf]
                #  [  0.   0. -inf -inf -inf]
                #  [  0.   0.   0. -inf -inf]
                #  [  0.   0.   0.   0. -inf]
                #  [  0.   0.   0.   0.   0.]]

        """
        return paddle.tensor.triu(
            paddle.full(shape=[length, length],
                        fill_value=-np.inf,
                        dtype=paddle.get_default_dtype()), 1)
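
To see what these mask conventions do without running Paddle, the sketch below rebuilds the same upper-triangular pattern with plain Python lists and pushes one masked row of scores through a softmax. The names `square_subsequent_mask`, `to_additive` and `softmax` are hypothetical helpers, not Paddle APIs, and the scores are made up. It only assumes the additive convention from the `forward` docstring (0 for allowed positions, `-inf` or a large negative value for blocked ones) and the `(mask - 1.0) * 1e9` conversion quoted at the top of this post.

```python
import math


def square_subsequent_mask(length):
    """Hypothetical pure-Python analogue of generate_square_subsequent_mask.

    Entry [i][j] is 0.0 when j <= i (position i may attend to j) and -inf
    when j > i (a future position), i.e. triu(..., 1) of a -inf matrix.
    """
    return [[0.0 if j <= i else -math.inf for j in range(length)]
            for i in range(length)]


def to_additive(flags):
    """Turn a bool/int keep-mask into an additive mask: keep -> 0.0, drop -> -1e9."""
    return [(float(f) - 1.0) * 1e9 for f in flags]


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


mask = square_subsequent_mask(4)
scores = [0.5, 1.0, -0.3, 2.0]  # made-up attention logits for query position 1

# Float mask: add it to the logits, then normalize.
causal_weights = softmax([s + m for s, m in zip(scores, mask[1])])

# Bool mask: convert first, then treat it exactly like a float mask.
padding_weights = softmax(
    [s + m for s, m in zip(scores, to_additive([True, True, False, False]))])

print(causal_weights)
print(padding_weights)
```

In both cases the blocked positions end up with zero attention weight, which is why the mask is added before `softmax` rather than multiplied in afterwards.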
        
        


if __name__ == "__main__":
    
    # src: [batch_size, src_len, d_model]
    enc_input = paddle.rand((2, 4, 128))
    # tgt: [batch_size, tgt_len, d_model]
    dec_input = paddle.rand((2, 6, 128))
    # src_mask: [batch_size, n_head, src_len, src_len]
    enc_self_attn_mask = paddle.rand((2, 2, 4, 4))
    # tgt_mask: [batch_size, n_head, tgt_len, tgt_len]
    dec_self_attn_mask = paddle.rand((2, 2, 6, 6))
    # memory_mask: [batch_size, n_head, tgt_len, src_len]
    cross_attn_mask = paddle.rand((2, 2, 6, 4))
    
    # [d_model, nhead, num_encoder_layers, num_decoder_layers, dim_feedforward]
    transformer = Transformer(128, 2, 4, 4, 512)

    output = transformer(enc_input,
                         dec_input,
                         enc_self_attn_mask,
                         dec_self_attn_mask,
                         cross_attn_mask)  # [2, 6, 128]

    print("Transformer", output.shape)
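
If this debug script is extended to pass `weight_attr` or `bias_attr` as a list, it helps to see how the constructor splits that list between encoder and decoder layers. The sketch below restates that branching in plain Python so it can be stepped through without building a model. `split_attr` is a hypothetical helper name, the strings stand in for `ParamAttr` objects, and it raises `ValueError` where the original code uses `assert False`.

```python
def split_attr(attr):
    """Hypothetical restatement of the weight_attr/bias_attr branching.

    Returns (encoder_attr, decoder_attr). Encoder layers consume
    [self_attention, ffn]; decoder layers consume
    [self_attention, cross_attention, ffn].
    """
    if not isinstance(attr, (list, tuple)):
        # A single ParamAttr / bool / None is shared by every sub-layer.
        return attr, attr
    if len(attr) == 1:
        return [attr[0]] * 2, [attr[0]] * 3
    if len(attr) == 2:
        # attr[0] serves both attention blocks, attr[1] serves the FFN.
        return list(attr), [attr[0], attr[0], attr[1]]
    if len(attr) == 3:
        # The encoder has no cross attention, so attr[1] is skipped for it.
        return [attr[0], attr[2]], list(attr)
    raise ValueError(
        "length of attr should be 1 or 2 or 3 when it is a list/tuple")


three = split_attr(["self", "cross", "ffn"])
two = split_attr(["attn", "ffn"])
print(three)
print(two)
```

With three entries, the middle one only ever reaches the decoder's cross attention, so an attribute meant for the encoder should go in position 0 or 2.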
