Understanding Positional Encoding

Sinusoidal Position Encoding

$$
\begin{aligned}
PE_{(pos,\, 2i)} &= \sin\left(\frac{pos}{10000^{\frac{2i}{d_{\text{model}}}}}\right) \\
PE_{(pos,\, 2i+1)} &= \cos\left(\frac{pos}{10000^{\frac{2i}{d_{\text{model}}}}}\right)
\end{aligned}
$$
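A minimal NumPy sketch of these formulas (the function name and shapes here are illustrative, not taken from any library):

```python
import numpy as np

def sinusoidal_position_encoding(max_len, d_model):
    """Return a (max_len, d_model) matrix of sinusoidal positional encodings."""
    positions = np.arange(max_len)[:, None]        # (max_len, 1)
    two_i = np.arange(0, d_model, 2)[None, :]      # 2i for each sin/cos pair
    w = 1.0 / np.power(10000, two_i / d_model)     # w_i = 1 / 10000^(2i / d_model)
    angles = positions * w                         # pos * w_i, shape (max_len, d_model/2)

    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions use sin
    pe[:, 1::2] = np.cos(angles)   # odd dimensions use cos
    return pe

pe = sinusoidal_position_encoding(max_len=512, d_model=768)
print(pe.shape)  # (512, 768)
```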
The encoding at position pos + k can be expressed as a linear combination of the encoding at position pos. Their relationship follows from the angle-addition identities:
$$
\begin{aligned}
\sin(\alpha+\beta) &= \sin\alpha \cdot \cos\beta + \cos\alpha \cdot \sin\beta \\
\cos(\alpha+\beta) &= \cos\alpha \cdot \cos\beta - \sin\alpha \cdot \sin\beta
\end{aligned}
$$

The positional encoding at position pos + k can be written as:
$$
\begin{aligned}
PE_{(pos+k,\, 2i)} &= \sin\left(w_i (pos+k)\right) = \sin(w_i\, pos)\cos(w_i k) + \cos(w_i\, pos)\sin(w_i k) \\
PE_{(pos+k,\, 2i+1)} &= \cos\left(w_i (pos+k)\right) = \cos(w_i\, pos)\cos(w_i k) - \sin(w_i\, pos)\sin(w_i k)
\end{aligned}
$$

where $w_i = \dfrac{1}{10000^{2i/d_{\text{model}}}}$.

Simplifying:
$$
\begin{aligned}
PE_{(pos+k,\, 2i)} &= \cos(w_i k)\, PE_{(pos,\, 2i)} + \sin(w_i k)\, PE_{(pos,\, 2i+1)} \\
PE_{(pos+k,\, 2i+1)} &= \cos(w_i k)\, PE_{(pos,\, 2i+1)} - \sin(w_i k)\, PE_{(pos,\, 2i)}
\end{aligned}
$$

Since the terms involving k are all constants, $PE_{pos+k}$ can be expressed as a linear transformation of $PE_{pos}$.
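A quick numerical check of this linear relation, reusing the `sinusoidal_position_encoding` sketch above (a hypothetical helper, not a library function):

```python
import numpy as np

d_model, pos, k = 64, 10, 5
pe = sinusoidal_position_encoding(max_len=128, d_model=d_model)

two_i = np.arange(0, d_model, 2)
w = 1.0 / np.power(10000, two_i / d_model)   # w_i for each sin/cos pair

# Rebuild PE(pos + k) from PE(pos) using only the constants cos(w_i k) and sin(w_i k)
even = np.cos(w * k) * pe[pos, 0::2] + np.sin(w * k) * pe[pos, 1::2]
odd  = np.cos(w * k) * pe[pos, 1::2] - np.sin(w * k) * pe[pos, 0::2]

print(np.allclose(even, pe[pos + k, 0::2]))  # True
print(np.allclose(odd,  pe[pos + k, 1::2]))  # True
```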

Since
$$
PE_{(pos,\, 2i)} = \sin\left(pos \cdot \frac{1}{10000^{\frac{2i}{d_{\text{model}}}}}\right), \qquad T = 2\pi \cdot 10000^{\frac{2i}{d_{\text{model}}}}
$$

the period grows as i increases, ranging from $2\pi$ up to $2\pi \cdot 10000$.
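A short sketch of how the period grows with the dimension index i (values are approximate):

```python
import numpy as np

d_model = 512
two_i = np.arange(0, d_model, 2)
period = 2 * np.pi * np.power(10000, two_i / d_model)  # T = 2*pi * 10000^(2i / d_model)

print(period[0])    # ~6.28, i.e. 2*pi: the fastest-varying dimension
print(period[-1])   # ~60000+, approaching 2*pi*10000: the slowest-varying dimension
```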

Positional encoding in BERT

Source code:

# Excerpt from the Hugging Face transformers BertEmbeddings module (forward method omitted)
class BertEmbeddings(nn.Module):

    def __init__(self, config):
        super().__init__()
        self.word_embeddings = nn.Embedding(config.vocab_size, config.hidden_size, padding_idx=config.pad_token_id)	# (vocab_size, hidden_size)
        self.position_embeddings = nn.Embedding(config.max_position_embeddings, config.hidden_size)			# (512, hidden_size)
        self.token_type_embeddings = nn.Embedding(config.type_vocab_size, config.hidden_size)				# (2, hidden_size)

        # self.LayerNorm is not snake-cased to stick with TensorFlow model variable name and be able to load
        # any TensorFlow checkpoint file
        self.LayerNorm = BertLayerNorm(config.hidden_size, eps=config.layer_norm_eps)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)

BERT's embedding is the sum of three embeddings. Its positional encoding does not use the Transformer's sinusoidal functions; instead, it is learned during training through an Embedding layer.
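For reference, a simplified sketch of how `BertEmbeddings.forward` sums the three embeddings (abbreviated from the actual transformers implementation; argument handling is trimmed):

```python
import torch

def forward(self, input_ids, token_type_ids=None):
    seq_length = input_ids.size(1)
    position_ids = torch.arange(seq_length, dtype=torch.long, device=input_ids.device)
    position_ids = position_ids.unsqueeze(0).expand_as(input_ids)   # (batch, seq_len)
    if token_type_ids is None:
        token_type_ids = torch.zeros_like(input_ids)

    word_emb = self.word_embeddings(input_ids)           # (batch, seq_len, hidden)
    pos_emb = self.position_embeddings(position_ids)     # learned, not sinusoidal
    type_emb = self.token_type_embeddings(token_type_ids)

    embeddings = word_emb + pos_emb + type_emb           # element-wise sum of the three
    embeddings = self.LayerNorm(embeddings)
    embeddings = self.dropout(embeddings)
    return embeddings
```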
