KERPLE: A Length-Extrapolation Framework Based on Kernelized Relative Positional Embeddings

【Abstract】

This post introduces KERPLE, a new framework that kernelizes relative positional embeddings for length extrapolation. The main contributions are:

  1. Framework: KERPLE achieves length extrapolation by kernelizing relative positional embeddings, using conditionally positive definite (CPD) kernels to generalize how positional differences are represented.

  2. Use of CPD kernels: A CPD kernel can be implicitly converted into a positive definite (PD) kernel, preserving the inner-product interpretation of self-attention. This conversion lets KERPLE draw on information from tokens at greater distances.

  3. Experimental results: On three large language-modeling datasets, the logarithmic variant delivers strong length-extrapolation performance; compared with the other variants, KERPLE-log stands out in particular on the GitHub and ArXiv datasets.

  4. Theoretical grounding: KERPLE builds on the theory of conditionally positive definite and shift-invariant kernels, a class of functions that generalizes distance metrics.

  5. Improved relative positional embeddings: KERPLE enriches relative positional information through composite kernels, improving the model's performance on extrapolation tasks.

  6. Extended experiments: Additional experiments with different training/testing lengths, larger models, and the Wikitext-103 dataset confirm KERPLE's advantage in long-sequence modeling and length extrapolation.

In short, KERPLE introduces a new kernel mechanism that markedly improves extrapolation performance, with its effectiveness validated on multiple datasets; a minimal sketch of its two kernel variants follows.
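As a concrete sketch of the two kernel variants from the paper (power: -r1·|m-n|^r2; logarithmic: -r1·log(1 + r2·|m-n|)), here is a minimal numpy illustration. In the model, r1 and r2 are learnable positive per-head parameters; they are fixed scalars here for readability.

```python
import numpy as np

def kerple_power_bias(L, r1=1.0, r2=1.0):
    """Power variant: -r1 * |m - n| ** r2  (r1 > 0, 0 < r2 <= 2)."""
    pos = np.arange(L)
    dist = np.abs(pos[:, None] - pos[None, :])  # matrix of |m - n|
    return -r1 * dist ** r2

def kerple_log_bias(L, r1=1.0, r2=1.0):
    """Logarithmic variant: -r1 * log(1 + r2 * |m - n|)  (r1, r2 > 0)."""
    pos = np.arange(L)
    dist = np.abs(pos[:, None] - pos[None, :])
    return -r1 * np.log1p(r2 * dist)

# Each matrix is added to the pre-softmax attention logits; more distant
# token pairs receive a more negative bias, and the log variant decays
# most gently, which is credited for its strong extrapolation.
print(kerple_log_bias(5))
```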

【Data Sources】

The paper's main data sources are as follows:

  1. Experimental datasets
    • OpenWebText2
    • GitHub
    • ArXiv

These datasets are used to train and evaluate the models and to compare the different methods. They come from the open web, open-source code repositories, and academic papers respectively, covering text from a range of domains.

  2. Model implementation

    • Implemented on the GPT-NeoX codebase, developed by the EleutherAI team, which builds on NVIDIA's Megatron language model and uses Microsoft's DeepSpeed library for further acceleration.
  3. Baseline models and methods

    • ALiBi
    • Rotary
    • T5
    • Sinusoidal
    • KERPLE (the proposed kernelized relative positional embedding)

These baselines serve as points of comparison for validating the effectiveness of the proposed KERPLE framework.

  4. Hyperparameter settings
    • Nearly all hyperparameters (except batch size) are taken from the GPT-NeoX implementation.
    • The same configuration is used to train every method, ensuring a fair comparison.

In short, the experimental data come from public large-scale language-modeling datasets, the implementation is based on the GPT-NeoX codebase, and the baselines include existing relative positional embedding models along with classic sequence-modeling methods.

【Model Architecture】

KERPLE: A Framework for Kernelized Relative Positional Embedding for Length Extrapolation

Authors:

  • Ta-Chung Chi
  • Ting-Han Fan
  • Peter J. Ramadge
  • Alexander I. Rudnicky

Abstract:
KERPLE (Kernelized Relative Positional Embedding for Length Extrapolation) is a framework that generalizes relative positional embedding for sequence modeling tasks by kernelizing positional differences. The authors propose this approach to enable better length extrapolation, a challenge faced by many transformer models. The framework is designed to work with shift-invariant conditionally positive definite (CPD) kernels, which are transformed into positive definite (PD) kernels to maintain the inner product interpretation of self-attention.

Key Contributions:

  1. Framework for KERPLE: Introduces a method to kernelize relative positional embeddings to enable length extrapolation.
  2. CPD Kernels: Utilizes CPD kernels, a class of functions that can model distance metrics, to derive various relative positional embeddings.
  3. Experiments: Demonstrates the effectiveness of the logarithmic variant of KERPLE on large language modeling datasets with improved extrapolation performance.

Model Architecture:

  • Relative Positional Embeddings (RPE): Models the positional difference between tokens.
  • Kernelization: Uses CPD kernels to model the relative positional differences.
  • Transformation to PD Kernels: A constant offset is added to the CPD kernel to make it positive definite, preserving the inner-product interpretation required by self-attention (a numeric check of the CPD property follows this list).
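The CPD claim can be sanity-checked numerically: a kernel is conditionally positive definite if the quadratic form c^T K c is nonnegative for every coefficient vector c whose entries sum to zero. The sketch below checks this for k(m, n) = -|m - n| (the power bias with r1 = r2 = 1); it is an illustrative check, not code from the paper.

```python
import numpy as np

# CPD definition: sum_{i,j} c_i c_j k(x_i, x_j) >= 0 whenever sum_i c_i = 0.
# Here k(m, n) = -|m - n|, i.e. the KERPLE power bias with r1 = r2 = 1.
rng = np.random.default_rng(0)
pos = np.arange(16, dtype=float)
K = -np.abs(pos[:, None] - pos[None, :])

all_nonnegative = True
for _ in range(1000):
    c = rng.normal(size=len(pos))
    c -= c.mean()                            # enforce sum(c) == 0
    all_nonnegative &= (c @ K @ c) >= -1e-9  # small tolerance for float error
print(all_nonnegative)  # True, consistent with the CPD property
```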

Experiments:

  • Datasets: OpenWebText2, GitHub, and ArXiv.
  • Evaluation Metrics: Perplexity and training time.
  • Findings:
    • The logarithmic variant of KERPLE consistently outperforms other baselines across different datasets.
    • The logarithmic variant is 9.7% faster than T5 on average.
    • The logarithmic variant is more effective in long-range extrapolation compared to other models.

Detailed Model Overview:

  1. Input Embeddings: Query, key, and value vectors are generated for each token.
  2. Relative Positional Embeddings: Relative positional differences are modeled using CPD kernels.
  3. Kernel Function: The CPD kernel is evaluated on each relative positional difference to produce an additive bias on the attention logits.
  4. Softmax Normalization: The biased logits are normalized with softmax, preserving the inner-product interpretation (an end-to-end sketch follows this list).
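Putting the four steps together, the following is a minimal single-head numpy sketch, illustrative only: the placement of the bias inside the 1/sqrt(d) scaling follows the kernel-function equation in the next section, and r1, r2 stand in for the learnable parameters.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def kerple_log_attention(q, k, v, r1=1.0, r2=1.0):
    """Single-head self-attention with a KERPLE-log relative bias.

    q, k, v: (L, d) arrays. r1 and r2 are learnable in the paper;
    fixed scalars here for illustration.
    """
    L, d = q.shape
    pos = np.arange(L)
    bias = -r1 * np.log1p(r2 * np.abs(pos[:, None] - pos[None, :]))
    logits = (q @ k.T + bias) / np.sqrt(d)  # bias added before scaling
    return softmax(logits) @ v              # weighted sum of values

rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(6, 4)) for _ in range(3))
print(kerple_log_attention(q, k, v).shape)  # (6, 4)
```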

Key Equations:

  • Kernel Function (attention weights with the kernelized relative bias):
    \[
    a_{m,n} = \frac{\exp\left(\frac{q_m^\top k_n + \tilde{k}_{r_1,\dots,r_\ell}(m,n)}{\sqrt{d}}\right)}{\sum_{i=1}^{L} \exp\left(\frac{q_m^\top k_i + \tilde{k}_{r_1,\dots,r_\ell}(m,i)}{\sqrt{d}}\right)}
    \]
  • Shift-Invariant Property (softmax is unchanged by a constant shift c):
    \[
    \frac{\exp(x_i)}{\sum_{j=1}^{n} \exp(x_j)} = \frac{\exp(x_i + c)}{\sum_{j=1}^{n} \exp(x_j + c)}
    \]
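The shift-invariance identity is what allows the constant offset from the implicit CPD-to-PD conversion to be ignored at attention time; it is easy to verify numerically:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())  # stable softmax
    return e / e.sum()

x = np.array([0.5, -1.2, 3.0, 0.0])  # arbitrary logits
c = 7.3                              # arbitrary constant shift
print(np.allclose(softmax(x), softmax(x + c)))  # True
```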

Conclusion:

The KERPLE framework effectively addresses the challenge of length extrapolation in transformer models. By leveraging CPD kernels, the model captures long-range dependencies more effectively, leading to improved performance when extrapolating to sequences longer than those seen during training.
