https://www.aisolink.com/build-your-own-llama-3-architecture-from-scratch-using-pytorch
Summary: This article provides a detailed guide to building the complete Llama 3 model architecture from scratch using PyTorch, then training it and running inference on a custom dataset. It covers the steps for building the input block, the decoder block, and the output block, with the corresponding code examples. The end goal is a fully functional Llama 3 model that can generate new text from an input prompt.
Now that we know what we want to achieve, let's start building everything step by step.
As shown in the Llama 3 architecture diagram above, the input block has three components: the text/prompt, the tokenizer, and the embeddings. How do the components inside the input block work? As the saying goes, "a picture is worth a thousand words," so let's walk through the workflow inside the input block with the flow diagram below.
# Import necessary libraries
import torch
from torch import nn
from torch.nn import functional as F
import math
import numpy as np
import time
from dataclasses import dataclass
from typing import Optional, Tuple, List
import pandas as pd
from matplotlib import pyplot as plt
### Step 1: Input Block ###
# Using the Tiny Shakespeare dataset for a character-level tokenizer. Parts of the following character-level tokenizer are referenced from Andrej Karpathy's GitHub (https://github.com/karpathy/nanoGPT/blob/master/data/shakespeare_char/prepare.py), which I found to be explained very well.
# Load tiny_shakespeare data file (https://github.com/tamangmilan/llama3/blob/main/tiny_shakespeare.txt)
device: str = 'cuda' if torch.cuda.is_available() else 'cpu' # Assign device to cuda or cpu based on availability
# Load tiny_shakespeare data file.
with open('tiny_shakespeare.txt', 'r') as f:
    data = f.read()
# Prepare vocabulary by taking all the unique characters from the tiny_shakespeare data
vocab = sorted(list(set(data)))
# Training the Llama 3 model requires additional tokens such as <|begin_of_text|>, <|end_of_text|> and <|pad_id|>, so we'll add them to the vocabulary
vocab.extend(['<|begin_of_text|>','<|end_of_text|>','<|pad_id|>'])
vocab_size = len(vocab)
# Create a mapping between characters and their corresponding integer indexes in the vocabulary.
# This is needed to build the tokenizer's encode and decode functions.
itos = {i:ch for i, ch in enumerate(vocab)}
stoi = {ch:i for i, ch in enumerate(vocab)}
# Tokenizer's encode function: takes a string, outputs a list of integers
def encode(s):
    return [stoi[ch] for ch in s]
# Tokenizer's decode function: takes a list of integers, outputs a string
def decode(l):
    return ''.join(itos[i] for i in l)
# Define special token tensors to be used later during model training
token_bos = torch.tensor([stoi['<|begin_of_text|>']], dtype=torch.int, device=device)
token_eos = torch.tensor([stoi['<|end_of_text|>']], dtype=torch.int, device=device)
token_pad = torch.tensor([stoi['<|pad_id|>']], dtype=torch.int, device=device)
prompts = "Hello World"
encoded_tokens = encode(prompts)
decoded_text = decode(encoded_tokens)
### Test: Input Block Code ###
# You need to take out the triple quotes below to perform testing
"""
print(f"Lenth of shakespeare in character: {len(data)}")
print(f"The vocabulary looks like this: {''.join(vocab)}\n")
print(f"Vocab size: {vocab_size}")
print(f"encoded_tokens: {encoded_tokens}")
print(f"decoded_text: {decoded_text}")
"""
### Test Results: ###
"""
Length of shakespeare in characters: 1115394
The vocabulary looks like this:
!$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz<|begin_of_text|><|end_of_text|><|pad_id|>
Vocab size: 68
encoded_tokens: [20, 43, 50, 50, 53, 1, 35, 53, 56, 50, 42]
decoded_text: Hello World
"""
If you look at the architecture diagram above, you'll see that the decoder block consists of the following sub-components.
Let's dive into each of these sub-components one by one.
Why do we need RMSNorm? In the architecture diagram above, you must have noticed that the output of the input block, i.e. the embedding vector, passes through an RMSNorm block. This is because the embedding vector has many dimensions (4096 in Llama3-8b), and its values can easily fall into very different ranges. This can cause the model's gradients to explode or vanish, leading to slow convergence or even divergence. RMSNorm keeps these values within a bounded range, which helps stabilize and speed up training. It makes the magnitudes of the gradients more consistent, so the model converges faster.
Just like layer normalization, RMSNorm is applied along the embedding features or dimensions. The embedding in the figure above has shape [3,3], which means each token has 3 dimensions.
Example: let's apply RMSNorm to the embedding of the first token, X1:
Why RMSNorm instead of layer normalization? As you noticed in the example above, we did not compute any mean or variance, which is required in the case of layer normalization. So we can say that RMSNorm reduces computational overhead by avoiding the calculation of the mean and variance. Moreover, according to the authors' paper, RMSNorm provides performance advantages without compromising accuracy.
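To make the computation concrete before moving on to the decoder code, here is a minimal RMSNorm sketch in PyTorch. It is not the article's own listing; it follows the standard formulation used in Meta's Llama code, x * rsqrt(mean(x^2) + eps) multiplied by a learnable per-dimension weight, with eps left as a placeholder default.

# Minimal RMSNorm sketch (assumption: eps default of 1e-6 is a placeholder).
class RMSNorm(nn.Module):
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))  # learnable per-dimension scale (gamma)

    def _norm(self, x):
        # scale each token vector by 1/sqrt(mean of squares) along the embedding dimension
        return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)

    def forward(self, x):
        # note: no mean subtraction and no variance, unlike LayerNorm
        return self._norm(x.float()).type_as(x) * self.weight

# Quick check on a [batch, tokens, dim] tensor like the [3,3] embedding example above
x = torch.randn(1, 3, 3)
print(RMSNorm(dim=3)(x).shape)  # torch.Size([1, 3, 3])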
### Step 2: The Decoder Block ###
# Note: Since the Llama 3 model is developed by Meta, to stay in sync with their codebase and for future compatibility,
# I will use most of the code from Meta's GitHub, with some necessary changes required to achieve our goal.
# Define parameters dataclass: we'll use these parameters during model building, training and inference.
# Note: Since we want to see the results of training and inference quickly rather than focusing on high accuracy, we're using lower values for most of the parameters than those set in the Llama 3 model.
@dataclass
class ModelArgs:
    dim: int = 512 # embedding dimension
    n_layers: int = 8 # number of model decoder blocks
    n_heads: int = 8 # number of heads for queries embedding
    n_kv_heads: int = 4 # number of heads for keys and values embedding
    vocab_size: int = len(vocab) # length of vocabulary
    multiple_of: int = 256 # required to calculate the dim of the feedforward network
    ffn_dim_multiplier: Optional[float] = None # required to calculate the dim of the feedforward network
    norm_eps: float = 1e-5 # default epsilon value set for the RMSNorm calculation
    rope_theta: float = 10000.0 # default theta value for the RoPE calculation
    max_batch_size: int = 10 # max batch size
    max_seq_len: int = 256 # max sequence length
    epochs: int = 2500 # total number of training iterations
    log_interval: int = 10 # interval (in iterations) at which to print the logs and loss values
    device: str = 'cuda' if torch.cuda.is_available() else 'cpu' # assign device to cuda or cpu based on availability
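As a quick sanity check (not part of the original listing), the dataclass can be instantiated and a few fields inspected:

# Quick sanity check: instantiate the hyperparameter dataclass and inspect a few fields
params = ModelArgs()
print(params.dim, params.n_layers, params.n_kv_heads, params.device)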