等风来随风飘

Byte Pair Encoding (BPE): Algorithm and Code Notes

The Byte Pair Encoding (BPE) Algorithm

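BPE is the subword method used to build the vocabulary for Transformer-based language models such as GPT-2. Building the vocabulary proceeds roughly as follows: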

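1. Split the text in the corpus into characters.
2. Count how often each pair of adjacent symbols (bigram) occurs.
3. Merge the most frequent bigram into a single new symbol and add it to the vocabulary.
4. Repeat steps 2 and 3 until the vocabulary reaches the preset size, or until no bigram is left to merge.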
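The training steps above can be sketched on a toy corpus. This is a minimal illustration, not the GPT-2 training code: the function name `learn_bpe`, the whitespace word split, and the corpus are all assumptions made for the example.

```python
from collections import Counter

def learn_bpe(corpus, num_merges):
    """Toy BPE training (hypothetical helper): returns merges in learned order."""
    # Step 1: split every word into characters, keeping word frequencies.
    words = Counter(tuple(w) for w in corpus.split())
    merges = []
    for _ in range(num_merges):
        # Step 2: count adjacent pairs, weighted by word frequency.
        pair_counts = Counter()
        for word, freq in words.items():
            for a, b in zip(word, word[1:]):
                pair_counts[(a, b)] += freq
        if not pair_counts:
            break  # no pair left to merge
        # Step 3: merge the most frequent pair everywhere it occurs.
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        new_words = Counter()
        for word, freq in words.items():
            out, i = [], 0
            while i < len(word):
                if i < len(word) - 1 and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_words[tuple(out)] += freq
        words = new_words
    return merges

merges = learn_bpe("hug hug hug pug pug hugs", num_merges=2)
print(merges)
```

The position of a merge in the returned list is its rank. The encoder discussed below never counts frequencies itself; it only replays such a precomputed, ordered merge list.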

The rest of these notes walk through the BPE code used by GPT-2.

The complete code is shown below.

"""
bpe is short for Byte Pair Encoder. It translates arbitrary utf-8 strings into
sequences of integers, where each integer represents small chunks of commonly
occurring characters. This implementation is based on openai's gpt2 encoder.py:
https://github.com/openai/gpt-2/blob/master/src/encoder.py
but was mildly modified because the original implementation is a bit confusing.
I also tried to add as many comments as possible, my own understanding of what's
going on.
"""

import os
import json
import regex as re
import requests

import torch

# -----------------------------------------------------------------------------

def bytes_to_unicode():
    """
    Map each of the 256 possible byte values (8 bits -> 2**8) to a printable unicode character.
    Some bytes render poorly as-is (chr(0) is '\x00', for example), so OpenAI remaps them.
    
    Every possible byte (really an integer 0..255) gets mapped by OpenAI to a unicode
    character that represents it visually. Some bytes have their appearance preserved
    because they don't cause any trouble. These are defined in list bs. For example:
    chr(33) returns "!", so in the returned dictionary we simply have d[33] -> "!".
    However, chr(0), for example, is '\x00', which looks ugly. So OpenAI maps these
    bytes, into new characters in a range where chr() returns a single nice character.
    So in the final dictionary we have d[0] -> 'Ā' instead, which is just chr(0 + 2**8).
    In particular, the space character is 32, which we can see by ord(' '). Instead,
    this function will shift space (32) by 256 to 288, so d[32] -> 'Ġ'.
    So this is just a simple one-to-one mapping of bytes 0..255 into unicode characters
    that "look nice", either in their original form, or a funny shifted character
    like 'Ā', or 'Ġ', etc.
    """
    # the 188 integers that render fine in their original form and need no shifting
    bs = list(range(ord("!"), ord("~")+1))+list(range(ord("¡"), ord("¬")+1))+list(range(ord("®"), ord("ÿ")+1))
    cs = bs[:] # all integers b in bs will simply map to chr(b) in the output dict
    # now get the representations of the other 68 integers that do need shifting
    # each will get mapped chr(256 + n), where n will grow from 0...67 in the loop
    n = 0
    for b in range(2**8):
        if b not in bs:
            # if this byte is "ugly" then map it to the next available "nice" character
            bs.append(b)
            cs.append(2**8+n)
            n += 1
    cs = [chr(n) for n in cs]
    d = dict(zip(bs, cs))
    return d

def get_pairs(word):
    """
    Collect every pair of adjacent symbols (bigram) in a word.
    
    Return all bigrams as a set of tuples, of consecutive elements in the iterable word.
    """
    pairs = set()
    prev_char = word[0]
    for char in word[1:]:
        pairs.add((prev_char, char))
        prev_char = char
    return pairs

class Encoder:

    def __init__(self, encoder, bpe_merges):
        # byte encoder/decoder
        self.byte_encoder = bytes_to_unicode()
        self.byte_decoder = {v:k for k, v in self.byte_encoder.items()}
        # bpe token encoder/decoder
        self.encoder = encoder  # maps token strings to integer indices
        self.decoder = {v:k for k,v in self.encoder.items()}  # maps integer indices back to token strings
        # bpe merge list that defines the bpe "tree", of tuples (a,b) that are to merge to token ab
        self.bpe_ranks = dict(zip(bpe_merges, range(len(bpe_merges))))
        # the splitting pattern used for pre-tokenization
        # Should haved added re.IGNORECASE so BPE merges can happen for capitalized versions of contractions <-- original openai comment
        """
        ok so what is this regex looking for, exactly?
        python re reference: https://docs.python.org/3/library/re.html
        - the vertical bars | is OR, so re.findall will chunkate text as the pieces match, from left to right
        - '\'s' would split up things like Andrej's -> (Andrej, 's)
        - ' ?\p{L}+': optional space followed by 1+ unicode code points in the category "letter"
        - ' ?\p{N}+': optional space followed by 1+ unicode code points in the category "number"
        - ' ?[^\s\p{L}\p{N}]+': optional space, then 1+ things that are NOT a whitespace, letter or number
        - '\s+(?!\S)': 1+ whitespace characters (e.g. space or tab or etc) UNLESS they are followed by non-whitespace
                       so this will consume whitespace characters in a sequence but exclude the last whitespace in
                       that sequence. that last whitespace has the opportunity to then match the optional ' ?' in
                       earlier patterns.
        - '\s+': 1+ whitespace characters, intended probably to catch a full trailing sequence of whitespaces at end of string
        So TLDR:
        - we are special casing a few common apostrophe constructs ('s, 't, 're, ...) and making those into separate tokens
        - we then separate out strings into consecutive chunks of 1) letters, 2) numbers, 3) non-letter-numbers, 4) whitespaces
        """
        # pre-split the string with a regex into runs of letters, digits, whitespace and other characters, plus a few English contraction rules
        self.pat = re.compile(r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+""")
        self.cache = {}

    def bpe(self, token):
        """
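        Apply BPE merges to a single pre-tokenized token, driven by the precomputed self.bpe_ranks.
        bpe_ranks maps each mergeable bigram to its rank, i.e. the order in which that merge was
        learned on a large corpus; a lower rank means the merge is applied earlier.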
        
        this function uses self.bpe_ranks to iteratively merge all the possible bpe tokens
        up the tree. token is a string of one individual 'word' (after regex tokenization)
        and after byte encoding, e.g. 'Ġthere'.
        """
        # token is a string of one individual 'word', after byte encoding, e.g. 'Ġthere'

        # memoization, for efficiency
        if token in self.cache:  # cached result, skip the merge loop
            return self.cache[token]

        word = tuple(token) # individual characters that make up the token, in a tuple
        pairs = get_pairs(word) # get all bigrams

        if not pairs:
            return token

        while True:

            # find the next lowest rank bigram that can be merged
            bigram = min(pairs, key = lambda pair: self.bpe_ranks.get(pair, float('inf')))  # the lowest-rank (earliest-learned) pair is merged first
            if bigram not in self.bpe_ranks:  # none of the remaining pairs is in the merge table
                break # no more bigrams are eligible to be merged
            first, second = bigram

            # we will now replace all occurrences of (first, second) in the list of current
            # words into one merged token first_second, in the output list new_words
            new_word = []
            i = 0
            while i < len(word):  # merge the pair, handling repeated occurrences

                # find the next occurrence of first in the sequence of current words
                try:
                    j = word.index(first, i)
                    new_word.extend(word[i:j])
                    i = j
                except ValueError:  # first does not occur again; copy the rest and stop
                    new_word.extend(word[i:])
                    break

                # if this occurrence is also followed by second, then merge them into one
                if word[i] == first and i < len(word)-1 and word[i+1] == second:
                    new_word.append(first+second)
                    i += 2
                else:
                    new_word.append(word[i])
                    i += 1

            # all occurrences of (first, second) have been merged to first_second
            new_word = tuple(new_word)
            word = new_word
            if len(word) == 1:
                break
            else:
                pairs = get_pairs(word)

        # concat all words into a string, and use ' ' as the separator. Note that
        # by now all characters have been byte encoded, guaranteeing that ' ' is
        # not used in the actual data and is a 'special' delimiter character
        word = ' '.join(word)

        # cache the result and return
        self.cache[token] = word
        return word

    def encode(self, text):
        """ 
        Convert a string into a sequence of integer indices.
        
        string goes in, list of integers comes out
        """
        bpe_idx = []
        
        # pre-tokenize the input text into string tokens (words, roughly speaking)
        tokens = re.findall(self.pat, text)  # coarse regex pre-split
        
        # process each token into BPE integers
        for token in tokens:  # run the bpe merges inside each token
            # encode the token as a bytes (b'') object
            token_bytes = token.encode('utf-8')
            # translate all bytes to their unicode string representation and flatten
            token_translated = ''.join(self.byte_encoder[b] for b in token_bytes)
            # perform all the applicable bpe merges according to self.bpe_ranks
            token_merged = self.bpe(token_translated).split(' ')
            # translate all bpe tokens to integers
            token_ix = [self.encoder[bpe_token] for bpe_token in token_merged]
            # extend our running list of all output integers
            bpe_idx.extend(token_ix)
        return bpe_idx

    def encode_and_show_work(self, text):
        """ debugging function, same as encode but returns all intermediate work """
        bpe_idx = []
        parts = []
        tokens = re.findall(self.pat, text)
        for token in tokens:
            token_bytes = token.encode('utf-8')
            token_translated = ''.join(self.byte_encoder[b] for b in token_bytes)
            token_merged = self.bpe(token_translated).split(' ')
            token_ix = [self.encoder[bpe_token] for bpe_token in token_merged]
            bpe_idx.extend(token_ix)
            parts.append({
                'token': token,
                'token_bytes': token_bytes,
                'token_translated': token_translated,
                'token_merged': token_merged,
                'token_ix': token_ix,
            })
        out = {
            'bpe_idx': bpe_idx, # the actual output sequence
            'tokens': tokens, # result of pre-tokenization
            'parts': parts, # intermediates for each token part
        }
        return out

    def decode(self, bpe_idx):
        """ 
        Convert a sequence of integer indices back into a string.
        
        list of integers comes in, string comes out 
        """
        # inverse map the integers to get the tokens
        tokens_merged = [self.decoder[token] for token in bpe_idx]
        # inverse the byte encoder, e.g. recovering 'Ġ' -> ' ', and get the bytes
        tokens_flat = ''.join(tokens_merged)
        tokens_bytes = bytearray([self.byte_decoder[c] for c in tokens_flat])
        # recover the full utf-8 string
        text = tokens_bytes.decode('utf-8', errors='replace')
        return text

def get_file(local_file, remote_file):
    """ downloads remote_file to local_file if necessary """
    if not os.path.isfile(local_file):
        print(f"downloading {remote_file} to {local_file}")
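        response = requests.get(remote_file)
        response.raise_for_status()  # fail loudly rather than caching an error page
        with open(local_file, "wb") as f:  # make sure the file handle is closed
            f.write(response.content)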

def get_encoder():
    """
    Initialize from the cached copies of OpenAI's official GPT-2 tokenizer files.
    
    Returns an instance of the GPT BPE Encoder/Decoder
    and handles caching of "database" files.
    """
    home_dir = os.path.expanduser('~')
    cache_dir = os.path.join(home_dir, '.cache', 'mingpt')
    os.makedirs(cache_dir, exist_ok=True)

    # load encoder.json that has the raw mappings from token -> bpe index
    encoder_local_file = os.path.join(cache_dir, 'encoder.json')
    encoder_remote_file = 'https://openaipublic.blob.core.windows.net/gpt-2/models/124M/encoder.json'
    get_file(encoder_local_file, encoder_remote_file)
    with open(encoder_local_file, 'r') as f:
        encoder = json.load(f)
    assert len(encoder) == 50257 # 256 individual byte tokens, 50,000 merged tokens, and 1 special <|endoftext|> token

    # load vocab.bpe that contains the bpe merges, i.e. the bpe tree structure
    # in the form tuples (a, b), that indicate that (a, b) is to be merged to one token ab
    vocab_local_file = os.path.join(cache_dir, 'vocab.bpe')
    vocab_remote_file = 'https://openaipublic.blob.core.windows.net/gpt-2/models/124M/vocab.bpe'
    get_file(vocab_local_file, vocab_remote_file)
    with open(vocab_local_file, 'r', encoding="utf-8") as f:
        bpe_data = f.read()
    # light postprocessing: strip the version on first line and the last line is a blank
    bpe_merges = [tuple(merge_str.split()) for merge_str in bpe_data.split('\n')[1:-1]]
    assert len(bpe_merges) == 50000 # 50,000 merged tokens

    # construct the Encoder object and return
    enc = Encoder(encoder, bpe_merges)
    return enc

# -----------------------------------------------------------------------------

class BPETokenizer:
    """ PyTorch-aware class that wraps the Encoder above """

    def __init__(self):
        self.encoder = get_encoder()

    def __call__(self, text, return_tensors='pt'):
        # PyTorch only; here because we want to match huggingface/transformers interface
        assert return_tensors == 'pt'
        # single string input for now, in the future potentially a list of strings
        assert isinstance(text, str)
        # encode and create a "batch dimension" of 1
        idx = [self.encoder.encode(text)]
        # wrap into PyTorch tensor
        out = torch.tensor(idx, dtype=torch.long)
        return out

    def decode(self, idx):
        # ensure a simple 1D tensor for now
        assert idx.ndim == 1
        # decode indices to text
        text = self.encoder.decode(idx.tolist())
        return text
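Before turning to the merge logic, the byte-to-unicode mapping described in the docstring of bytes_to_unicode can be checked in isolation. The sketch below restates that function compactly so it runs on its own; the sample string is an arbitrary choice for illustration.

```python
def bytes_to_unicode():
    # Same mapping as in the listing, restated so this block is self-contained.
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("¡"), ord("¬") + 1))
          + list(range(ord("®"), ord("ÿ") + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)        # an "ugly" byte ...
            cs.append(256 + n)  # ... gets the next code point above 255
            n += 1
    return dict(zip(bs, (chr(c) for c in cs)))

byte_encoder = bytes_to_unicode()
byte_decoder = {v: k for k, v in byte_encoder.items()}

text = "héllo 世界"  # arbitrary sample containing multi-byte characters
# Encode direction: utf-8 bytes -> one printable character per byte.
translated = "".join(byte_encoder[b] for b in text.encode("utf-8"))
# Decode direction: printable characters -> bytes -> utf-8 string.
restored = bytearray(byte_decoder[c] for c in translated).decode("utf-8")

print(translated)
print(restored == text)
```

Because the mapping is one-to-one over all 256 byte values, any UTF-8 input survives the round trip, and the translated string never contains a literal space. That is what lets the bpe method use ' ' as a separator later on.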

To follow the whole BPE encoding process, start from the bpe method of the Encoder class, reproduced here:

def bpe(self, token):

    # memoization: return the cached result if this token was seen before
    if token in self.cache:
        return self.cache[token]

    word = tuple(token) # individual characters that make up the token, in a tuple
    pairs = get_pairs(word) # get all bigrams

    if not pairs:
        return token

    while True:

        # find the next lowest rank bigram that can be merged
        bigram = min(pairs, key = lambda pair: self.bpe_ranks.get(pair, float('inf')))  # the lowest-rank (earliest-learned) pair is merged first
        if bigram not in self.bpe_ranks:  # none of the remaining pairs is in the merge table
            break # no more bigrams are eligible to be merged
        first, second = bigram

        # we will now replace all occurrences of (first, second) in the list of current
        # words into one merged token first_second, in the output list new_words
        new_word = []
        i = 0
        while i < len(word):  # merge the pair, handling repeated occurrences

            # find the next occurrence of first in the sequence of current words
            try:
                j = word.index(first, i)
                new_word.extend(word[i:j])
                i = j
            except ValueError:  # first does not occur again; copy the rest and stop
                new_word.extend(word[i:])
                break

            # if this occurrence is also followed by second, then merge them into one
            if word[i] == first and i < len(word)-1 and word[i+1] == second:
                new_word.append(first+second)
                i += 2
            else:
                new_word.append(word[i])
                i += 1

        # all occurrences of (first, second) have been merged to first_second
        new_word = tuple(new_word)
        word = new_word
        if len(word) == 1:
            break
        else:
            pairs = get_pairs(word)

    # concat all words into a string, and use ' ' as the separator. Note that
    # by now all characters have been byte encoded, guaranteeing that ' ' is
    # not used in the actual data and is a 'special' delimiter character
    word = ' '.join(word)

    # cache the result and return
    self.cache[token] = word
    return word

The bpe method is explained block by block below:

"""
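The Encoder class initializes a cache. Each call to bpe first checks whether the token is already in the cache and, if so, returns the cached result immediately.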
"""
# memoization: return the cached result if this token was seen before
if token in self.cache:  
    return self.cache[token]

"""
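Split the token passed to bpe. At this point the token is a single word produced by pre-tokenizing the text (and already byte-encoded); tuple() breaks it into a tuple of its individual characters.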
"""
word = tuple(token) # individual characters that make up the token, in a tuple

"""
Call get_pairs on the character tuple to collect every pair of adjacent symbols.
"""
pairs = get_pairs(word) # get all bigrams

"""
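The input word is the ordered tuple of the token's symbols. Starting from the first symbol, every two adjacent symbols form a bigram; because the result is a set, a repeated bigram is stored only once.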
"""
def get_pairs(word):
    pairs = set()
    prev_char = word[0]
    for char in word[1:]:
        pairs.add((prev_char, char))
        prev_char = char
    return pairs
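A quick check of get_pairs on two small inputs makes its behavior concrete. The function body is copied from above; the inputs "banana" and "a" are chosen only for illustration.

```python
def get_pairs(word):
    pairs = set()
    prev_char = word[0]
    for char in word[1:]:
        pairs.add((prev_char, char))
        prev_char = char
    return pairs

# "banana" has five adjacent pairs, but only three distinct ones.
pairs = get_pairs(tuple("banana"))
print(sorted(pairs))

# A single-symbol word has no adjacent pair at all.
single = get_pairs(tuple("a"))
print(single)
```

Note that word[0] assumes a non-empty input. In the encoder this holds because every alternative of the pre-tokenization regex matches at least one character.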

"""
If the token yields no bigrams (it is a single character), return it unchanged.
"""
if not pairs:
    return token

"""
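Pick the bigram to merge next. bpe_ranks gives each known bigram its merge rank, and min() selects the lowest rank, i.e. the merge learned earliest in training, when that pair was the most frequent one. Bigrams missing from bpe_ranks are treated as having rank infinity.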
"""
# find the next lowest rank bigram that can be merged
bigram = min(pairs, key = lambda pair: self.bpe_ranks.get(pair, float('inf')))  # the lowest-rank (earliest-learned) pair is merged first

"""
bpe_ranks is a dictionary from bigram to rank. It is built from bpe_merges, the ordered list of merges read from the precomputed vocab.bpe file.
"""
self.bpe_ranks = dict(zip(bpe_merges, range(len(bpe_merges))))

"""
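Each line of the file is one bigram. The file stores no frequencies; the line order is the merge rank, and earlier lines are merges that were learned earlier, from pairs that were more frequent at the time.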
"""
vocab_local_file = os.path.join(cache_dir, 'vocab.bpe')
vocab_remote_file = 'https://openaipublic.blob.core.windows.net/gpt-2/models/124M/vocab.bpe'
get_file(vocab_local_file, vocab_remote_file)
with open(vocab_local_file, 'r', encoding="utf-8") as f:
    bpe_data = f.read()
# light postprocessing: strip the version on first line and the last line is a blank
bpe_merges = [tuple(merge_str.split()) for merge_str in bpe_data.split('\n')[1:-1]]
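How line order becomes rank can be seen on a tiny stand-in for vocab.bpe. The file contents below are hypothetical (a version header, four made-up merge lines, and a trailing newline), not the real GPT-2 data.

```python
# Hypothetical stand-in for the text of vocab.bpe.
bpe_data = "#version: 0.2\nĠ t\nh e\nĠt he\nr e\n"

# Drop the version header (first line) and the empty string after the last newline.
bpe_merges = [tuple(line.split()) for line in bpe_data.split("\n")[1:-1]]
# Position in the list is the rank: 0 is the merge that was learned first.
bpe_ranks = dict(zip(bpe_merges, range(len(bpe_merges))))

# Bigrams of the byte-encoded token 'Ġthere'.
pairs = {("Ġ", "t"), ("t", "h"), ("h", "e"), ("e", "r"), ("r", "e")}
best = min(pairs, key=lambda pair: bpe_ranks.get(pair, float("inf")))

# When no candidate is in the table, min() still returns a pair,
# so the caller has to test membership before merging.
unknown = min({("x", "y")}, key=lambda pair: bpe_ranks.get(pair, float("inf")))

print(bpe_ranks)
print(best, unknown in bpe_ranks)
```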

"""
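If even the lowest-ranked candidate is not in bpe_ranks, none of the remaining bigrams can be merged and the loop ends. Otherwise first and second are the two symbols of the bigram to merge.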
"""
if bigram not in self.bpe_ranks:  # none of the remaining pairs is in the merge table
    break # no more bigrams are eligible to be merged
first, second = bigram

"""
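This block rebuilds the token as new_word: symbols are copied over unchanged, except that every occurrence of first immediately followed by second is replaced by the merged symbol first+second.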
"""
# we will now replace all occurrences of (first, second) in the list of current
# words into one merged token first_second, in the output list new_words
new_word = []
i = 0
while i < len(word):  # merge the pair, handling repeated occurrences

    # find the next occurrence of first in the sequence of current words
    try:
        j = word.index(first, i)
        new_word.extend(word[i:j])
        i = j
    except ValueError:  # first does not occur again; copy the rest and stop
        new_word.extend(word[i:])
        break

    # if this occurrence is also followed by second, then merge them into one
    if word[i] == first and i < len(word)-1 and word[i+1] == second:
        new_word.append(first+second)
        i += 2
    else:
        new_word.append(word[i])
        i += 1
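The merge step is easier to test when pulled out into a standalone function. The name `merge_pair` is an assumption made for this sketch; the loop body follows the code above, with the bare except narrowed to ValueError.

```python
def merge_pair(word, first, second):
    """Merge every adjacent (first, second) in the tuple `word` (hypothetical helper)."""
    new_word = []
    i = 0
    while i < len(word):
        try:
            # Jump to the next occurrence of `first`, if there is one.
            j = word.index(first, i)
        except ValueError:
            new_word.extend(word[i:])  # no more `first`: copy the tail and stop
            break
        new_word.extend(word[i:j])  # copy everything skipped over
        i = j
        if i < len(word) - 1 and word[i + 1] == second:
            new_word.append(first + second)  # merge the pair into one symbol
            i += 2
        else:
            new_word.append(word[i])  # `first` without `second` after it
            i += 1
    return tuple(new_word)

merged = merge_pair(tuple("banana"), "a", "n")
overlap = merge_pair(tuple("aaa"), "a", "a")
print(merged, overlap)
```

The "aaa" case shows that matching is greedy from left to right: the first two symbols merge, and the third is left alone because its partner has already been consumed.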

"""
If only one symbol remains, the token is fully merged and the loop exits; otherwise recompute the bigrams and continue.
"""
# all occurrences of (first, second) have been merged to first_second
new_word = tuple(new_word)
word = new_word
if len(word) == 1:
    break
else:
    pairs = get_pairs(word)

"""
Finally, join the symbols with spaces into one string and store it in the cache.
"""
word = ' '.join(word)

# cache the result and return
self.cache[token] = word
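Putting the pieces together, the whole loop can be traced on one token with a hand-made merge table. This is a simplified sketch rather than the class method: the cache is omitted, the inner merge uses a plain linear scan instead of word.index, and `toy_ranks` is an invented table, not GPT-2's.

```python
def get_pairs(word):
    # Adjacent pairs of the symbol tuple; empty for a single symbol.
    return set(zip(word, word[1:]))

def bpe(token, bpe_ranks):
    word = tuple(token)
    pairs = get_pairs(word)
    if not pairs:
        return token
    while True:
        # Lowest rank wins; pairs outside the table rank as infinity.
        bigram = min(pairs, key=lambda p: bpe_ranks.get(p, float("inf")))
        if bigram not in bpe_ranks:
            break  # nothing mergeable is left
        first, second = bigram
        new_word, i = [], 0
        while i < len(word):
            if i < len(word) - 1 and word[i] == first and word[i + 1] == second:
                new_word.append(first + second)
                i += 2
            else:
                new_word.append(word[i])
                i += 1
        word = tuple(new_word)
        if len(word) == 1:
            break  # fully merged into a single symbol
        pairs = get_pairs(word)
    return " ".join(word)

# Invented merge table for illustration only.
toy_ranks = {("t", "h"): 0, ("th", "e"): 1, ("r", "e"): 2, ("Ġ", "the"): 3}
result = bpe("Ġthere", toy_ranks)
print(result)
```

With this table the token goes through Ġ t h e r e, Ġ th e r e, Ġ the r e, Ġ the re, and finally Ġthe re, where the only remaining pair is not in the table and the loop stops.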

Note

These notes use the BPE code from GPT-2 as an example and mainly record my reading of the bpe method in the Encoder class.
