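To build the llama.cpp project (the GitHub repository ggerganov/llama.cpp) on Windows, several of the project's source files have to be added to a Visual Studio project. These rough notes record which source files the main executable currently depends on at build time and where they live, as a memo for my own later reference.
As of May 16, 2023, the latest git commit of llama.cpp is 2a5ee023ad3022bc0b505343394b9754587fb731.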
Author: sandyiscool
Date: Tue May 16 14:00:15 2023 +0530
Add alternate include path for openblas (#1476)
In some linux distributions (fedora, for example), the include path for openblas is located at '/usr/local/include'
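The following files are required:
examples/main/main.cpp
files under the examples/ directory
Note: the build-info header (include guard BUILD_INFO_H below) goes in examples/main/. Its contents can simply be written by hand, as follows: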
#ifndef BUILD_INFO_H
#define BUILD_INFO_H
#define BUILD_NUMBER 1
#define BUILD_COMMIT "2a5ee02"
#endif // BUILD_INFO_H
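Here 2a5ee02 is the first few characters of the commit ID shown by git log.
These three files are in the repository root. They are needed because the object file ggml.o is compiled from them, and the build depends on ggml.o.
ggml-cuda.h is a special case. ggml.c includes it only under the following condition: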
#elif defined(GGML_USE_CUBLAS)
#include "ggml-cuda.h"
So if CUDA acceleration is not used, ggml-cuda.h is not needed.
llama.o is compiled from these files. As in ggml.c, ggml-cuda.h is included only when CUDA is used.
common.o is compiled from these files.
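The code is not fully compatible with Visual Studio, so the build fails with the default settings. There are two workarounds:
In the project properties, go to C/C++ -> General -> SDL checks and select No.
error loading model: this format is no longer supported (see https://github.com/ggerganov/llama.cpp/pull/1305)
llama_init_from_file: failed to load model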
llama_init_from_gpt_params: error: failed to load model 'ggml-model-q4_0.bin'
main: error: unable to load model
The model format is not supported. Either update the model file, or go back to an older version of the code and build that:
git reset --hard b608b55a3ea8e4760c617418538465449175bdb8
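The model file I had downloaded is in the old format, so to save effort I switched the project to commit b608b55a3ea8e4760c617418538465449175bdb8.
What follows walks through the execution of the main function in examples/main/main.cpp, the source of the compiled executable main.exe.
This code evidently loads the model.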
// load the model and apply lora adapter, if any
ctx = llama_init_from_gpt_params(params);
if (ctx == NULL) {
fprintf(stderr, "%s: error: unable to load model\n", __func__);
return 1;
}
The key function llama_init_from_gpt_params is declared in examples/common.h (a C++ header). Its signature is:
struct llama_context * llama_init_from_gpt_params(const gpt_params & params);
The struct gpt_params is also declared in examples/common.h. The full declaration is:
struct gpt_params {
int32_t seed = -1; // RNG seed
int32_t n_threads = get_num_physical_cores();
int32_t n_predict = -1; // new tokens to predict
int32_t n_parts = -1; // amount of model parts (-1 = determine from model dimensions)
int32_t n_ctx = 512; // context size
int32_t n_batch = 512; // batch size for prompt processing (must be >=32 to use BLAS)
int32_t n_keep = 0; // number of tokens to keep from initial prompt
// sampling parameters
std::unordered_map<llama_token, float> logit_bias; // logit bias for specific tokens
int32_t top_k = 40; // <= 0 to use vocab size
float top_p = 0.95f; // 1.0 = disabled
float tfs_z = 1.00f; // 1.0 = disabled
float typical_p = 1.00f; // 1.0 = disabled
float temp = 0.80f; // 1.0 = disabled
float repeat_penalty = 1.10f; // 1.0 = disabled
int32_t repeat_last_n = 64; // last n tokens to penalize (0 = disable penalty, -1 = context size)
float frequency_penalty = 0.00f; // 0.0 = disabled
float presence_penalty = 0.00f; // 0.0 = disabled
int mirostat = 0; // 0 = disabled, 1 = mirostat, 2 = mirostat 2.0
float mirostat_tau = 5.00f; // target entropy
float mirostat_eta = 0.10f; // learning rate
std::string model = "models/lamma-7B/ggml-model.bin"; // model path
std::string prompt = "";
std::string path_prompt_cache = ""; // path to file for saving/loading prompt eval state
std::string input_prefix = ""; // string to prefix user inputs with
std::string input_suffix = ""; // string to suffix user inputs with
std::vector<std::string> antiprompt; // string upon seeing which more user input is prompted
std::string lora_adapter = ""; // lora adapter path
std::string lora_base = ""; // base model path for the lora adapter
bool memory_f16 = true; // use f16 instead of f32 for memory kv
bool random_prompt = false; // do not randomize prompt if none provided
bool use_color = false; // use color to distinguish generations and inputs
bool interactive = false; // interactive mode
bool prompt_cache_all = false; // save user input and generations to prompt cache
bool embedding = false; // get only sentence embedding
bool interactive_first = false; // wait for user input immediately
bool multiline_input = false; // reverse the usage of `\`
bool instruct = false; // instruction mode (used for Alpaca models)
bool penalize_nl = true; // consider newlines as a repeatable token
bool perplexity = false; // compute perplexity over the prompt
bool use_mmap = true; // use mmap for faster loads
bool use_mlock = false; // use mlock to keep model in memory
bool mem_test = false; // compute maximum memory usage
bool verbose_prompt = false; // print prompt tokens before generation
};
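Most of the parameters are self-explanatory. A few notes:
top_k: sampling only considers the k most likely candidate tokens. If top_k is less than or equal to 0, the vocabulary size is used, so every possible token takes part in sampling.
logit_bias: a logit bias promotes or suppresses the generation of specific tokens by adding a bias term to each token's logit. A positive bias raises the probability of generating the token and a negative bias lowers it. This parameter is set when the --ignore-eos command-line option is given.
path_prompt_cache: caches the model state after the prompt has been evaluated.
prompt_cache_all: also saves user input and generated text to the prompt cache.
llama_init_from_gpt_params returns NULL on failure, and any error message is written directly to standard error.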
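To make the top_k and logit_bias notes above more concrete, here is a minimal sketch of both ideas on a plain array of scores. It does not use llama.cpp at all: the helper names apply_logit_bias and top_k_candidates are made up for illustration, and tokens are plain int values standing in for llama_token.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <unordered_map>
#include <vector>

// Hypothetical helper: add a per-token bias to the raw scores (logits).
// A large negative bias suppresses a token, a positive bias favours it.
void apply_logit_bias(std::vector<float>& logits,
                      const std::unordered_map<int, float>& bias) {
    for (const auto& entry : bias) {
        const int token = entry.first;
        // Ignore ids that fall outside the vocabulary.
        if (token >= 0 && static_cast<std::size_t>(token) < logits.size()) {
            logits[token] += entry.second;
        }
    }
}

// Hypothetical helper: return the ids of the k highest-scoring tokens,
// best first. k <= 0 means "use the whole vocabulary".
std::vector<int> top_k_candidates(const std::vector<float>& logits, int k) {
    std::vector<int> ids(logits.size());
    for (std::size_t i = 0; i < ids.size(); ++i) ids[i] = static_cast<int>(i);

    std::size_t keep = ids.size();
    if (k > 0 && static_cast<std::size_t>(k) < keep) keep = static_cast<std::size_t>(k);
    assert(keep <= ids.size());  // required by partial_sort below

    // Only the first `keep` positions need to end up sorted.
    std::partial_sort(ids.begin(), ids.begin() + keep, ids.end(),
                      [&](int a, int b) { return logits[a] > logits[b]; });
    ids.resize(keep);
    return ids;
}
```

Calling apply_logit_bias with a strongly negative value for one token id before top_k_candidates keeps that token out of the candidate list, which is presumably the effect --ignore-eos aims for with the EOS token.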
Test code:
#include "common.h"
#include <iostream>
int main() {
gpt_params params;
params.model = "D:\\my_files\\llama7b\\ggml-model-q4_0.bin";
auto llama_context = llama_init_from_gpt_params(params);
if (nullptr == llama_context) {
std::cout << "INIT FAIL" << std::endl;
return 1;
}
std::cout << "INIT SUCCESS" << std::endl;
return 0;
}
Output:
llama.cpp: loading model from D:\my_files\llama7b\ggml-model-q4_0.bin
llama_model_load_internal: format = ggjt v1 (latest)
llama_model_load_internal: n_vocab = 49954
llama_model_load_internal: n_ctx = 512
llama_model_load_internal: n_embd = 4096
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 32
llama_model_load_internal: n_layer = 32
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 2 (mostly Q4_0)
llama_model_load_internal: n_ff = 11008
llama_model_load_internal: n_parts = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 68.20 KB
llama_model_load_internal: mem required = 5897.00 MB (+ 1026.00 MB per state)
llama_init_from_file: kv self size = 256.00 MB
INIT SUCCESS
If the file is not a valid model, loading fails:
llama.cpp: loading model from D:\my_files\llama7b\tmp.txt
error loading model: unknown (magic, version) combination: 8890e509, e8b6b9e5; is this really a GGML file?
llama_init_from_file: failed to load model
llama_init_from_gpt_params: error: failed to load model 'D:\my_files\llama7b\tmp.txt'
INIT FAIL
Note that this function uses a lot of memory.
The context is released with the following function:
void llama_free(struct llama_context * ctx);
Test code:
#include "common.h"
#include <iostream>
int main() {
gpt_params params;
params.model = "D:\\my_files\\llama7b\\ggml-model-q4_0.bin";
auto llama_context = llama_init_from_gpt_params(params);
llama_free(llama_context);
return 0;
}
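llama_eval, the function that runs the language model, takes a llama_token* as its key parameter. This parameter holds the prompt, so the type needs to be understood first. The header llama.h contains the following declaration: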
typedef int llama_token;
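So for now llama_token is simply an integer type. The function llama_tokenize converts a string into llama_token values. It is declared as follows:
std::vector<llama_token> llama_tokenize(struct llama_context * ctx, const std::string & text, bool add_bos);
The model treats text as a sequence of tokens. The declaration is in common.h. For usage, see the following code from main.cpp: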
// tokenize the prompt
auto embd_inp = ::llama_tokenize(ctx, params.prompt, true);
For debugging, the following function from llama.h converts a llama_token back to text for printing:
// Token Id -> String. Uses the vocabulary in the provided context
LLAMA_API const char * llama_token_to_str(const struct llama_context * ctx, llama_token token);
Test code:
#include "common.h"
#include "llama.h"
#include <iostream>
int main() {
gpt_params params;
params.model = "D:\\my_files\\llama7b\\ggml-model-q4_0.bin";
auto llama_context = llama_init_from_gpt_params(params);
auto tokens = llama_tokenize(llama_context, "Hello world.", true);
std::cout << "Tokens:" << std::endl;
for (const auto& token : tokens) {
std::cout << "<";
std::cout << token << ":" << llama_token_to_str(llama_context, token);
std::cout << ">" << std::endl;
}
llama_free(llama_context);
return 0;
}
Output:
Tokens:
<1:>
<10994:Hello>
<3186: world>
<29889:.>
With the third argument of llama_tokenize changed to false, the output is:
Tokens:
<10994:Hello>
<3186: world>
<29889:.>
The core function is llama_eval, declared in llama.h.
// Run the llama inference to obtain the logits and probabilities for the next token.
// tokens + n_tokens is the provided batch of new tokens to process
// n_past is the number of tokens to use from previous eval calls
// Returns 0 on success
LLAMA_API int llama_eval(
struct llama_context * ctx,
const llama_token * tokens,
int n_tokens,
int n_past,
int n_threads);
Stepping through in the debugger shows that, with interactive command-line input, main enters the following while loop (at line 309):
while (n_remain != 0 || params.interactive) {
// loop body omitted here
}
Line 366 shows llama_eval in use:
if (llama_eval(ctx, &embd[i], n_eval, n_past, params.n_threads)) {
fprintf(stderr, "%s : failed to eval\n", __func__);
return 1;
}
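Looking at the arguments leads to the following conclusions about llama_eval:
ctx: the value returned by llama_init_from_gpt_params.
tokens: the address of the first element of a vector, that is, the position of the first llama_token the model should read. It is equivalent to the head of a llama_token[] array.
n_tokens: the number of new tokens in that batch, per the comment in llama.h.
n_past: the number of tokens used in previous llama_eval calls. My own understanding is that it says how many earlier tokens have already been evaluated and need not be recomputed, so the model (a transformer) can continue from position n_past+1 on top of the previous results.
n_threads: the number of threads used for the computation.
llama_eval returns 0 on success.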
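Because the model can only process a limited number of tokens at a time, a long input has to be processed in several batches. n_batch cannot exceed 512. The first token must be BOS. A BOS token can be obtained by calling the following function (declared in llama.h):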
llama_token llama_token_bos();
The function returns the token that represents BOS.
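The batching described above can be sketched as a small loop. This is only an illustration under assumptions: eval_fn is a stand-in I introduce for llama_eval (without the context and thread arguments), eval_in_batches is a made-up name, and tokens are plain int values.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Stand-in for llama_eval in this sketch:
// (tokens, n_tokens, n_past) -> 0 on success.
using eval_fn = std::function<int(const int* tokens, int n_tokens, int n_past)>;

// Feed `tokens` to `eval` in chunks of at most n_batch tokens.
// Returns the new n_past (tokens evaluated so far), or -1 if a chunk fails.
int eval_in_batches(const std::vector<int>& tokens, int n_batch, int n_past,
                    const eval_fn& eval) {
    assert(n_batch > 0);
    const std::size_t step = static_cast<std::size_t>(n_batch);
    for (std::size_t i = 0; i < tokens.size(); i += step) {
        // The last chunk may be shorter than n_batch.
        const int n_eval = static_cast<int>(std::min(tokens.size() - i, step));
        if (eval(&tokens[i], n_eval, n_past) != 0) return -1;
        n_past += n_eval;  // the next chunk continues after the ones just processed
    }
    return n_past;
}
```

With 7 tokens and n_batch = 3, the stand-in is called three times with n_past equal to 0, 3 and 6, and the function returns 7.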
Test code:
#include "common.h"
#include "llama.h"
#include <iostream>
#include <string>
#include <vector>
void showTokens(const llama_context* llama_context, const std::vector<llama_token>& tokens) {
using namespace std;
const auto bos = llama_token_bos();
string tokenString;
cout << "Token info:" << endl;
cout << "Total:" << tokens.size() << endl;
cout << "int[worlds]:" << endl;
for (const auto& token : tokens) {
auto words = llama_token_to_str(llama_context, token);
if (bos == token) {
words = "`BOS`";
}
tokenString += words;
cout << (int)token << "[" << words << "]" << endl;
}
cout << "String:" << tokenString << endl;
}
int main() {
using namespace std;
gpt_params params;
params.model = "D:\\my_files\\llama7b\\ggml-model-q4_0.bin";
auto llama_context = llama_init_from_gpt_params(params);
auto tokens = llama_tokenize(llama_context, "2,3,5,7,11,", true);
showTokens(llama_context, tokens);
auto result = llama_eval(llama_context, tokens.data(), tokens.size(), 0, 2);
if (0 == result) {
cout << "eval success" << endl;
} else {
cout << "fail" << endl;
}
llama_free(llama_context);
return 0;
}
Result:
llama.cpp: loading model from D:\my_files\llama7b\ggml-model-q4_0.bin
llama_model_load_internal: format = ggjt v1 (latest)
llama_model_load_internal: n_vocab = 49954
llama_model_load_internal: n_ctx = 512
llama_model_load_internal: n_embd = 4096
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 32
llama_model_load_internal: n_layer = 32
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 2 (mostly Q4_0)
llama_model_load_internal: n_ff = 11008
llama_model_load_internal: n_parts = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 68.20 KB
llama_model_load_internal: mem required = 5897.00 MB (+ 1026.00 MB per state)
llama_init_from_file: kv self size = 256.00 MB
Token info:
Total:11
int[worlds]:
1[`BOS`]
29906[2]
29892[,]
29941[3]
29892[,]
29945[5]
29892[,]
29955[7]
29892[,]
32040[11]
29892[,]
String:`BOS`2,3,5,7,11,
eval success
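A language model predicts the next token of a piece of text. Model initialization, token conversion, and model evaluation have been covered above. Line 406 of main.cpp shows how the output text is obtained, which mainly involves the following function.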
float * llama_get_logits(struct llama_context * ctx);
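This function retrieves the result from the last row of the model output and returns a pointer to an array of float. The array is indexed by token value (the llama_token of each vocabulary entry), and each element is the score the model assigns to that token being the next one. It is a "probability" only in a loose sense, since these are logits rather than normalized probabilities. According to the comments in the code, the array may be modified: specific values can be lowered or raised, and the modified values are then used when predicting what comes next.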
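Since the array holds logits, turning it into real probabilities takes a softmax step. The following is a minimal, self-contained sketch of that conversion on a std::vector<float>. It is not code from llama.cpp, and the function name softmax is my own.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Convert raw logits into probabilities that sum to 1 (softmax).
// Subtracting the maximum first keeps std::exp from overflowing.
std::vector<float> softmax(const std::vector<float>& logits) {
    std::vector<float> probs(logits.size());
    if (logits.empty()) return probs;

    const float max_logit = *std::max_element(logits.begin(), logits.end());
    float sum = 0.0f;
    for (std::size_t i = 0; i < logits.size(); ++i) {
        probs[i] = std::exp(logits[i] - max_logit);
        sum += probs[i];
    }
    assert(sum > 0.0f);  // the largest logit always contributes exp(0) = 1
    for (float& p : probs) p /= sum;
    return probs;
}
```

Softmax preserves the ordering of the scores, so the token with the largest logit is also the most probable one. That is why a greedy choice can work on the raw logits directly.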
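main.cpp first applies a penalty for the model repeating the same content and then samples from the adjusted result. When the temperature parameter temp is less than or equal to 0, the token with the highest probability is chosen directly. The test code below simply outputs the most probable token. The test input is the sequence of primes 2,3,5,7,11, so the expected next number is 13.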
Test code:
#include "common.h"
#include "llama.h"
#include <iostream>
llama_token get_max_probability_token(llama_context* context) {
auto tokens_probability = llama_get_logits(context);
long tokens_total = llama_n_vocab(context);
float max_probability = -100.0;
llama_token token = llama_token_eos();
for (long i = 0; i < tokens_total; i++) {
if (tokens_probability[i] > max_probability) {
max_probability = tokens_probability[i];
token = (llama_token)i;
}
}
return token;
}
int main() {
using namespace std;
gpt_params params;
params.model = "D:\\my_files\\llama7b\\ggml-model-q4_0.bin";
auto context = llama_init_from_gpt_params(params);
auto tokens = llama_tokenize(context, "2,3,5,7,11,", true);
llama_eval(context, tokens.data(), tokens.size(), 0, 2);
cout << "next:" << llama_token_to_str(context, get_max_probability_token(context)) << endl;
llama_free(context);
return 0;
}
Output:
llama.cpp: loading model from D:\my_files\llama7b\ggml-model-q4_0.bin
llama_model_load_internal: format = ggjt v1 (latest)
llama_model_load_internal: n_vocab = 49954
llama_model_load_internal: n_ctx = 512
llama_model_load_internal: n_embd = 4096
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 32
llama_model_load_internal: n_layer = 32
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 2 (mostly Q4_0)
llama_model_load_internal: n_ff = 11008
llama_model_load_internal: n_parts = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 68.20 KB
llama_model_load_internal: mem required = 5897.00 MB (+ 1026.00 MB per state)
llama_init_from_file: kv self size = 256.00 MB
next:13
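This matches the expectation. Note that taking the most probable token is not necessarily the best way to pick the next token. In practice many other selection methods are used, for example those controlled by the sampling parameters of gpt_params listed earlier (top_k, top_p, tfs_z, typical_p, temp, mirostat).
On top of that, repetition has to be penalized.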
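One simple way to penalize repetition is to push down the score of every token that already appeared recently. The sketch below shows that idea on a plain vector of logits. It is an assumption-laden illustration rather than the code main.cpp uses: penalize_repeats is a made-up name, and tokens are plain int values.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sketch of a repetition penalty: make every token that appears in
// `recent_tokens` less likely. Positive logits are divided by `penalty`
// and non-positive ones multiplied, so for penalty > 1 the score never goes up.
void penalize_repeats(std::vector<float>& logits,
                      const std::vector<int>& recent_tokens,
                      float penalty) {
    assert(penalty > 0.0f);
    std::vector<bool> done(logits.size(), false);  // penalize each token once
    for (int token : recent_tokens) {
        if (token < 0 || static_cast<std::size_t>(token) >= logits.size()) continue;
        if (done[token]) continue;
        done[token] = true;
        if (logits[token] > 0.0f) {
            logits[token] /= penalty;
        } else {
            logits[token] *= penalty;
        }
    }
}
```

A penalty of 1.0 leaves the logits unchanged, which matches the "1.0 = disabled" comment on repeat_penalty in gpt_params.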
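Appending the predicted token to the end of the array passed as the second argument of llama_eval, updating n_past accordingly, and running it again gives continuous prediction. The conversation below starts from this prompt: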
Jane: Hey, Michael, I've got a problem I need your help with.
Michael: Sure, what's the problem?
Code:
#include "common.h"
#include "llama.h"
#include <iostream>
#include <limits>
llama_token get_max_probability_token(llama_context* context) {
auto tokens_probability = llama_get_logits(context);
long tokens_total = llama_n_vocab(context);
// lowest() is the most negative float; min() would be the smallest positive one
float max_probability = std::numeric_limits<float>::lowest();
llama_token token = llama_token_eos();
for (long i = 0; i < tokens_total; i++) {
if (tokens_probability[i] > max_probability) {
max_probability = tokens_probability[i];
token = (llama_token)i;
}
}
return token;
}
int main() {
using namespace std;
gpt_params params;
params.model = "D:\\my_files\\llama7b\\ggml-model-q4_0.bin";
auto context = llama_init_from_gpt_params(params);
auto tokens = llama_tokenize(context, "Jane: Hey, Michael, I've got a problem I need your help with.\nMichael: Sure, what's the problem?\n", true);
auto eos_token = llama_token_eos();
long n_past = 0;
for (long i = 0; tokens.size() < llama_n_ctx(context); i++) {
llama_eval(context, &tokens[n_past], tokens.size() - n_past, n_past, 2);
auto predict_token = get_max_probability_token(context);
if (predict_token == eos_token) {
break;
}
cout << llama_token_to_str(context, predict_token);
n_past += tokens.size() - n_past;
tokens.push_back(predict_token);
}
llama_free(context);
return 0;
}
Output:
llama.cpp: loading model from D:\my_files\llama7b\ggml-model-q4_0.bin
llama_model_load_internal: format = ggjt v1 (latest)
llama_model_load_internal: n_vocab = 49954
llama_model_load_internal: n_ctx = 512
llama_model_load_internal: n_embd = 4096
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 32
llama_model_load_internal: n_layer = 32
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 2 (mostly Q4_0)
llama_model_load_internal: n_ff = 11008
llama_model_load_internal: n_parts = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 68.20 KB
llama_model_load_internal: mem required = 5897.00 MB (+ 1026.00 MB per state)
llama_init_from_file: kv self size = 256.00 MB
Jane: I'm not sure I can trust my boyfriend anymore.
Michael: Why not?
Jane: Well, I think he's been hiding something from me.
Michael: What kind of something?
Jane: I don't know. I just have a feeling that something is wrong.
Michael: Well, I think you should talk to him about it.
Jane: Yeah, I guess I will. Thanks, Michael.
Michael: No problem.
Together with the prompt, the whole conversation reads:
Jane: Hey, Michael, I've got a problem I need your help with.
Michael: Sure, what's the problem?
Jane: I'm not sure I can trust my boyfriend anymore.
Michael: Why not?
Jane: Well, I think he's been hiding something from me.
Michael: What kind of something?
Jane: I don't know. I just have a feeling that something is wrong.
Michael: Well, I think you should talk to him about it.
Jane: Yeah, I guess I will. Thanks, Michael.
Michael: No problem.
llama_eval accepts only a limited number of tokens. Once llama_n_ctx() tokens have been reached, earlier tokens have to be truncated and discarded in some way.
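One possible truncation scheme, sketched here under my own assumptions, keeps the first n_keep tokens (for example BOS and the initial instructions) plus the most recent half of the remaining history, and drops everything in between. The function name truncate_context is hypothetical, tokens are plain int values, and I have not checked that this matches what main.cpp does.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// When the history no longer fits into n_ctx, keep the first n_keep tokens
// plus the most recent half of the remainder, and drop the rest.
std::vector<int> truncate_context(const std::vector<int>& tokens,
                                  std::size_t n_ctx, std::size_t n_keep) {
    assert(n_keep <= n_ctx);
    if (tokens.size() < n_ctx) return tokens;  // still room, nothing to do

    if (n_keep > tokens.size()) n_keep = tokens.size();
    const std::size_t n_left = tokens.size() - n_keep;  // tokens after the kept prefix
    const std::size_t n_tail = n_left / 2;              // most recent half survives

    std::vector<int> out(tokens.begin(), tokens.begin() + n_keep);
    out.insert(out.end(), tokens.end() - n_tail, tokens.end());
    return out;
}
```

After truncating this way, the surviving tail presumably has to be evaluated again with n_past reset to n_keep, because the positions of those tokens have changed.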