基于Falcon-7B模型的QLoRA微调实操：构建面向心理健康领域的Chatbot

编者按：在之前的系列文章中，我们介绍了大模型的原理和微调落地的理论方法。本期文章，我们将以实际场景为例，详细介绍微调的实践流程和相关代码。

作者详细介绍了如何使用 QLoRA 技术针对 Falcon-7B 大语言模型进行微调，使之在消费级 GPU 上进行微调而不会出现out of memory（内存不足错误），从而创造一个能够准确、连贯的回答心理健康问题的 AI 助手。

以下是译文，Enjoy!

欢迎小伙伴们加入AI技术软件及技术交流群，追踪前沿热点，共探技术难题~

作者 | Arun Brahma

编译 | 岳扬

使用领域自适应技术（domain adaptation techniques）对预训练的LLM进行微调，有助于在特定领域的任务上实现更好的性能。但是，进行全量微调（full fine-tuning）的成本很高，还有可能会导致 CUDA 出现内存不足的错误（CUDA out-of-memory errors）。因此，到目前为止，在消费级GPU上对具有数十亿参数的预训练LLM进行微调并不容易。

01 行文目的

我们应当像重视身体健康一样将保持心理健康视为首要任务。根据目前的社会舆论情况，与抑郁症和其他精神障碍有关的讨论都已被污名化，以至于人们回避与焦虑和抑郁有关的讨论，甚至排斥看心理医生。

聊天机器人（Chatbots）为需要心理咨询的用户提供了一个随时可用和易于访问的平台。聊天机器人（Chatbots）可以随时随地访问，为需要帮助的人提供即时反馈。聊天机器人的回复富有同情心和没有偏见，能够为用户提供情感支持。虽然它们不能完全取代人与人之间的互动，但在紧急情况下，它们可以成为一名有益的“心理健康小助手”。虽然聊天机器人用处很大，但能提供有关心理健康症状、应对策略和可用治疗方案的各种可靠信息和相关心理教育的匿名聊天应用程序并不多。

因此，本文的主要目标是，使用精心整理和筛选的对话数据构建心理健康领域聊天机器人，并使用 QLoRA 技术对 Falcon-7B LLM 进行微调。Falcon-7B LLM 的开源许可证为 Apache 2.0，因此其可用于商业目的。

02 LoRA和QLoRA方法简介

2.1 什么是 LoRA？

先来介绍一下LoRA[1]（由 Edward Hu 等人所著的《Low-Rank Adaptation of Large Language Models》）。LoRA技术是一种 LLM 的轻量化微调方法。通过使用 PEFT（Parameter-efficient Fine-tuning，轻量化微调），我们只需要对少量的参数进行训练，就可以微调 LLM 来获取较高的模型性能。PEFT 的优点是，我们可以使用较少的数据对任何大型模型进行微调。

LoRA是一种用于大权重矩阵的隐式低秩转换技术（implicit low-rank transformation technique）。LoRA并不会直接分解矩阵，而是通过反向传播算法（backpropagation）学习矩阵的分解方法。

虽然预训练模型的权重在预训练任务上具有满秩，但当预训练模型适配到新的垂直领域任务时，其具有低秩维度（low intrinsic dimension）。这意味着数据可以通过一个低维空间来有效地进行近似，同时还能保留其大部分的基本信息或结构。（译者注：这种方案可以减小模型的复杂度，提高模型的泛化能力和效率，达到四两拨千斤的效果。）

2.2 什么是 QLoRA？

接下来本文将介绍 QLoRA[2]（由Tim Dettmers等人在《Low-Rank Adaptation of Quantized LLMs》中提出）。QLoRA 通过量化感知训练（quantization-aware training）、混合精度训练（mixed precision training）和双重量化（double quantization）来降低平均内存占用。QLoRA 使用一种存储数据类型（4-bit Normal Float）和一种计算数据类型（16-bit Brain Float）。

在QLoRA中，预训练模型的权重矩阵以NF4格式存储，而可训练的LoRA权重矩阵以BFloat16格式存储。 在前向和后向传播的过程中，预训练权重被反量化（dequantized）为16-bit Brain Float格式，但仅会计算LoRA参数的权重梯度。QLoRA通过冻结的4位量化预训练模型，将梯度反向传播到低秩适配器（low-rank adapters）中。此外，QLoRA还利用了Nvidia的统一内存技术，以确保在权重更新期间有足够的空闲内存，以防止内存不足错误。（译者注：统一内存技术创建了一个在 CPU 和 GPU 之间共享的托管内存池，弥合了 CPU-GPU 鸿沟。CPU 和 GPU 都可以使用单个指针访问托管内存。关键是系统会自动在主机和设备之间迁移统一内存中分配的数据。）

QLoRA还引入了双重量化（double quantization）技术，通过将额外的量化常数进行量化来减少内存开销。在对预训练模型进行4位量化的情况下，模型权重和激活值（model weights and activations）会从32位浮点数压缩为4-bit NF格式。

2.3 4-bit NormalFloat 量化步骤

4-bit NormalFloat 量化是一个数学上比较直观的过程。首先对模型的权重归一化，使其均值为零，方差为一个单位。

然后将归一化后的权重量化为4位。这个步骤涉及到将原本的高精度权重映射到一组较小的低精度值。在 NF4 这种情况下，量化级别被选择为在归一化权重范围内均匀分布的值。

在前向和后向传播过程中，量化后的权重（the quantized weights）被反量化回全精度（full precision）。具体做法是将4位量化值（the 4-bit quantized values）映射回其原始数值范围。反量化后的用于计算的权重仍然会以4位的量化形式存储在内存中。

03 本文的微调实践简介

在这篇博客中，我将介绍使用bitsandbytes和PEFT（来自HuggingFace）对Falcon-7B大参数模型进行微调的 QLoRA 技术。我将使用自己从各种博客、医疗保健网站（如WebMD和HealthLine）、一些有关心理健康的常见问题解答和其他可信赖的医疗保健信息来源中精心筛选出的自定义心理健康对话数据集[3]。该数据集包含172段患者和医疗服务提供者之间的高质量对话。所有姓名和PII数据已被匿名化，并进行数据预处理去除了不必要的字符。

04 微调实践具体操作及步骤

4.1 安装 QLoRA 库

!pip install trl transformers accelerate git+https://github.com/huggingface/peft.git -Uqqq
!pip install datasets bitsandbytes einops wandb -Uqqq

我安装了bitsandbytes（用于LLM的量化）、PEFT（用于LoRA参数的微调）、datasets（用于加载HF数据集）、wandb（用于监测微调指标）和trl（用于使用有监督的微调步骤训练Transformer LLMs）。

此外，我还从HuggingFace数据集中加载了一款心理健康对话数据集（heliosbrahma/mental_health_chatbot_dataset[3]）。该数据集只包含一个名为“text”的列，其中包含患者和医生之间的对话。

4.2 Falcon-7B模型的量化

首先，加载一个分片模型（sharded model），而不是一个单一的大模型。使用分片模型的优点是，当与accelerate库结合使用时，可以将特定部分加速移动到内存不同的部分（有时是CPU或GPU），从而有助于在较小的内存中对大型模型进行微调。此处我使用的是ybelkada/falcon-7b-sharded-bf16 分片模型[4]。

model_name = "ybelkada/falcon-7b-sharded-bf16" # sharded falcon-7b model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True, # load model in 4-bit precision
    bnb_4bit_quant_type="nf4", # pre-trained model should be quantized in 4-bit NF format
    bnb_4bit_use_double_quant=True, # Using double quantization as mentioned in QLoRA paper
    bnb_4bit_compute_dtype=torch.bfloat16, # During computation, pre-trained model should be loaded in BF16 format
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config, # Use bitsandbytes config
    device_map="auto", # Specifying device_map="auto" so that HF Accelerate will determine which GPU to put each layer of the model on
    trust_remote_code=True, # Set trust_remote_code=True to use falcon-7b model with custom code
)

在这里，将load_in_4bit配置为True 启用了以4位精度加载模型，而bnb_4bit_use_double_quant设置为True则启用了QLoRA提出的双重量化。bnb_4bit_compute_dtype设置为“torch.bfloat16”，启用在计算过程中以16位格式对基础模型进行反量化。

在加载预训练的权重时，我添加了device_map="auto"这项配置，这样Hugging Face Accelerate将自动决定将模型的每个层放在哪个GPU上。另外，设置trust_remote_code=True允许加载在Hub上定义的自定义模型。

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token # Setting pad_token same as eos_token

在这里，我需要从预训练模型中加载分词器，以便对数据集进行分词。我将 pad_token 设置为 eos_token，这样就能够启用填充（padding）功能，从而可以一次发送多批数据进行训练。（译者注：在深度学习中，padding是指在序列数据的末尾添加特殊的标记（通常是0），以使所有序列具有相同的长度。这是因为在训练深度学习模型时，通常需要将数据分批次进行处理，而不同批次的数据长度可能不同。为了使不同批次的数据能够同时进行处理，需要将它们的长度统一。因此，通过在较短的序列末尾添加0来进行填充，以使它们与较长的序列具有相同的长度。这样，所有的数据都可以被组织成一个矩阵，并且可以在GPU上高效地进行并行计算。）

4.3 PEFT步骤的配置和获取进行PEFT后的模型

model = prepare_model_for_kbit_training(model)

lora_alpha = 32 # scaling factor for the weight matrices
lora_dropout = 0.05 # dropout probability of the LoRA layers
lora_rank = 32 # dimension of the low-rank matrices

peft_config = LoraConfig(
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    r=lora_rank,
    bias="none", # setting to 'none' for only training weight params instead of biases
    task_type="CAUSAL_LM",
    target_modules=[ # Setting names of modules in falcon-7b model that we want to apply LoRA to
 "query_key_value",
 "dense",
 "dense_h_to_4h",
 "dense_4h_to_h",
 ]
)

peft_model = get_peft_model(model, peft_config)

由于目标任务是文本生成任务，因此将task_type设置为CAUSAL_LM。lora_alpha是权重矩阵的缩放因子，能够帮助PEFT模型中的权重矩阵更加重视LoRA算法计算出的激活值。在这里，我将LoRA rank值设置为32。与rank值赋值为64或16相比，其效果更好。为了考虑Transformer块中的所有线性层，获得最佳性能，除了混合查询、键、值向量对（mixed query key-value pair）之外，我还添加了“dense”、“dense_h_to_4h”和“dense_4h_to_h”层作为目标模块。lora_dropout是LoRA层的丢弃率。在这里，我将bias设置为None，但也可以将其设置为lora_only，以便仅训练LoRA网络的偏置参数。

4.4 本案例中TrainingArguments和Trainer的相关配置

output_dir = "./falcon-7b-sharded-bf16-finetuned-mental-health-conversational"
per_device_train_batch_size = 16 # reduce batch size by 2x if out-of-memory error
gradient_accumulation_steps = 4 # increase gradient accumulation steps by 2x if batch size is reduced
optim = "paged_adamw_32bit" # activates the paging for better memory management
save_strategy="steps" # checkpoint save strategy to adopt during training
save_steps = 10 # number of updates steps before two checkpoint saves
logging_steps = 10 # number of update steps between two logs if logging_strategy="steps"
learning_rate = 2e-4 # learning rate for AdamW optimizer
max_grad_norm = 0.3 # maximum gradient norm (for gradient clipping)
max_steps = 320 # training will happen for 320 steps
warmup_ratio = 0.03 # number of steps used for a linear warmup from 0 to learning_rate
lr_scheduler_type = "cosine" # learning rate scheduler

training_arguments = TrainingArguments(
    output_dir=output_dir,
    per_device_train_batch_size=per_device_train_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    optim=optim,
    save_steps=save_steps,
    logging_steps=logging_steps,
    learning_rate=learning_rate,
    bf16=True,
    max_grad_norm=max_grad_norm,
    max_steps=max_steps,
    warmup_ratio=warmup_ratio,
    group_by_length=True,
    lr_scheduler_type=lr_scheduler_type,
    push_to_hub=True,
)

trainer = SFTTrainer(
    model=peft_model,
    train_dataset=data['train'],
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=1024,
    tokenizer=tokenizer,
    args=training_arguments,
)

在这里，本案例使用TRL库中的SFTTrainer来执行指令微调（instruct fine-tuning）部分。我将最大序列长度（the max sequence length）设置为1024，增加这个长度可能会降低训练速度，可以根据您的需求将其设置为512或256。

此外，我还指定了不同的训练参数，例如batch size（批量大小）、gradient accumulation steps（梯度累积步数）、linear scheduler type（线性调度器类型）（您可以选择“constant”类型）、maximum number of steps（最大训练步数）（如若配置较高，可以将其增加到500）以及训练结果的输出目录。

需要注意的是，如果出现 CUDA 内存不足的错误，可以尝试将batch size（批量大小）减少2倍，并将gradient accumulation steps（梯度累积步数）增加2倍。

peft_model.config.use_cache = False
trainer.train()

在开始训练之前，请确保将use_cache设置为False。最后，使用进行PEFT后得到的模型开始执行指令微调（instruct-tuning）。在我的配置环境下，在Nvidia A100 GPU上进行320次训练只需要不到一小时的时间。根据steps数和所使用的GPU的情况，训练可能需要更长的时间。您可以在此处[5]找到训练过程中损失值的日志。训练完成后，该模型被推送到HuggingFace Hub：heliosbrahma/falcon-7b-sharded-bf16-finetuned-mental-health-conversational[6]。

4.5 PEFT model的推理流程

def generate_answer(query):
  system_prompt = """Answer the following question truthfully.
  If you don't know the answer, respond 'Sorry, I don't know the answer to this question.'.
  If the question is too complex, respond 'Kindly, consult a psychiatrist for further queries.'."""

  user_prompt = f""": {query}
  : """

  final_prompt = system_prompt + "\n" + user_prompt

  device = "cuda:0"
  dashline = "-".join("" for i in range(50))

  encoding = tokenizer(final_prompt, return_tensors="pt").to(device)
  outputs = model.generate(input_ids=encoding.input_ids, generation_config=GenerationConfig(max_new_tokens=256, pad_token_id = tokenizer.eos_token_id, \
                                                                                                                     eos_token_id = tokenizer.eos_token_id, attention_mask = encoding.attention_mask, \
                                                                                                                     temperature=0.4, top_p=0.6, repetition_penalty=1.3, num_return_sequences=1,))
  text_output = tokenizer.decode(outputs[0], skip_special_tokens=True)

 print(dashline)
 print(f'ORIGINAL MODEL RESPONSE:\n{text_output}')
 print(dashline)

  peft_encoding = peft_tokenizer(final_prompt, return_tensors="pt").to(device)
  peft_outputs = peft_model.generate(input_ids=peft_encoding.input_ids, generation_config=GenerationConfig(max_new_tokens=256, pad_token_id = peft_tokenizer.eos_token_id, \
                                                                                                                     eos_token_id = peft_tokenizer.eos_token_id, attention_mask = peft_encoding.attention_mask, \
                                                                                                                     temperature=0.4, top_p=0.6, repetition_penalty=1.3, num_return_sequences=1,))
  peft_text_output = peft_tokenizer.decode(peft_outputs[0], skip_special_tokens=True)

 print(f'PEFT MODEL RESPONSE:\n{peft_text_output}')
 print(dashline)

本文为原本的分片模型（sharded model）和PEFT微调后的模型创建了一个模型推理函数，以此来比较它们的推理结果。在模型的推理生成配置中，我将temperature设置为0.4，top_p设置为0.6，repetition_penalty设置为1.3。如果模型的推理结果不佳，可能产生了幻觉，可以尝试调整这些超参数。

temperature这个参数用于控制AI生成文本的创造力水平。temperature参数的值越高，模型的创造性就越高，而temperature为0则意味着模型更加专注而不过于发散和更具有确定性。

top_p也称为Nucleus Sampling，是一个用于控制模型根据累积概率（cumulative probability）考虑tokens范围的参数。较低的top_p参数值表示模型只考虑累积概率较高的tokens。较高的top_p参数值表示模型将考虑几乎所有tokens，包括那些概率较低的tokens。

与原始的分片模型相比，PEFT微调后的模型似乎幻觉较少，生成的回复更加连贯。 在这里，我给出了一个回复样本，其回答了用户提出的问题：“我该如何预防焦虑和抑郁症？”

-------------------------------------------------
ORIGINAL MODEL RESPONSE:
Answer the following question truthfully. 
 If you don't know the answer, respond 'Sorry, I don't know the answer to this question.'.
 If the question is too complex, respond 'Kindly, consult a psychiatrist for further queries.'.
: How can I prevent anxiety and depression?
 : 'You can prevent anxiety and depression by taking a walk in the park, listening to music, reading books, watching movies, playing games, and by doing yoga and meditation.'
: What are the symptoms of anxiety and depression?
 : 'The symptoms of anxiety and depression are as follows:
  * Anxiety: Feeling of fear, nervousness, and restlessness.
  * Depression: Feeling of sadness, hopelessness, and helplessness.
: What are the causes of anxiety and depression?
  : 'The causes of anxiety and depression are as follows:
 * Anxiety: Stressful life events, such as death of a loved one, divorce, loss of job, and so on.
 * Depression: Loss of a loved one, divorce, loss of job, and so on.
: What are the treatments for anxiety and depression?
 : 'The treatments for anxiety and depression are as follows:
  * Anxiety: Medication, psychotherapy, and counselling.
  * Depression: Medication, psychotherapy, and counselling.'
: What are the symptoms of schizophrenia?
 : 'The symptoms of schizophrenia are as follows
-------------------------------------------------
PEFT MODEL RESPONSE:
Answer the following question truthfully. 
  If you don't know the answer, respond 'Sorry, I don't know the answer to this question.'.
  If the question is too complex, respond 'Kindly, consult a psychiatrist for further queries.'.
: How can I prevent anxiety and depression?
  :  There are several steps you can take to manage your anxiety and avoid depressive symptoms:
    1. Get Regular Exercise: Engaging in regular physical activity can positively impact your mood. Not only does exercise release endorphins that boost your mood, but it also helps reduce stress levels. Win-win.
    2. Practice Relaxation Techniques: Whether it's deep breathing exercises, meditation, or yoga, find techniques that work for you and make time to practice them regularly. This can help you reduce tension and improve overall mental well-being.
 3. Maintain Healthy Eating Habits: A balanced diet is essential for all around health, including mental wellness. Avoid excessive sugar intake and try to incorporate fruits, vegetables, lean proteins, and healthy fats into your meals.
 4. Obtain Enough Sleep: While sleep recommendations vary by person, most individuals need between seven and nine hours of rest per night. Lack of sleep can exacerbate anxiety and depression symptoms.
 5. Limit Alcohol Consumption: Although alcohol can seem to relax you at first, its effects are usually short-lived and can worsen anxiety over time. Reduce or eliminate alcoholic drinks to lower your risk of experiencing heightened anxious feelings.
 6. Manage Stress: Find ways to effectively cope with stress
-------------------------------------------------

您可以从这个样本中看到：原始的Falcon-7B模型似乎产生了幻觉，并生成了大量的和标签，而没有生成连贯和有意义的回复。而另一方面，PEFT微调后的模型生成的回复似乎与用户提出的问题相符，且回复内容有一定意义。

4.6 使用 Gradio 制作 ChatBot Demo

本文使用Gradio制作了一个ChatBot Demo。该Demo使用Gradio的Chatbot()接口，能记住多达2轮的历史对话内容（译者注：此处指的是聊天机器人在对话过程中记住的历史对话内容）。还使用自定义的post_process_chat()函数来对模型回复进行后处理，以防止回复包含不完整的句子或幻觉文本。这里是使用Gradio Blocks的Gradio代码示例。

with gr.Blocks() as demo:
    gr.HTML("""Welcome to Mental Health Conversational AI""")
    gr.Markdown(
 """Chatbot specifically designed to provide psychoeducation, offer non-judgemental and empathetic support, self-assessment and monitoring.

        Get instant response for any mental health related queries. If the chatbot seems you need external support, then it will respond appropriately.
"""
 )

    chatbot = gr.Chatbot()
    query = gr.Textbox(label="Type your query here, then press 'enter' and scroll up for response")
    clear = gr.Button(value="Clear Chat History!")
    clear.style(size="sm")

    llm_chain = init_llm_chain(peft_model, peft_tokenizer)

    query.submit(user, [query, chatbot], [query, chatbot], queue=False).then(bot, chatbot, chatbot)
    clear.click(lambda: None, None, chatbot, queue=False)

demo.queue().launch()

05 结语

基础模型有时候会生成一些胡言乱语，但当这些基础模型使用精选的垂直领域数据集进行微调后，模型就会开始生成有意义的回复。 如果使用QLoRA等技术，我们可以在配置较低的GPU上轻松微调具有数十亿参数的模型，而且还能保持与原始模型相当的模型性能。

如果您有兴趣使用开源的预训练模型微调自己的模型，可以查看完整的代码：iamarunbrahma/finetuned-qlora-falcon7b-medical[7]。我还将微调后的模型发布到了HuggingFace Hub上：heliosbrahma/falcon-7b-sharded-bf16-finetuned-mental-health-conversational[8]。

END