Introduction
RLHF (Reinforcement Learning from Human Feedback) is a technique for fine-tuning large language models with reinforcement learning: a reward model trained on human preference data scores the model's outputs, and the policy is then optimized against that reward. The complete RLHF pipeline for reinforcement-learning fine-tuning is walked through below.
P.S.: The difference from LoRA fine-tuning is that RLHF adds the reinforcement-learning stages on top; LoRA fine-tuning roughly corresponds to the SFT of RLHF Stage 1.
Reference material: "How to view Geoffrey Hinton's opinion of RLHF?" (Zhihu); "[Explainer] The technology behind ChatGPT: what is RLHF (reinforcement learning from human feedback)?" (Bilibili)
Frameworks
A comparison of three frameworks:
"A hands-on comparison of the common RLHF frameworks (trlx, deepspeedchat, colossalaichat)" (Zhihu)
Practice
This walkthrough trains step by step with the ColossalAI framework (the TP strategy is not yet supported; the DP strategy is supported).
Official training guide: https://github.com/hpcaitech/ColossalAI/tree/main/applications/Chat#rlhf-training-stage3---training-model-with-reinforcement-learning-by-human-feedback
Conda environment: conda activate coati
RLHF Training Stage1 - Supervised instructs tuning
Data preparation: https://huggingface.co/datasets/yizhongw/self_instruct/viewer/super_natural_instructions/train
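To check what this dataset looks like before training, here is a minimal sketch using the Hugging Face datasets library, assuming the super_natural_instructions config named in the URL above; the exact column names may differ, so the sketch simply prints them.

from datasets import load_dataset

# Script-based datasets may need trust_remote_code=True on recent versions of the datasets library.
ds = load_dataset("yizhongw/self_instruct", "super_natural_instructions", split="train")

print(ds.column_names)   # field names holding the instructions and responses
print(ds[0])             # one raw training example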
train_sft.sh: shell script that runs the supervised fine-tuning
CUDA_VISIBLE_DEVICES=0 torchrun --standalone --nproc_per_node=1 train_sft.py \
--pretrain "/data/jupyter/LLM/models/llama-7b-hf/" \
--model 'llama' \
--strategy colossalai_zero2 \
--log_interval 10 \
--save_path /data/jupyter/your_production/ColossalAI/applications/Chat/models/sft-7b \
--dataset "yizhongw/self_instruct" \
--batch_size 1 \
--accumulation_steps 8 \
--lr 2e-5 \
--max_datasets_size 512 \
--max_epochs 1 \
--lora_rank 1
Flag notes (kept out of the command itself, since a trailing # comment after a line-continuation backslash breaks the shell command): --pretrain is the base model to fine-tune; --strategy is the fine-tuning strategy; --save_path is where the SFT model is saved; --dataset is the Hugging Face dataset.
P.S.: train_sft.py is the training entry point; launch it with ./train_sft.sh.
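Conceptually, Stage 1 is ordinary supervised next-token fine-tuning: the prompt and response are concatenated, and cross-entropy is taken over the response tokens. Below is a minimal sketch of one SFT step with the transformers library, reusing the llama-7b-hf path from the script above; the prompt template and the -100 label masking are common conventions assumed here, not necessarily the exact preprocessing in train_sft.py.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "/data/jupyter/LLM/models/llama-7b-hf/"   # base model from train_sft.sh
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.float16).cuda()

prompt = "Instruction: Give three tips for staying healthy.\nResponse: "   # hypothetical example
response = "1. Eat a balanced diet. 2. Exercise regularly. 3. Get enough sleep."

prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
full_ids = tokenizer(prompt + response, return_tensors="pt").input_ids
labels = full_ids.clone()
labels[:, : prompt_ids.shape[1]] = -100   # ignore the prompt tokens in the loss

out = model(input_ids=full_ids.cuda(), labels=labels.cuda())
out.loss.backward()   # one supervised step; the real script adds the optimizer, LoRA, ZeRO-2, etc.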
RLHF Training Stage2 - Training reward model
Data preparation: https://huggingface.co/datasets/Anthropic/hh-rlhf/viewer/Anthropic--hh-rlhf/train?row=1
train_rm.sh: shell script that trains the reward model
torchrun --standalone --nproc_per_node=1 train_reward_model.py \
--pretrain "/data/jupyter/your_production/ColossalAI/applications/Chat/models/sft-7b" \
--model 'llama' \
--strategy colossalai_gemini \
--loss_fn 'log_exp' \
--save_path /data/jupyter/your_production/ColossalAI/applications/Chat/models/rmstatic.pt \
--dataset 'Anthropic/hh-rlhf' \
--lora_rank 1 \
--batch_size 1 \
--max_len 128
Flag notes: --pretrain is the SFT model saved in Stage 1; --strategy is the training strategy (only colossalai_gemini works here; with the other strategies a single RTX 3090 with 24 GB ran out of memory in testing); --save_path is the model save path (only the model weights, i.e. a state dict, are written); --dataset is the Hugging Face dataset.
P.S.: train_reward_model.py is the training entry point.
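The hh-rlhf data provides chosen/rejected response pairs, and the 'log_exp' loss is a pairwise ranking loss of the form log(1 + exp(r_rejected - r_chosen)), i.e. -log sigmoid(r_chosen - r_rejected): it pushes the reward of the chosen response above that of the rejected one. A minimal PyTorch sketch of that loss (a generic illustration, not ColossalAI's exact implementation):

import torch
import torch.nn.functional as F

def log_exp_loss(chosen_reward: torch.Tensor, reject_reward: torch.Tensor) -> torch.Tensor:
    # Pairwise ranking loss: log(1 + exp(r_rejected - r_chosen)) == -log(sigmoid(r_chosen - r_rejected))
    return -F.logsigmoid(chosen_reward - reject_reward).mean()

# Toy scores a reward model might assign to a batch of (chosen, rejected) pairs.
chosen = torch.tensor([1.2, 0.3, 0.8])
rejected = torch.tensor([0.4, 0.5, -0.1])
print(log_exp_loss(chosen, rejected))   # small when chosen > rejected, large otherwise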
RLHF Training Stage3 - Training model with reinforcement learning by human feedback
Data preparation:
Use generate_prompt_dataset.py to generate prompt data (instructions) from your target data: https://github.com/XueFuzhao/InstructionWild/tree/main/data#instructwild-data
Reuse the Stage 1 pretrain dataset (including the instruction and corresponding response): https://huggingface.co/datasets/yizhongw/self_instruct/viewer/super_natural_instructions/train
train_prompts.sh: shell script that runs the RL fine-tuning of the language model
torchrun --standalone --nproc_per_node=2 train_prompts.py \
--pretrain "/data/jupyter/your_production/ColossalAI/applications/Chat/models/sft-7b" \
--model 'llama' \
--strategy colossalai_gemini \
--prompt_dataset /data/jupyter/LLM/datasets/InstructionWild/data1 \
--pretrain_dataset /data/jupyter/LLM/datasets/self_instruct \
--rm_pretrain /your/pretrain/rm/definition \
--rm_path /data/jupyter/your_production/ColossalAI/applications/Chat/models/rmstatic.pt
P.S.: In train_prompts.py the Stage 2 reward model weights are loaded roughly as follows:
state_dict = torch.load(args.rm_path, map_location='cpu')
reward_model = LlamaRM(pretrained=args.rm_pretrain)
reward_model.load_state_dict(state_dict)
However, the critic/reward model is built with LoRA enabled,
reward_model = LlamaRM(pretrained=pretrain, lora_rank=lora_rank)
and its initialization ends by converting the modules to LoRA:
self.model = model
self.value_head = value_head
self.use_action_mask = use_action_mask
self.convert_to_lora()
so the in-memory model expects LoRA parameters on the value head that rmstatic.pt does not contain, and loading reports:
_IncompatibleKeys(missing_keys=['value_head.lora_A', 'value_head.lora_B'], unexpected_keys=[])
Changing the call to critic.load_state_dict(state_dict, strict=False) works around this.
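For orientation, Stage 3 runs PPO on top of the SFT model: the actor generates responses to the prompts, the reward model scores them, a KL penalty against the frozen SFT reference model keeps the policy from drifting, and the actor is updated with a clipped surrogate objective. Below is a minimal, generic sketch of the shaped reward and the clipped policy loss; it illustrates the idea only (all names are hypothetical, and this is not ColossalAI's implementation).

import torch

def ppo_policy_loss(logprobs, old_logprobs, advantages, clip_eps=0.2):
    # Clipped surrogate objective from PPO.
    ratio = torch.exp(logprobs - old_logprobs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()

def shaped_reward(rm_score, logprobs, ref_logprobs, kl_coef=0.1):
    # Reward-model score minus a KL penalty that keeps the actor close to the SFT reference.
    kl = logprobs - ref_logprobs
    return rm_score - kl_coef * kl

# Toy tensors standing in for one batch of generated response tokens.
logprobs = torch.tensor([-1.0, -0.8, -1.2])       # actor log-probs of the sampled tokens
old_logprobs = torch.tensor([-1.1, -0.9, -1.0])   # log-probs at sampling time
ref_logprobs = torch.tensor([-1.0, -1.0, -1.1])   # frozen SFT reference model
advantages = torch.tensor([0.5, -0.2, 0.1])       # e.g. from GAE over the shaped rewards
print(shaped_reward(torch.tensor(0.7), logprobs, ref_logprobs))
print(ppo_policy_loss(logprobs, old_logprobs, advantages))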