Contents
PPO (Proximal Policy Optimization)
How it works
Implementation steps
DPO (Direct Preference Optimization)
How it works
Implementation steps
Similarities
Differences
In summary, PPO focuses on stabilizing policy updates by clipping the probability ratio between the new and old policies, whereas DPO dispenses with an explicit reward model and reinforcement-learning loop altogether, optimizing the policy directly on preference data with a simple classification loss.
In summary, PPO and DPO both address the standard RLHF problem of aligning a model with human preferences, but they differ in how they solve it: PPO trains a separate reward model and optimizes the policy with reinforcement learning, sampling from the language model during fine-tuning, while DPO works directly from preference pairs, making it substantially simpler to implement and train.
Proximal Policy Optimization (PPO) is a reinforcement learning algorithm developed by OpenAI. It is designed to perform comparably to or better than state-of-the-art approaches while being much simpler to implement and tune. PPO has become a default reinforcement learning algorithm at OpenAI due to its ease of use and good performance.
PPO works by trying to compute an update at each step that minimizes the cost function while ensuring the deviation from the previous policy is relatively small. It uses a novel objective function that enables multiple epochs of minibatch updates. The objective function is expressed as:
$$
L^{\mathrm{CLIP}}(\theta) = \hat{\mathbb{E}}_t\!\left[ \min\!\left( r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}\!\left(r_t(\theta),\, 1-\varepsilon,\, 1+\varepsilon\right)\hat{A}_t \right) \right]
$$
where:

- $r_t(\theta) = \dfrac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}$ is the probability ratio between the current policy and the policy that collected the data,
- $\hat{A}_t$ is an estimate of the advantage at timestep $t$,
- $\varepsilon$ is a small hyperparameter that limits how far the ratio is allowed to move away from 1.
The clipped objective gives PPO a trust-region-style update that is compatible with stochastic gradient descent, while simplifying the algorithm by removing the KL penalty and the need to make adaptive updates.
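For concreteness, here is a minimal PyTorch sketch of the clipped surrogate objective above; the function name, the assumption that per-action log-probabilities and advantages arrive as flat tensors, and the default $\varepsilon = 0.2$ are illustrative choices rather than part of OpenAI's release.

```python
import torch

def ppo_clipped_loss(new_logps, old_logps, advantages, clip_eps=0.2):
    """Clipped surrogate objective L^CLIP (sketch).

    new_logps:   log pi_theta(a_t | s_t) under the current policy (with grad).
    old_logps:   log-probs under the policy that collected the data (detached).
    advantages:  advantage estimates A_hat_t, one per timestep.
    clip_eps:    the clipping parameter epsilon.
    """
    # r_t(theta) = pi_theta / pi_theta_old, computed in log space for stability.
    ratio = torch.exp(new_logps - old_logps)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Elementwise minimum, averaged over the minibatch; negated because
    # optimizers minimize while the paper maximizes this objective.
    return -torch.min(unclipped, clipped).mean()
```

Taking the minimum makes the surrogate pessimistic: the update gets no extra credit for pushing the ratio beyond $1 \pm \varepsilon$, which is what keeps the new policy close to the old one.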
OpenAI's reference implementation is written in Python 3 with TensorFlow and includes scalable, parallel implementations of PPO and TRPO (Trust Region Policy Optimization), both of which use MPI for data passing. OpenAI has also released a GPU-enabled implementation called PPO2, which runs approximately 3x faster than the current PPO baseline on Atari games.
Direct Preference Optimization (DPO) is introduced as a new parameterization of the reward model in Reinforcement Learning from Human Feedback (RLHF) that enables extraction of the corresponding optimal policy in closed form. It solves the standard RLHF problem with a simple classification loss, eliminating the need for sampling from the Language Model (LM) during fine-tuning or performing significant hyperparameter tuning.
DPO is stable, performant, and computationally lightweight. It fine-tunes Language Models (LMs) to align with human preferences effectively. Notably, DPO exceeds PPO-based RLHF in controlling the sentiment of generations and matches or improves response quality in summarization and single-turn dialogue tasks while being substantially simpler to implement and train.
DPO operates by directly optimizing a policy model $\pi_\theta$ using preference data, without the need for an explicit reward model. The DPO loss is computed as follows:
$$
\mathcal{L}_{\mathrm{DPO}} = -\log \sigma\!\left( \beta \left( \log \pi_\theta(y_w \mid x) - \log \pi_\theta(y_l \mid x) \right) \right)
$$
where:

- $x$ is the prompt, and $y_w$ and $y_l$ are the preferred and less preferred responses in a preference pair,
- $\pi_\theta(y \mid x)$ is the probability the policy being trained assigns to response $y$ given $x$,
- $\beta$ is a temperature hyperparameter that controls how strongly preferences are enforced,
- $\sigma$ is the logistic (sigmoid) function.
DPO updates aim to increase the relative log probability of preferred responses over less preferred ones, incorporating a dynamic, per-sample importance weight to prevent model degradation.
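As a minimal sketch of the loss exactly as written above (this simplified form omits the frozen reference policy that the full DPO objective from the paper also conditions on), the following PyTorch snippet assumes per-example summed log-probabilities for the preferred and less preferred responses; the function and argument names and the default $\beta$ are illustrative.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_logps_chosen, policy_logps_rejected, beta=0.1):
    """Simplified DPO loss: -log sigma(beta * (log p(y_w|x) - log p(y_l|x))).

    policy_logps_chosen:   sum of log pi_theta over tokens of y_w, per example.
    policy_logps_rejected: sum of log pi_theta over tokens of y_l, per example.
    beta:                  temperature controlling preference strength.
    """
    margin = policy_logps_chosen - policy_logps_rejected
    # logsigmoid is the numerically stable form of log(sigmoid(.)).
    return -F.logsigmoid(beta * margin).mean()

# Variant with a frozen reference policy, as in the DPO paper: replace `margin`
# with (policy_logps_chosen - ref_logps_chosen)
#      - (policy_logps_rejected - ref_logps_rejected).
```

The gradient of this loss scales each example by $\sigma(-\beta \cdot \text{margin})$, which is the dynamic, per-sample weight mentioned above: pairs the model already orders correctly contribute little, while mis-ordered pairs are weighted more heavily.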