[Paper Reading] DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

DeepSeek-AI
research@deepseek.com

Contents

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Abstract

1 Introduction

1.1 Contributions

Post-Training: Large-Scale Reinforcement Learning on the Base Model

Distillation: Smaller Models Can Be Powerful Too

1.2 Summary of Evaluation Results

2 Approach

2.1 Overview

2.2 DeepSeek-R1-Zero: Reinforcement Learning on the Base Model

2.2.1 Reinforcement Learning Algorithm

Group Relative Policy Optimization

2.2.2 Reward Modeling

2.2.3 Training Template

2.2.4 Performance, Self-evolution Process and Aha Moment of DeepSeek-R1-Zero

Performance of DeepSeek-R1-Zero

Self-evolution Process of DeepSeek-R1-Zero

Aha Moment of DeepSeek-R1-Zero

Drawback of DeepSeek-R1-Zero

2.3 DeepSeek-R1: Reinforcement Learning with Cold Start

2.3.1 Cold Start

2.3.2 Reasoning-oriented Reinforcement Learning

2.3.3 Rejection Sampling and Supervised Fine-Tuning

Reasoning data

Non-Reasoning data

2.3.4 Reinforcement Learning for all Scenarios

2.4 Distillation: Empower Small Models with Reasoning Capability

3 Experiment

Benchmarks

Evaluation Prompts

Baselines

Evaluation Setup

3.1 DeepSeek-R1 Evaluation

3.2 Distilled Model Evaluation

4 Discussion

4.1 Distillation vs. Reinforcement Learning

4.2 Unsuccessful Attempts

Process Reward Model (PRM)

Monte Carlo Tree Search (MCTS)

5 Conclusion, Limitations, and Future Work

References

Appendix
