The N+ Implementation Details of RLHF with PPO: A Case Study on TL;DR Summarization
arXiv - CS - Machine Learning. Pub Date: 2024-03-24, DOI: arxiv-2403.17031
Shengyi Huang, Michael Noukhovitch, Arian Hosseini, Kashif Rasul, Weixun Wang, Lewis Tunstall

This work is the first to openly reproduce the Reinforcement Learning from Human Feedback (RLHF) scaling behaviors reported in OpenAI's seminal TL;DR summarization work. We build an RLHF pipeline from scratch, enumerate over 20 key implementation details, and share key insights gained during the reproduction. Our RLHF-trained Pythia models demonstrate significant gains in response quality that scale with model size, with our 2.8B and 6.9B models outperforming OpenAI's released 1.3B checkpoint. We publicly release the trained model checkpoints and code to facilitate further research and accelerate progress in the field (https://github.com/vwxyzjn/summarize_from_feedback_details).
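
For readers unfamiliar with how the reward signal is typically shaped when running PPO for RLHF, the sketch below illustrates the common pattern: a per-token KL penalty against the frozen SFT reference policy, with the scalar reward-model score added at the final response token. This is a minimal illustration, not the paper's code; the function name, kl_coef value, and tensor shapes are assumptions for the example.

# Minimal sketch (assumed names/values, not taken from the paper) of the
# KL-penalized per-token reward commonly used in PPO-based RLHF.
import torch

def rlhf_rewards(rm_score, logp_policy, logp_ref, kl_coef=0.05):
    """rm_score: (batch,), logp_policy/logp_ref: (batch, response_len)."""
    kl = logp_policy - logp_ref        # per-token KL estimate vs. the SFT reference
    rewards = -kl_coef * kl            # penalize drift away from the reference policy
    rewards[:, -1] += rm_score         # add the reward-model score at the last token
    return rewards                     # per-token rewards fed to PPO advantage estimation

# Example with dummy tensors
scores = torch.tensor([1.2, -0.3])
lp_pi = torch.randn(2, 8)
lp_ref = torch.randn(2, 8)
print(rlhf_rewards(scores, lp_pi, lp_ref).shape)  # torch.Size([2, 8])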

Updated: 2024-03-27