References
- Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs. Arash Ahmadian, Chris Cremer, Matthias Gallé, Marzieh Fadaee, Julia Kreutzer, Olivier Pietquin, et al. 2024.
- DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models. Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, et al. 2024.
- DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. DeepSeek-AI, Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, et al. 2025.
- Reinforcement Learning for Long-Horizon Interactive LLM Agents. Kevin Chen, Marco Cusumano-Towner, Brody Huval, Aleksei Petrenko, Jackson Hamburger, et al. 2025.
- Buy 4 REINFORCE Samples, Get a Baseline for Free! Wouter Kool, Herke van Hoof, Max Welling. 2019.