DPO
Slides
Video Lecture
References
Training language models to follow instructions with human feedback. Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, et al. 2022.
Direct Preference Optimization: Your Language Model is Secretly a Reward Model. Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, Chelsea Finn. 2023.