Dpo

Slides
Video Lecture

References

  1. Training language models to follow instructions with human feedbackLong Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, etal.2022
  2. Direct Preference Optimization: Your Language Model is Secretly a Reward ModelRafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, Chelsea Finn2023