Reinforced Learning  |  June 3, 2023

Direct Preference Optimization: A New Approach to Fine-Tuning AI Models

AI has become an integral part of many organizations, driving innovation and efficiency. However, aligning AI models with human preferences remains a challenge. This is where Direct Preference Optimization (DPO) comes in.

What is Direct Preference Optimization (DPO)?

DPO is a method for fine-tuning large unsupervised language models (LMs) so that their outputs align with human preferences. Compared with existing reinforcement-learning-based methods, DPO is computationally lightweight and stable: it does not require fitting a separate reward model, sampling from the LM during fine-tuning, or significant hyperparameter tuning.

How Does DPO Work?

DPO directly optimizes for the policy that best satisfies the preferences, using a simple classification objective over pairs of responses, with no explicit reward function and no reinforcement learning. Each DPO update increases the log probability of the preferred response relative to the dispreferred one. It also applies a dynamic, per-example importance weight: pairs that the model currently ranks in the wrong order receive a larger weight, and pairs it already ranks correctly receive a smaller one. This weighting prevents the model degeneration that occurs with a naive probability ratio objective.
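To make the classification objective concrete, here is a minimal sketch of the per-example DPO loss in plain Python. It assumes you already have the summed log probability of each response under the model being trained and under a frozen reference copy of it. The function and argument names, and the default beta = 0.1, are illustrative choices for this post and not part of any particular library.

```python
import math

def log_sigmoid(x: float) -> float:
    # Numerically stable log(sigmoid(x)).
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def dpo_loss(policy_logp_chosen: float, policy_logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> tuple[float, float]:
    """Return (loss, weight) for one preference pair.

    All names here are hypothetical. Each logp argument is the summed
    log probability of a full response given the prompt.
    """
    # How much more (or less) likely each response is under the policy
    # than under the frozen reference model.
    chosen_logratio = policy_logp_chosen - ref_logp_chosen
    rejected_logratio = policy_logp_rejected - ref_logp_rejected

    # Scaled margin between preferred and dispreferred responses.
    margin = beta * (chosen_logratio - rejected_logratio)

    # Binary classification loss: push the margin to be positive.
    loss = -log_sigmoid(margin)

    # Per-example weight on the gradient, sigmoid(-margin): close to 1
    # when the pair is ranked wrongly, close to 0 when ranked correctly.
    weight = math.exp(log_sigmoid(-margin))
    return loss, weight

# Policy identical to the reference: no preference learned yet.
loss_start, weight_start = dpo_loss(-10.0, -12.0, -10.0, -12.0)

# Policy has shifted probability toward the preferred response.
loss_later, weight_later = dpo_loss(-9.0, -13.0, -10.0, -12.0)
```

In this sketch the loss falls and the weight shrinks as the policy moves probability toward the preferred response, which matches the behavior described above. A real training loop would compute the same quantities on batches with automatic differentiation.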


Reinforcement Learning from Human Feedback (RLHF) is a common method used to align AI models with human preferences. However, RLHF is a complex and often unstable procedure with two stages. It first fits a reward model that reflects the human preferences. It then fine-tunes the large unsupervised LM with reinforcement learning to maximize this estimated reward without drifting too far from the original model. DPO collapses these two stages into one: it optimizes the language model directly on the preference data, without explicit reward modeling or reinforcement learning.
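For readers who prefer formulas, the two procedures can be written side by side. The notation is the one commonly used for these objectives and is not defined elsewhere in this post: x is a prompt, y_w and y_l are the preferred and dispreferred responses, r_phi is the learned reward model, pi_theta is the model being trained, pi_ref is the original (reference) model, sigma is the logistic function, and beta controls how far the model may drift from the reference.

```latex
% RLHF, stage 1: fit a reward model on preference pairs
\[
\mathcal{L}_R(r_\phi) =
  -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}
  \left[\log \sigma\!\left(r_\phi(x, y_w) - r_\phi(x, y_l)\right)\right]
\]

% RLHF, stage 2: maximize the estimated reward, penalizing drift from the reference
\[
\max_{\pi_\theta}\;
  \mathbb{E}_{x\sim\mathcal{D},\; y\sim\pi_\theta(y\mid x)}
  \left[r_\phi(x, y)\right]
  \;-\; \beta\, \mathbb{D}_{\mathrm{KL}}\!\left[\pi_\theta(y\mid x)\,\|\,\pi_{\mathrm{ref}}(y\mid x)\right]
\]

% DPO: a single classification loss on the same preference pairs
\[
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\,\pi_{\mathrm{ref}}) =
  -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}
  \left[\log \sigma\!\left(
    \beta \log \frac{\pi_\theta(y_w\mid x)}{\pi_{\mathrm{ref}}(y_w\mid x)}
    - \beta \log \frac{\pi_\theta(y_l\mid x)}{\pi_{\mathrm{ref}}(y_l\mid x)}
  \right)\right]
\]
```

The DPO loss has the same form as the stage 1 reward loss, with the scaled log ratio between the trained model and the reference model taking the place of the reward. That is why no separate reward model and no sampling from the LM are needed.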

Practical Examples of DPO

Let's consider a practical example of a customer service chatbot. The training data consists of prompts paired with two candidate replies, one that human raters preferred and one that they did not. Raters might prefer replies that give accurate information, use polite language, and get to the point quickly. The DPO update increases the relative log probability of these preferred responses, so the fine-tuned chatbot becomes more likely to produce replies of the kind raters chose.
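As a rough illustration of what such preference data could look like for the chatbot, here is a small Python sketch. The PreferencePair class, its field names, the valid_pairs helper, and the example texts are all made up for this post.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PreferencePair:
    # Hypothetical record layout for one human comparison.
    prompt: str    # the customer's message
    chosen: str    # the reply raters preferred
    rejected: str  # the reply raters dispreferred

def valid_pairs(pairs: list[PreferencePair]) -> list[PreferencePair]:
    """Keep only pairs that carry a usable preference signal."""
    return [
        p for p in pairs
        if p.prompt.strip() and p.chosen.strip() and p.chosen != p.rejected
    ]

pairs = [
    PreferencePair(
        prompt="Where is my order?",
        chosen="I'm sorry for the wait. Could you share your order number "
               "so I can check its status?",
        rejected="No idea. Check the website.",
    ),
    # Identical replies express no preference, so this pair is dropped.
    PreferencePair(
        prompt="How do I reset my password?",
        chosen="Use the reset link on the sign-in page.",
        rejected="Use the reset link on the sign-in page.",
    ),
]

usable = valid_pairs(pairs)
```

Each usable pair supplies one preferred and one dispreferred reply to the same prompt, which is the only supervision DPO needs.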

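Another example could be a content recommendation system. DPO was proposed for language models, so this is a more speculative use: it assumes the system assigns a probability to each recommendation and that preference pairs are available, for example one item a user engaged with and one they skipped. Under those assumptions, the system could be fine-tuned to favor content that is relevant, interesting, and in line with the user's past behavior, making its recommendations more personalized.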


DPO is a simpler and computationally lighter alternative to RLHF for fine-tuning AI models to align with human preferences. Because it needs only preference pairs and a standard classification-style training loop, it is a promising option for organizations that want to adapt language models to what their users actually prefer.