Reinforcement Fine-Tuning (RFT) goes beyond traditional supervised fine-tuning by training models against verifiable rewards. It fits tasks where correctness can be checked programmatically, such as code execution, math problems, or structured data extraction.
Graders are the core innovation: each one scores a model output, and that score is the reward signal used during training. They can be simple (exact string match), moderate (unit test execution), or complex (LLM-as-judge with a rubric). OpenAI provides built-in grader templates:
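```typescript
// RFT training configuration
const job = await openai.fineTuning.jobs.create({
  training_file: "file-prompts123",
  model: "gpt-5.4-mini",
  method: {
    type: "reinforcement",
    reinforcement: {
      grader: {
        type: "python", // Programmatic grader
        name: "exact_match",
        // grade(sample, item): `sample` holds the model output, `item` the training row.
        // The `expected` field is an assumption about how the training file is laid out.
        source: "def grade(sample, item):\n    return 1.0 if sample['output_text'].strip() == item['expected'].strip() else 0.0"
      }
    }
  }
});
```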
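To make the grader tiers described above more concrete, here is a minimal local sketch of the first two: an exact-match grader and a pass-rate grader that mimics unit test execution. The names `Grader`, `TestCase`, `exactMatch`, `passRate`, and `checks` are hypothetical and chosen for illustration; this is not how the hosted graders are implemented, and the LLM-as-judge tier is left out because it needs a model call.

```typescript
// Hypothetical local graders: each maps a model output to a score in [0, 1].
type Grader = (output: string, expected: string) => number;

// Simple tier: exact string match after trimming whitespace.
const exactMatch: Grader = (output, expected) =>
  output.trim() === expected.trim() ? 1.0 : 0.0;

// Moderate tier: fraction of checks passed, mirroring unit test execution.
type TestCase = (output: string) => boolean;

function passRate(output: string, tests: TestCase[]): number {
  if (tests.length === 0) return 0.0;
  const passed = tests.filter((t) => {
    try {
      return t(output);
    } catch {
      return false; // a check that throws counts as a failure
    }
  }).length;
  return passed / tests.length;
}

// Example: grade a JSON extraction output with three structural checks.
const checks: TestCase[] = [
  (o) => typeof JSON.parse(o) === "object",
  (o) => JSON.parse(o).name === "Ada",
  (o) => Number.isInteger(JSON.parse(o).age),
];

// The age is a string rather than an integer, so two of three checks pass.
const score = passRate('{"name": "Ada", "age": "36"}', checks);
```

A fractional score like this gives partial credit, which tends to provide a denser training signal than an all-or-nothing match when outputs are only partly correct.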
Direct Preference Optimization (DPO) is a complementary technique for preference alignment: it teaches the model which of two outputs humans prefer. Instead of a scalar reward, you provide pairs of (chosen, rejected) responses to the same prompt. DPO adjusts the model to increase the likelihood of chosen outputs relative to rejected ones, without needing a separate reward model.
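The adjustment DPO makes can be illustrated with its per-pair loss, which compares how far the trained policy has moved from a frozen reference model on the chosen response versus the rejected one. The sketch below assumes the summed log-probabilities of each response are already available; `PreferencePair`, `logSigmoid`, `dpoLoss`, and the `beta = 0.1` default are illustrative choices, not part of any OpenAI API.

```typescript
// Log-probabilities of each response under the trained policy
// and under a frozen reference model (hypothetical field names).
interface PreferencePair {
  policyChosen: number;
  policyRejected: number;
  refChosen: number;
  refRejected: number;
}

// Numerically stable log(sigmoid(x)).
function logSigmoid(x: number): number {
  return x >= 0 ? -Math.log1p(Math.exp(-x)) : x - Math.log1p(Math.exp(x));
}

// DPO loss for one pair: -log sigmoid(beta * (chosen margin - rejected margin)),
// where each margin is the policy log-prob minus the reference log-prob.
function dpoLoss(p: PreferencePair, beta = 0.1): number {
  const chosenMargin = p.policyChosen - p.refChosen;
  const rejectedMargin = p.policyRejected - p.refRejected;
  return -logSigmoid(beta * (chosenMargin - rejectedMargin));
}

// Policy identical to the reference: both margins are 0, so the loss is log(2).
const atInit = dpoLoss({
  policyChosen: -12, policyRejected: -15, refChosen: -12, refRejected: -15,
});

// Policy has shifted probability toward the chosen response: the loss drops below log(2).
const improved = dpoLoss({
  policyChosen: -10, policyRejected: -18, refChosen: -12, refRejected: -15,
});
```

Because the loss depends only on log-probability differences against the reference model, no separate reward model has to be trained; `beta` controls how strongly the policy is pushed away from the reference.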
| Method | Best For | Data Required |
|---|---|---|
| SFT (Supervised) | Format, tone, style changes | Input → ideal output pairs |
| RFT (Reinforcement) | Verifiable tasks (code, math, extraction) | Prompts + programmable grader |
| DPO (Preference) | Subjective quality, safety alignment | Chosen/rejected response pairs |