
Reinforcement Fine-Tuning & DPO


Beyond Supervised Fine-Tuning

Reinforcement Fine-Tuning (RFT) goes beyond traditional supervised fine-tuning (SFT) by training on verifiable rewards. It targets tasks where correctness can be checked programmatically, such as code execution, math problems with checkable answers, or structured data extraction.

How RFT Works

  1. Define a Grader: Write a programmable grader that scores model outputs (pass/fail, 0-1 score, or multi-criteria rubrics).
  2. Submit Training Prompts: Provide a set of prompts rather than ideal completions to imitate. The model learns by generating candidates and receiving reward signals; any reference value stored with a prompt is used only by the grader for scoring.
  3. Reinforcement Loop: The model iteratively improves to maximize the grader's reward. Unlike SFT, which imitates supplied outputs, the model discovers its own strategies for earning a high score.
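
The three steps above can be sketched as a toy scoring pass. This is a minimal illustration and not OpenAI's training algorithm: `grade` and the hard-coded `candidates` are hypothetical stand-ins for the grader and for sampling from the model, and subtracting the group's mean reward is one common way to turn scores into a learning signal, assumed here for illustration.

```javascript
// Hypothetical grader: 1.0 for an exact match after trimming, else 0.0.
function grade(output, expected) {
  return output.trim() === expected.trim() ? 1.0 : 0.0;
}

// Score every candidate for one prompt and compute each candidate's
// advantage: its reward minus the mean reward of the group.
function scoreCandidates(candidates, expected) {
  const rewards = candidates.map((c) => grade(c, expected));
  const mean = rewards.reduce((a, b) => a + b, 0) / rewards.length;
  return candidates.map((output, i) => ({
    output,
    reward: rewards[i],
    advantage: rewards[i] - mean,
  }));
}

// Stand-in for answers sampled from the model (hypothetical values).
const candidates = ["42", " 42 ", "41", "forty-two"];
const scored = scoreCandidates(candidates, "42");
```

In a real reinforcement loop, candidates with a positive advantage would be made more likely and those with a negative advantage less likely, and the cycle would repeat over many prompts.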

Programmable Graders

Graders are the core innovation. They can be simple (exact string match), moderate (unit test execution), or complex (LLM-as-judge with rubric). OpenAI provides built-in grader templates; the configuration below instead supplies a custom programmatic grader:

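// RFT training configuration
// Sketch: grader field names follow OpenAI's Python grader format and may vary by API version.
import OpenAI from "openai";

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

const job = await openai.fineTuning.jobs.create({
  training_file: "file-prompts123",
  model: "gpt-5.4-mini",
  method: {
    type: "reinforcement",
    reinforcement: {
      grader: {
        type: "python",  // Programmatic grader, written in Python
        name: "exact_match",
        // sample holds the model's output; item is the training row.
        // Assumes each training row carries an `expected` field.
        source: "def grade(sample, item):\n  return 1.0 if sample[\"output_text\"].strip() == item[\"expected\"].strip() else 0.0"
      }
    }
  }
});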
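
The grader styles mentioned above (pass/fail, 0-1 score, multi-criteria rubric) can be prototyped locally before committing to the grader source a job would run. The following is a sketch under assumptions: the function names, the criteria, and the weights are illustrative and are not built-in templates.

```javascript
// Pass/fail: exact string match after trimming whitespace.
function exactMatch(output, expected) {
  return output.trim() === expected.trim() ? 1.0 : 0.0;
}

// 0-1 score: fraction of expected fields that the output reproduces exactly.
function fieldOverlap(outputObj, expectedObj) {
  const keys = Object.keys(expectedObj);
  if (keys.length === 0) return 1.0;
  const hits = keys.filter((k) => outputObj[k] === expectedObj[k]).length;
  return hits / keys.length;
}

// Multi-criteria rubric: weighted average of several checks, each in [0, 1].
function rubricScore(output, expected, criteria) {
  const totalWeight = criteria.reduce((sum, c) => sum + c.weight, 0);
  const weighted = criteria.reduce(
    (sum, c) => sum + c.weight * c.check(output, expected),
    0
  );
  return weighted / totalWeight;
}

// Illustrative rubric: valid JSON counts for 25%, field accuracy for 75%.
const criteria = [
  {
    weight: 1,
    check: (out) => {
      try { JSON.parse(out); return 1.0; } catch { return 0.0; }
    },
  },
  {
    weight: 3,
    check: (out, exp) => {
      try { return fieldOverlap(JSON.parse(out), exp); } catch { return 0.0; }
    },
  },
];

// One of two fields is correct, so field accuracy is 0.5.
const score = rubricScore('{"name":"Ada","year":1815}', { name: "Ada", year: 1816 }, criteria);
```

Partial credit of this kind gives the reinforcement loop a gradient to climb even when few candidates are fully correct, which a strict pass/fail grader cannot provide.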

Direct Preference Optimization (DPO)

DPO is a complementary technique for preference alignment: it teaches the model which of two outputs humans prefer. Instead of a scalar reward, you provide pairs of (chosen, rejected) responses to the same prompt. DPO adjusts the model to raise the likelihood of chosen outputs and lower that of rejected ones, measured against a frozen reference model, without needing a separate reward model.
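
The DPO objective for a single pair can be written out directly. The sketch below uses the standard DPO loss, -log σ(β · margin), where the margin compares how much more the trained policy favors the chosen response over the rejected one than a frozen reference model does. The log-probabilities and the `beta` value are made-up inputs for illustration; a hosted fine-tuning job would compute this internally.

```javascript
// Numerically stable log(sigmoid(x)).
function logSigmoid(x) {
  return x >= 0 ? -Math.log1p(Math.exp(-x)) : x - Math.log1p(Math.exp(x));
}

// DPO loss for one (chosen, rejected) pair.
// Each argument is the summed log-probability of a full response.
function dpoLoss({ policyChosen, policyRejected, refChosen, refRejected, beta = 0.1 }) {
  const chosenLogRatio = policyChosen - refChosen;
  const rejectedLogRatio = policyRejected - refRejected;
  const margin = beta * (chosenLogRatio - rejectedLogRatio);
  return -logSigmoid(margin);
}

// Policy identical to the reference: the margin is 0 and the loss is ln 2.
const atStart = dpoLoss({
  policyChosen: -12, policyRejected: -15, refChosen: -12, refRejected: -15,
});

// Policy has shifted toward the chosen response: the loss drops below ln 2.
const afterUpdate = dpoLoss({
  policyChosen: -10, policyRejected: -18, refChosen: -12, refRejected: -15,
});
```

Because the loss depends only on log-probability ratios against the reference model, no separate reward model has to be trained; `beta` controls how strongly the policy is allowed to drift from the reference.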

When to Use Each

Method              | Best For                                   | Data Required
SFT (Supervised)    | Format, tone, style changes                | Input → ideal output pairs
RFT (Reinforcement) | Verifiable tasks (code, math, extraction)  | Prompts + programmable grader
DPO (Preference)    | Subjective quality, safety alignment       | Chosen/rejected response pairs

💡 Key Insight: RFT excels when you can write a grader — code that passes tests, SQL that returns correct results, extractions that match schemas. DPO excels when quality is subjective and requires human judgment.
Reinforcement Fine-Tuning & DPO | Fine-Tuning & Distillation — OpenAI Academy