Model Evaluation & Leaderboards

How Do You Know If a Model Is Good?

With hundreds of open-source models available, choosing the right one requires understanding benchmarks — standardized tests that measure model capabilities across different dimensions.

Key Benchmarks Explained

| Benchmark | What it tests | Format | Gold standard? |
|---|---|---|---|
| MMLU | Knowledge across 57 subjects (STEM, humanities, law) | Multiple choice | Yes (broad knowledge) |
| HumanEval / MBPP | Code generation correctness | Write code, then run unit tests | Yes (coding ability) |
| MT-Bench | Multi-turn conversation quality | LLM-as-judge (GPT-4) | Yes (chat quality) |
| Chatbot Arena (LMSYS) | Human preference in blind A/B tests | Elo rating system | Gold standard overall |
| GSM8K | Grade-school math reasoning | Word problems | Yes (basic reasoning) |
| MATH | Competition-level mathematics | Free-form solution, graded on the final answer | Yes (advanced reasoning) |
| ARC-Challenge | Science reasoning (grade school) | Multiple choice | Moderate |
| IFEval | Instruction-following accuracy | Constraint satisfaction | Yes (instruction adherence) |
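
To make the "Elo rating system" entry for Chatbot Arena more concrete, here is a minimal sketch of how a single blind A/B vote moves two ratings under the classic Elo update. The K-factor of 32 and the starting rating of 1000 are illustrative assumptions, not the leaderboard's actual settings, and the real leaderboard computation may differ.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that model A is preferred over model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_update(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return the new (rating_a, rating_b) after one comparison.

    score_a is 1.0 if A won the vote, 0.0 if B won, and 0.5 for a tie.
    k=32 is an assumed K-factor chosen only for illustration.
    """
    delta = k * (score_a - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


if __name__ == "__main__":
    # Two equally rated models; a human voter prefers A.
    a, b = elo_update(1000.0, 1000.0, score_a=1.0)
    print(round(a, 1), round(b, 1))  # 1016.0 984.0
```

An upset (a low-rated model beating a high-rated one) moves both ratings more than an expected win does, which is why a few hundred votes are usually needed before rankings settle.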

Running Evals Locally with lm-eval-harness

The lm-evaluation-harness (by EleutherAI) is a widely used open-source framework for reproducible model evaluation:

# Install
pip install lm-eval

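# Run MMLU against a local OpenAI-compatible server (Ollama/vLLM).
# base_url must be the full completions endpoint, not just /v1.
# MMLU is scored from log-probabilities, so the server has to return
# prompt logprobs; if yours does not, use the hf backend shown next.
lm_eval --model local-completions \
  --model_args model=mistral-large,base_url=http://localhost:11434/v1/completions \
  --tasks mmlu \
  --num_fewshot 5 \
  --batch_size auto \
  --output_path ./results/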

# Run multiple benchmarks at once
lm_eval --model hf \
  --model_args pretrained=mistralai/Mistral-Small-4,dtype=float16 \
  --tasks mmlu,hellaswag,arc_challenge,gsm8k \
  --num_fewshot 5 \
  --batch_size 8 \
  --device cuda:0

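# Run HumanEval for code evaluation.
# This task executes model-generated code, so recent lm-eval versions
# require an explicit opt-in. Run it in a sandbox or container.
HF_ALLOW_CODE_EVAL=1 lm_eval --model local-completions \
  --model_args model=qwen3.5:32b,base_url=http://localhost:11434/v1/completions \
  --tasks humaneval \
  --batch_size 1 \
  --confirm_run_unsafe_code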
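
HumanEval scores are usually reported as pass@k: the probability that at least one of k sampled completions passes the unit tests. As a minimal sketch, the function below uses the standard unbiased estimator 1 - C(n-c, k) / C(n, k), where n completions were sampled for a problem and c of them passed. The function name and the sample counts in the example are illustrative assumptions.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem.

    n: completions sampled, c: completions that passed the tests,
    k: number of attempts the metric allows.
    """
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    # Probability that a random size-k subset contains no passing completion.
    all_fail = comb(n - c, k) / comb(n, k)
    return 1.0 - all_fail


if __name__ == "__main__":
    # Hypothetical problem: 10 samples drawn, 3 passed.
    print(pass_at_k(n=10, c=3, k=1))
    print(pass_at_k(n=10, c=3, k=5))
```

The benchmark score is the mean of this value over all problems. With greedy decoding and one sample per problem, pass@1 reduces to the plain fraction of problems solved.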

Interpreting Benchmark Results

  • Don't rely on a single benchmark: A model can ace MMLU but fail at coding. Test what matters for YOUR use case.
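  • Beware benchmark contamination: Some models may have trained on benchmark data, which inflates their scores. Cross-check with newer, held-out benchmarks.
  • Chatbot Arena Elo is widely regarded as the most reliable overall metric because it reflects real human preferences in blind comparisons. It measures general chat preference, though, not accuracy on your specific task.
  • Quantization impact: Always eval your quantized model, not just the FP16 base. Q4_K_M typically loses 1-3% on MMLU vs FP16, but the gap varies by model, so measure it yourself.
  • Few-shot vs zero-shot: MMLU is typically run 5-shot. Changing the shot count or prompt format shifts scores significantly, so only compare numbers produced with the same settings.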
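
Whether a gap of a few points between an FP16 and a quantized model is real or just noise depends on how many questions were scored. The sketch below is a rough check, assuming each question is an independent pass/fail trial: it computes the binomial standard error of each accuracy and tests whether the gap exceeds about two standard errors of the difference. The accuracies and question counts in the example are made-up illustration values, not measurements.

```python
from math import sqrt


def accuracy_stderr(acc: float, n: int) -> float:
    """Binomial standard error of an accuracy measured on n questions."""
    return sqrt(acc * (1.0 - acc) / n)


def gap_is_significant(acc_a: float, acc_b: float, n: int, z: float = 1.96):
    """Return (gap, stderr_of_gap, significant) for two accuracies.

    Assumption: the two runs are treated as independent samples of size n.
    Both models usually answer the same questions, so a paired test would
    be more precise; this version is a simple first check.
    """
    se_gap = sqrt(accuracy_stderr(acc_a, n) ** 2 + accuracy_stderr(acc_b, n) ** 2)
    gap = acc_a - acc_b
    return gap, se_gap, abs(gap) > z * se_gap


if __name__ == "__main__":
    # Hypothetical scores: FP16 at 70%, quantized at 68%.
    for n in (1000, 14000):
        gap, se_gap, significant = gap_is_significant(0.70, 0.68, n)
        print(n, round(gap, 3), round(se_gap, 4), significant)
```

In this illustration the same two-point gap is within noise on 1,000 questions but clearly distinguishable on 14,000, which is one reason small subsets can mislead. lm-eval-harness reports a standard error next to each metric that can be used the same way.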

MT-Bench: Multi-Turn Quality

MT-Bench uses GPT-4 as a judge to score model responses to 80 two-turn questions on a 1-10 scale across 8 categories (writing, roleplay, reasoning, math, coding, extraction, STEM, humanities):

# Clone FastChat and install it with the MT-Bench extras
git clone https://github.com/lm-sys/FastChat.git
cd FastChat
pip install -e ".[model_worker,llm_judge]"
cd fastchat/llm_judge

# Generate model answers
python gen_model_answer.py \
  --model-path /path/to/model \
  --model-id my-model

# Run GPT-4 judge
export OPENAI_API_KEY=sk-...
python gen_judgment.py --model-list my-model

# Show results
python show_result.py
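
The final step boils the judge's per-answer scores down to averages. As a rough sketch of that kind of aggregation (not FastChat's own code), the snippet below averages scores per turn and overall. The record layout with `turn` and `score` keys and the sample values are assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean


def summarize_scores(records):
    """Average judge scores per turn and overall.

    records: iterable of dicts with a "turn" (1 or 2) and a numeric "score".
    This layout is an assumed, simplified stand-in for real judgment files.
    """
    by_turn = defaultdict(list)
    for rec in records:
        by_turn[rec["turn"]].append(rec["score"])

    summary = {f"turn_{turn}": mean(scores) for turn, scores in sorted(by_turn.items())}
    all_scores = [score for scores in by_turn.values() for score in scores]
    summary["overall"] = mean(all_scores)
    return summary


if __name__ == "__main__":
    # Hypothetical judgments for two questions, two turns each.
    sample = [
        {"turn": 1, "score": 8},
        {"turn": 2, "score": 7},
        {"turn": 1, "score": 6},
        {"turn": 2, "score": 5},
    ]
    print(summarize_scores(sample))
```

Second-turn scores are often lower than first-turn scores, so it is worth reading the per-turn averages separately rather than only the overall number.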

💡 Practical Advice: For production model selection, run lm-eval-harness with benchmarks relevant to your domain, then do a manual vibe check with 50-100 real production prompts. Numbers tell half the story; qualitative evaluation on your actual workload tells the rest.
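
One way to keep the manual check honest is to review outputs blind. The sketch below samples prompts reproducibly and randomizes which model appears as "A" or "B" for each prompt, so the reviewer cannot tell them apart until the key is consulted. `build_blind_sheet`, the two `generate_*` callables, and the seed are hypothetical names for illustration; you would plug in your own model clients.

```python
import random


def build_blind_sheet(prompts, generate_x, generate_y, sample_size=50, seed=0):
    """Build a blind side-by-side review sheet.

    prompts: list of real production prompts.
    generate_x / generate_y: callables mapping a prompt string to a model
    output string (hypothetical stand-ins for your own model clients).
    """
    rng = random.Random(seed)  # fixed seed so the same sample can be rebuilt
    chosen = rng.sample(prompts, min(sample_size, len(prompts)))

    sheet = []
    for prompt in chosen:
        outputs = [("x", generate_x(prompt)), ("y", generate_y(prompt))]
        rng.shuffle(outputs)  # hide which model produced which answer
        sheet.append({
            "prompt": prompt,
            "A": outputs[0][1],
            "B": outputs[1][1],
            "key": {"A": outputs[0][0], "B": outputs[1][0]},
        })
    return sheet


if __name__ == "__main__":
    # Stub generators standing in for two real models.
    demo_prompts = [f"prompt {i}" for i in range(200)]
    sheet = build_blind_sheet(demo_prompts, str.upper, str.title, sample_size=3)
    for row in sheet:
        print(row["prompt"], "|", row["A"], "|", row["B"])
```

Keep the `key` column out of the reviewer's view and join it back in only after the preferences are recorded.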

Knowledge Check

Which benchmark is considered the gold standard for overall model quality?

  • MMLU
  • HumanEval
  • Chatbot Arena (LMSYS Elo rating)
  • ARC-Challenge
Model Evaluation & Leaderboards | Evaluation & Benchmarks — Open Source AI Academy