With hundreds of open-source models available, choosing the right one requires understanding benchmarks: standardized tests that each measure one dimension of model capability.

| Benchmark | What it tests | Format | Gold standard? |
|---|---|---|---|
| MMLU | Knowledge across 57 subjects (STEM, humanities, law) | Multiple choice | Yes, for broad knowledge |
| HumanEval / MBPP | Code generation correctness | Write code → run unit tests | Yes, for coding ability |
| MT-Bench | Multi-turn conversation quality | LLM-as-judge (GPT-4) | Yes, for chat quality |
| Chatbot Arena (LMSYS) | Human preference in blind A/B tests | Elo rating system | Yes, overall |
| GSM8K | Grade school math reasoning | Word problems, final answer checked | Yes, for basic reasoning |
| MATH | Competition-level mathematics | Free-form solution, final answer checked | Yes, for advanced reasoning |
| ARC-Challenge | Science reasoning (grade school) | Multiple choice | Moderate |
| IFEval | Instruction following accuracy | Verifiable constraint checks | Yes, for instruction adherence |
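
To make the "Elo rating system" entry for Chatbot Arena concrete, here is a minimal sketch of the textbook Elo update applied to pairwise votes. The K-factor of 32, the starting rating of 1000, and the function names are illustrative assumptions, not the leaderboard's actual configuration.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, outcome_a: float, k: float = 32.0):
    """Return the updated (rating_a, rating_b) after one blind A/B vote.

    outcome_a is 1.0 if A was preferred, 0.0 if B was preferred, 0.5 for a tie.
    k=32 is an assumed K-factor chosen for illustration only.
    """
    delta = k * (outcome_a - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


if __name__ == "__main__":
    # Hypothetical vote log: (model A, model B, outcome for A)
    ratings = {"model-a": 1000.0, "model-b": 1000.0}
    votes = [("model-a", "model-b", 1.0), ("model-a", "model-b", 0.0)]
    for a, b, outcome in votes:
        ratings[a], ratings[b] = update_elo(ratings[a], ratings[b], outcome)
    print(ratings)
```

A win against a higher-rated model moves the ratings more than a win against a lower-rated one, which is why a model's position depends on whom it was compared with and not only on its raw win rate.
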
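
The lm-evaluation-harness (by EleutherAI) is a widely used framework for reproducible model evaluation and the backend behind Hugging Face's Open LLM Leaderboard. It is installed as the `lm-eval` package and run through the `lm_eval` command: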
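```bash
# Install (the api extra adds the HTTP client dependencies used by local-completions)
pip install "lm-eval[api]"

# Run MMLU against a local OpenAI-compatible server (Ollama/vLLM).
# base_url must be the full completions endpoint. MMLU is scored from
# log-probabilities, so the server must return prompt logprobs
# (vLLM does; verify this for your Ollama version).
lm_eval --model local-completions \
  --model_args model=mistral-large,base_url=http://localhost:11434/v1/completions \
  --tasks mmlu \
  --num_fewshot 5 \
  --batch_size auto \
  --output_path ./results/
```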
```bash
# Run multiple benchmarks at once
lm_eval --model hf \
  --model_args pretrained=mistralai/Mistral-Small-4,dtype=float16 \
  --tasks mmlu,hellaswag,arc_challenge,gsm8k \
  --num_fewshot 5 \
  --batch_size 8 \
  --device cuda:0
```
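
A multi-task run reports several metrics per task, so it helps to flatten the report before comparing models. The sketch below assumes the JSON layout written by recent 0.4.x releases: a top-level `results` key mapping each task to a flat dictionary with keys such as `acc,none` and `acc_stderr,none`. That layout and the helper names `summarize_results` and `load_report` are assumptions; check them against a report from your own installation.

```python
import json
from pathlib import Path


def summarize_results(report: dict) -> dict:
    """Map each task to its numeric scores, dropping stderr entries and labels.

    Assumes report["results"][task] is a flat dict keyed like "acc,none".
    Keys are kept whole so that variants such as "exact_match,strict-match"
    and "exact_match,flexible-extract" do not overwrite each other.
    """
    summary = {}
    for task, metrics in report.get("results", {}).items():
        summary[task] = {
            key: float(value)
            for key, value in metrics.items()
            if "stderr" not in key and isinstance(value, (int, float))
        }
    return summary


def load_report(path: str) -> dict:
    """Read one results file produced by a run that set --output_path."""
    return json.loads(Path(path).read_text())


if __name__ == "__main__":
    # Hypothetical report fragment, used here instead of a real file.
    example = {"results": {"arc_challenge": {"alias": "arc_challenge",
                                             "acc,none": 0.5,
                                             "acc_stderr,none": 0.01}}}
    for task, scores in summarize_results(example).items():
        print(task, scores)
```
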
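```bash
# Run HumanEval for code evaluation.
# The task executes model-generated code, so it needs an explicit opt-in;
# run it in a sandboxed environment.
export HF_ALLOW_CODE_EVAL=1
lm_eval --model local-completions \
  --model_args model=qwen3.5:32b,base_url=http://localhost:11434/v1/completions \
  --tasks humaneval \
  --batch_size 1 \
  --confirm_run_unsafe_code
```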
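
HumanEval scores are usually reported as pass@k: the probability that at least one of k sampled completions passes the unit tests. The sketch below implements the unbiased estimator from the HumanEval paper, 1 - C(n-c, k) / C(n, k), where n completions were sampled for a problem and c of them passed. The function name is hypothetical, and the harness computes this metric itself; the code only illustrates what the reported number means.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for a single problem.

    n: completions sampled, c: completions that passed the tests,
    k: budget being estimated (1 <= k <= n).
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures exist, so any k samples include a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # Hypothetical problem: 10 samples drawn, 3 passed.
    print(pass_at_k(n=10, c=3, k=1))
    print(pass_at_k(n=10, c=3, k=5))
```

The benchmark score is the mean of this value over all problems. With a single greedy sample per problem (n = k = 1) it reduces to the fraction of problems solved.
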
MT-Bench uses GPT-4 as a judge to score model responses on a 1-10 scale across 8 categories (writing, roleplay, reasoning, math, coding, extraction, STEM, humanities):
```bash
# Clone FastChat and install it with the MT-Bench judge dependencies
git clone https://github.com/lm-sys/FastChat.git
cd FastChat
pip install -e ".[model_worker,llm_judge]"
cd fastchat/llm_judge

# Generate model answers
python gen_model_answer.py \
  --model-path /path/to/model \
  --model-id my-model

# Run GPT-4 judge
export OPENAI_API_KEY=sk-...
python gen_judgment.py --model-list my-model

# Show results
python show_result.py
```
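
The headline MT-Bench number is an average of the judge's per-answer scores. The sketch below shows that aggregation over a list of judgment records. The field names `model`, `turn`, and `score`, and the convention that a negative score marks a judgment that could not be parsed, are assumptions about the judgment file and should be checked against your own output; `average_scores` is a hypothetical helper, not part of FastChat.

```python
from collections import defaultdict
from statistics import mean


def average_scores(judgments):
    """Return {model: {"turn_1": avg, "turn_2": avg, "overall": avg}}.

    Each judgment is a dict with "model", "turn" and "score" keys (assumed
    layout). Records with a negative score are treated as failed judgments
    and skipped.
    """
    by_model = defaultdict(lambda: defaultdict(list))
    for row in judgments:
        if row["score"] < 0:
            continue
        by_model[row["model"]][row["turn"]].append(row["score"])

    result = {}
    for model, turns in by_model.items():
        entry = {f"turn_{turn}": mean(scores) for turn, scores in sorted(turns.items())}
        entry["overall"] = mean(s for scores in turns.values() for s in scores)
        result[model] = entry
    return result


if __name__ == "__main__":
    # Hypothetical records standing in for a real judgment file.
    rows = [
        {"model": "my-model", "turn": 1, "score": 8},
        {"model": "my-model", "turn": 2, "score": 6},
    ]
    print(average_scores(rows))
```

Second-turn averages are typically lower than first-turn averages, so reporting both alongside the overall mean shows how well a model holds context across a conversation.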