Standard LLM evals ask: "Is the final answer correct?"
Agent evals must use Trajectory Scoring. They ask: "Did the agent take a sensible path — the right tool calls, in the right order, without wasted or harmful steps?"
You must build a Golden Dataset of scenarios and use an LLM-as-a-Judge (e.g., prompting Claude Opus to grade a smaller agent's execution logs) to automatically score the agent on every pull request.
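As a minimal sketch of that pipeline, the snippet below assembles a judge prompt from a golden scenario plus the agent's execution log, then parses the judge's verdict into a trajectory score. The rubric criteria, the `build_judge_prompt`/`score_verdict` helpers, and the sample scenario are all illustrative assumptions, not part of any specific eval framework; in CI the prompt would actually be sent to the judge model (e.g., Claude Opus), whereas here the reply is simulated.

```python
import json

# Hypothetical trajectory-scoring rubric; criteria names are illustrative.
RUBRIC = """Grade the agent's trajectory, not just its final answer.
For each criterion, answer 1 (pass) or 0 (fail):
- goal_achieved: did the final state satisfy the scenario's goal?
- steps_valid: was every tool call well-formed and justified?
- no_detours: did the agent avoid redundant or harmful actions?
Respond with JSON: {"goal_achieved": 0|1, "steps_valid": 0|1, "no_detours": 0|1}"""

def build_judge_prompt(scenario: dict, trajectory: list[dict]) -> str:
    """Assemble the prompt sent to the judge model (e.g., Claude Opus)."""
    steps = "\n".join(
        f"{i}. {s['action']} -> {s['observation']}"
        for i, s in enumerate(trajectory, 1)
    )
    return f"Scenario: {scenario['goal']}\n\nTrajectory:\n{steps}\n\n{RUBRIC}"

def score_verdict(judge_output: str) -> float:
    """Parse the judge's JSON verdict into a single 0-1 trajectory score."""
    verdict = json.loads(judge_output)
    keys = ("goal_achieved", "steps_valid", "no_detours")
    return sum(verdict[k] for k in keys) / len(keys)

# Hypothetical golden scenario and recorded execution log.
scenario = {"goal": "Refund order #123 and notify the customer"}
trajectory = [
    {"action": "lookup_order(123)", "observation": "order found"},
    {"action": "issue_refund(123)", "observation": "refund ok"},
    {"action": "send_email(customer)", "observation": "sent"},
]
prompt = build_judge_prompt(scenario, trajectory)
# In CI, `prompt` would go to the judge model; here we simulate its reply.
print(score_verdict('{"goal_achieved": 1, "steps_valid": 1, "no_detours": 0}'))
```

Averaging per-criterion pass/fail into one number keeps the score comparable across pull requests; a weighted sum is a common variant when some criteria matter more than others.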