Behind the scenes, evaluation moved from a single LLM call that decided pass/fail to a multi-turn agent. The agent can fetch screenshots from specific steps, expand summarized history, inspect step metadata, and decide when it has enough information to commit to a verdict. This architectural shift is what made the structured verdicts (March) and the Issue Reporter foundation (March) possible.
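To make the shape of that loop concrete, here is a minimal sketch in Python. It is an illustration, not the actual implementation: every name in it (`Step`, `Action`, `RunTools`, `evaluate`, `scripted_policy`) is hypothetical, and a scripted function stands in for the LLM so the example runs on its own. The turn budget is also an assumption, added so the loop always terminates.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Step:
    """One recorded step of a run (hypothetical shape)."""
    index: int
    summary: str       # short summary the agent sees up front
    full_history: str  # detail available only on request
    screenshot: str    # stand-in for an image reference
    metadata: dict


@dataclass
class Action:
    """What the agent wants to do on a given turn."""
    kind: str          # "screenshot" | "expand" | "metadata" | "verdict"
    step: int = 0
    passed: bool = False
    reason: str = ""


@dataclass
class Verdict:
    passed: bool
    reason: str
    turns_used: int


class RunTools:
    """Tools the agent can call to gather more evidence."""

    def __init__(self, steps: list[Step]) -> None:
        self._steps = {s.index: s for s in steps}

    def screenshot(self, i: int) -> str:
        return self._steps[i].screenshot

    def expand(self, i: int) -> str:
        return self._steps[i].full_history

    def metadata(self, i: int) -> dict:
        return self._steps[i].metadata


def evaluate(steps: list[Step],
             policy: Callable[[list[str]], Action],
             max_turns: int = 8) -> Verdict:
    """Run the multi-turn loop until the policy commits to a verdict."""
    tools = RunTools(steps)
    handlers = {"screenshot": tools.screenshot,
                "expand": tools.expand,
                "metadata": tools.metadata}
    # The agent starts with summaries only and pulls detail on demand.
    evidence = [f"step {s.index}: {s.summary}" for s in steps]
    for turn in range(1, max_turns + 1):
        action = policy(evidence)
        if action.kind == "verdict":
            return Verdict(action.passed, action.reason, turn)
        result = handlers[action.kind](action.step)
        evidence.append(f"{action.kind}({action.step}) -> {result}")
    # Assumed fallback: fail closed if the budget runs out.
    return Verdict(False, "turn budget exhausted without a verdict", max_turns)


def scripted_policy(evidence: list[str]) -> Action:
    """Stand-in for the LLM: inspect step 2, then decide."""
    seen = " ".join(evidence)
    if "metadata(2)" not in seen:
        return Action("metadata", step=2)
    if "screenshot(2)" not in seen:
        return Action("screenshot", step=2)
    failed = "error banner" in seen
    reason = "error banner visible at step 2" if failed else "no errors found"
    return Action("verdict", passed=not failed, reason=reason)


steps = [
    Step(1, "opened login page", "GET /login returned the form",
         "login form", {"status": 200}),
    Step(2, "submitted form", "POST /login with test credentials",
         "error banner over the form", {"status": 500}),
]
verdict = evaluate(steps, scripted_policy)
print(verdict)
```

The single-call design corresponds to a policy that returns a verdict on its first turn with only the summaries in hand; the agent version differs in that the policy chooses which evidence to pull before committing.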