The Assessment Agent finished rolling out to all users. Every run now produces a structured verdict including key observations, expected behavior, actual behavior, and a confidence score. This is particularly useful when a run is partially right, when the failure mode isn't obvious from the final screenshot, or when you want to understand why a test was flaky.
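To make the verdict's shape concrete, here is a minimal sketch of how a consumer might model it. The class name `Verdict`, the field names, the `passed` flag, the 0.0 to 1.0 confidence scale, and the `needs_review` helper are all assumptions made for illustration; they are not the agent's actual schema.

```python
from dataclasses import dataclass, field


@dataclass
class Verdict:
    """Hypothetical model of one run's verdict; all field names are assumed."""

    passed: bool  # assumed: an overall pass/fail flag
    key_observations: list[str] = field(default_factory=list)
    expected_behavior: str = ""
    actual_behavior: str = ""
    confidence: float = 0.0  # assumed scale: 0.0 (no confidence) to 1.0


def needs_review(verdict: Verdict, threshold: float = 0.7) -> bool:
    """Flag verdicts whose confidence is below an arbitrary threshold."""
    return verdict.confidence < threshold


# Example: a partially right run that is worth a second look.
verdict = Verdict(
    passed=False,
    key_observations=["Cart updated", "Order confirmation never appeared"],
    expected_behavior="Confirmation page is shown after checkout",
    actual_behavior="Spinner stayed on screen until timeout",
    confidence=0.55,
)
flagged = needs_review(verdict)
```

Filtering on confidence like this is one way to triage flaky or partially right runs without opening every screenshot.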
We also tightened how the agent treats transport-layer signals. Previously it could fixate on a 404 or 401 in the network log and mark a test as failed even when the user-facing step had actually succeeded. This was common with API tests, and with UI tests that pass through error pages on their unhappy path. Those cases are now assessed correctly.
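The idea can be pictured with a small sketch: the outcome follows the user-facing steps, and 4xx responses are kept as context rather than treated as failures. This is only an illustration of the principle, not the agent's real logic, which these notes do not describe. `StepResult`, `NetworkEvent`, and `assess` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class StepResult:
    """Hypothetical record of one user-facing step."""

    name: str
    succeeded: bool  # did the step reach its expected state?


@dataclass
class NetworkEvent:
    """Hypothetical entry from the run's network log."""

    url: str
    status: int


def assess(
    steps: list[StepResult], network: list[NetworkEvent]
) -> tuple[bool, list[str]]:
    """Decide pass/fail from step outcomes only.

    4xx responses are returned as informational notes and never change
    the outcome on their own.
    """
    passed = all(step.succeeded for step in steps)
    notes = [
        f"{event.status} from {event.url} (informational)"
        for event in network
        if 400 <= event.status < 500
    ]
    return passed, notes


# Example: an API test that expects anonymous requests to be rejected.
steps = [StepResult(name="rejects anonymous request", succeeded=True)]
network = [NetworkEvent(url="/api/me", status=401)]
passed, notes = assess(steps, network)
```

In this example the 401 is the expected behavior, so the run passes and the status code shows up only as a note.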