Sufficiency tracking, not raw accuracy
A cold agent answering correctly is not enough evidence that the workspace helped. Strong models can fill gaps from priors, and weaker ones can sound confident while inventing connective tissue that the surface never supplied.
Cheesy Theseus labels whether each rendered surface is actually sufficient before the model sees it. The win condition is answering when the workspace supports the answer and abstaining when it does not.