CheesyTheseus

This project is related to work I’m currently doing on Clide, where I’m exploring how to improve researcher workspace tooling, building a workspace usable by researchers across domains while remaining legible to agents that parse through them.

As with Clide, Cheesy Theseus is part experiment and part play—referencing the beloved bit player Claude Shannon’s own Theseus as an ode to someone who always seemed to balance both beautifully.

The research question — What does an agent need to work effectively in a workspace? An agent-legibility eval for non-code research workspaces — observing how a cold agent operates across different configurations of a workspace.

Sufficiency tracking, not raw accuracy

A cold agent answering correctly is not enough evidence that the workspace helped. Strong models can fill gaps from priors, and weaker ones can sound confident while inventing connective tissue that the surface never supplied.

Cheesy Theseus labels whether each rendered surface is actually sufficient before the model sees it. The win condition is answering when the workspace supports the answer and abstaining when it does not.

Controlled workspace surfaces

The H-line verticals isolate one workspace feature at a time: verification, retrieval, static structure, navigation, surface familiarity, and later mechanism-disentanglement work over committed relations and affordances.

Each vertical renders its own lean fixture and keeps the content stable while changing the form the agent receives. That makes the eval about the workspace surface, not about an app renderer or a hand-tuned prompt.

Findings for agent-readable tools

The reports treat tools and structure as tradeoffs. Navigation and verification affordances can improve answerability, but they can also make weaker models over-eager on planted gaps unless the model actually uses the affordance honestly.

That gives Clide and other research-workspace tools a more concrete design target: make the committed structure legible enough to help, and instrument whether agents used it before trusting the result.