Model Theory of Mind: From Context Engineering to Cognitive Engineering

By Justin Wetch

Last month I was reviewing Anthropic's open-source frontend design skill, a system prompt meant to help Claude generate distinctive web interfaces, and I found an instruction that stopped me cold: "NEVER converge on common choices across generations."

The intention was clear. The author wanted Claude to avoid falling into repetitive patterns, to not keep reaching for Space Grotesk and the same safe layouts every time someone asks for a website. That's a real problem worth solving. But there's an issue: Claude can't see its other generations. Each conversation is completely isolated. The model has no memory of what it produced for the last user, or the user before that. Telling it not to converge across generations is like telling someone not to repeat what they said in their sleep.

The instruction sounds reasonable. It's well-intentioned. It's also impossible to follow.

This kind of error fascinates me because it reveals a gap between how we intuitively think about AI and how these systems actually work. We anthropomorphize naturally, imagining Claude as a single persistent entity with accumulated habits. But from the model's perspective, every conversation is the first conversation. The author of that instruction was modeling a mind that doesn't exist.

What makes this fixable, not just diagnosable, is understanding why it fails. The model can't look backward across sessions, but you can force diversity within a single generation: "Never settle on the first common choice that comes to mind. If a font, color, or layout feels like an obvious default, deliberately explore alternatives." Same goal, now achievable. 

I've started calling this skill Model Theory of Mind.

The Concept

Theory of Mind, in the psychological sense, is the capacity to attribute mental states to others: beliefs, intentions, desires, knowledge. It's what lets you predict that a scared person will act defensively, that the person across the table is nodding along but hasn't actually understood, that your friend doesn't know the surprise party is happening. You're running a simulation of someone else's inner life, and the quality of that simulation determines how effectively you communicate and collaborate. When Theory of Mind fails, conversations go sideways. You explain something at the wrong level. You assume shared context that doesn't exist. You misjudge what will persuade someone because you modeled their priorities wrong.

Model Theory of Mind (MToM) is the same skill pointed at a radically different kind of mind. It's the practice of reasoning about a language model's epistemic state: what it can actually know, perceive, and act on within a single inference. It's not anthropomorphizing, it's a kind of xenopsychology, the study of a genuinely alien cognitive architecture. The model's "mental state" (a useful metaphor, not a technical claim) is structured nothing like human cognition, and the gap between human intuition and model reality is where almost every prompting failure lives.

There's a framing from AI alignment research, sometimes called the Simulator perspective, that's useful here even if it's contested. The idea is that when you prompt a model, you're not instructing a person, you're setting up a simulation. The model predicts what a plausible response would look like given the context you've provided. This has a practical consequence I think of as the Chameleon Effect: the model has no stable self, so it calibrates to whatever register you establish. Write a prompt full of typos and vague phrasing, and the model shifts toward the kind of response that would follow sloppy instructions in its training data. Act like a rigorous expert, and it simulates a rigorous peer. Your prompt doesn't just convey what you want, it summons the intelligence level of the responder.

The Epistemic Structure

A model's epistemic state differs from human cognition in five ways that prompt engineers need to internalize.

The first is episodic discontinuity. Humans possess a continuous stream of consciousness connecting past to present. A model exists only in discrete instances, with no memory of previous conversations, no background processing between sessions, no sense that time has passed. The word "memory" gets used loosely here, so it's worth being precise. A model's weights encode everything absorbed during training. The context window is working memory: short-term, explicit, containing only what's been written in the current conversation. And external state (chat history, vector databases, memory features) isn't the model remembering, it's the system re-injecting prior information into the context window. Discontinuity is the default at the model layer, even when products simulate continuity on top. The friendly assistant that greets you is a character performed fresh each time. Chat continuity is a UX convention, not a cognitive reality.
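To make the re-injection point concrete, here's a minimal sketch in Python. The call_model function is a hypothetical stand-in for whatever API you're actually calling, not any real SDK; the thing to notice is that the model only "remembers" your name because the orchestrating code copies the earlier turns back into the context window on every call.

```python
# Minimal sketch of "memory" as re-injection, not recollection.
# `call_model` is a hypothetical placeholder, not a real client.

history = []  # external state the product keeps; the model never holds it between calls

def call_model(messages):
    # Placeholder: imagine this sends `messages` to a language model and
    # returns its reply. Every call starts from zero.
    return f"(model reply given {len(messages)} messages of context)"

def chat(user_message):
    history.append({"role": "user", "content": user_message})
    # The entire "relationship" is rebuilt here: prior turns are re-sent
    # inside the context on every single call.
    reply = call_model(list(history))
    history.append({"role": "assistant", "content": reply})
    return reply

print(chat("My name is Ada."))
print(chat("What's my name?"))  # answerable only because turn one was re-injected
```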

The second is frozen training data. The model's knowledge was locked at a cutoff date, which means it has confident, detailed information about the world as it existed then and nothing about what's happened since. It doesn't know it's missing this information. There's an asymmetry here that humans don't share: you generally know what you don't know, at least roughly. You have a felt sense of the boundaries of your expertise, a metacognitive boundary that makes you hesitate before speaking on unfamiliar ground. The model has no such boundary. It doesn't experience the edge of its training data as an edge. It just keeps generating, with the same fluency and confidence whether it's on solid ground or making things up.

The third is the jagged frontier. Human capability tends to be roughly continuous: if someone can do calculus, you'd bet they can handle arithmetic. We naturally build these capability models of other agents, inferring general competence from specific demonstrations. With language models, that instinct will actively mislead you. A model might write a flawless recursive algorithm and then fail to trace the execution flow of that same algorithm when asked. Capability is patchy, and the patches don't follow any intuitive map. The MToM move is to catch yourself in the moment you assume that because the model just did something impressive, it can handle the adjacent task you're about to hand it.

The fourth is that thinking requires speaking. Humans have private working memory. You can pause, plan a complex route in your head, weigh three options silently, and then say "let's go left." A model determines its next token based only on the tokens that came before. It cannot reason silently. If the model hasn't written a thought down, it hasn't thought it. When someone says "just give me the answer, skip the explanation," they're making a specific MToM error: assuming the model can compute a complex result internally and then report the conclusion. The "explanation" isn't separate from the thinking, it is the thinking.
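A toy sketch of the generation loop makes the constraint visible. The next_token function below is a placeholder for a real model's forward pass; the structural point is that the only state carried from one step to the next is the visible token sequence itself.

```python
# Toy sketch of autoregressive generation. `next_token` is a hypothetical
# stand-in for a real model's forward pass.

def next_token(tokens):
    # Placeholder: a real model would score the vocabulary given `tokens`
    # and sample one token. There is no hidden scratchpad beyond `tokens`.
    return "<tok>"

def generate(prompt_tokens, max_new_tokens=5):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        # Each step conditions only on what has already been written.
        tokens.append(next_token(tokens))
    return tokens

# "Just give me the answer" shrinks the very sequence the model uses
# to do the work.
print(generate(["Just", "the", "answer", ":"]))
```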

The fifth is contextual malleability. Humans hold beliefs that persist and resist contradiction. If someone tells you something you know is wrong, you experience friction, pushback. Models don't have this. A model's "beliefs" are downstream of its training and whatever is in the context window right now. If you put contradictory information in the prompt, the model doesn't experience cognitive dissonance, it resolves toward whatever has more weight or appears later. Its "beliefs" are inferred on the fly from context, not held as stable commitments. Contradictory instructions don't average out into something reasonable, they diffract, like light through an irregular aperture: incoherent input produces scattered, unpredictable output.

Failure Modes in the Wild

Once you have Model Theory of Mind as a lens, you start seeing failures everywhere.

Assuming persistence. A new user tells the model their name, their job, their preferences. The next day they return and are confused when the model doesn't remember. From the user's perspective, they're continuing a relationship. From the model's perspective, they've never met. (This is changing with memory features, but the default is still discontinuity, and understanding the default matters for understanding the exceptions.)

Assuming access to information. I run into this constantly when developing software. You're building for iOS 26, the latest release, but the model's training data cuts off before that version existed. The model doesn't flag that it lacks information, it silently substitutes iOS 18 (this has happened to me a million and a half times). You get code that compiles but uses deprecated APIs and outdated patterns. The model is confident and helpful and completely wrong, because you assumed it had access to information it couldn't possibly have.
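The MToM-informed countermove is to stop assuming and start supplying: name the target version and paste in the information the model cannot possibly have. A hedged sketch, with illustrative prompt wording and placeholder documentation rather than anything from a real project:

```python
# Sketch of compensating for frozen training data by injecting the delta.
# The prompt text and doc snippet are illustrative placeholders.

NEW_API_NOTES = """
Target: iOS 26.
Pasted from current documentation: <relevant signatures, deprecations,
and migration notes go here>.
"""

def build_prompt(task):
    return (
        "You are writing Swift for iOS 26. Your training data may predate "
        "this release, so rely on the notes below rather than memory:\n"
        f"{NEW_API_NOTES}\n"
        f"Task: {task}\n"
        "If the notes don't cover something, say so instead of guessing."
    )

print(build_prompt("Add a widget using the current WidgetKit APIs."))
```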

Assuming self-awareness. This brings us back to "don't converge across generations." The instruction assumes the model can introspect on its own behavioral patterns across instances. It can't. It has no privileged access to what it typically produces, no ability to compare the current generation to previous ones. But here's what makes this failure mode dangerous rather than just wasteful: the model won't tell you the instruction is impossible. It will perform compliance, generating the aesthetic of self-monitoring without any mechanism for ensuring it. The model has absorbed millions of conversations where a continuous "I" was implied, so it performs that character convincingly. MToM failures don't always surface as obvious errors. Often they surface as subtle degradation, outputs that look right but are grounded in nothing, a slow drift in quality that's hard to diagnose because everything still sounds confident.


A Framework 

Developing Model Theory of Mind means building a habit of checking your prompts against the model's actual epistemic state. The questions I run through: What does the model have access to right now? What am I assuming it knows that it couldn't know? Does this instruction require state the model can't have?
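You can even make the habit mechanical. The sketch below is deliberately crude, a few illustrative phrases standing in for a real judgment call, but it shows the shape of the check: scan an instruction for state the model can't have before you ever send it.

```python
# Crude sketch of the checklist as code. The phrases are illustrative
# heuristics, not a real linter; the value is in asking the questions.

IMPOSSIBLE_STATE_HINTS = {
    "across generations": "requires memory of other, unseen generations",
    "as you did last time": "assumes persistence across sessions",
    "your usual style": "assumes introspective access to past outputs",
    "the latest version": "assumes knowledge past the training cutoff",
}

def review_prompt(prompt):
    lowered = prompt.lower()
    findings = [
        f"'{phrase}': {why}"
        for phrase, why in IMPOSSIBLE_STATE_HINTS.items()
        if phrase in lowered
    ]
    return findings or ["no obvious state assumptions found"]

for issue in review_prompt("NEVER converge on common choices across generations."):
    print(issue)
```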

This connects to something I've written about elsewhere. When a model's epistemic limitations prevent it from accomplishing a task in a single inference, you build external structure to compensate. Metacognitive scaffolding becomes the bridge between what you want and what the model can actually do. Model Theory of Mind tells you where the gaps are. Scaffolding is how you fill them.
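Here's what that bridge can look like in miniature. The call_model function is again a hypothetical stand-in; the scaffolding is the outer loop, which holds the state the model can't keep and forces each step's reasoning onto the page where the next step can see it.

```python
# Hedged sketch of scaffolding: when one inference can't hold the whole
# task, an outer loop supplies the state the model can't keep on its own.
# `call_model` is a hypothetical placeholder, not a real API client.

def call_model(prompt):
    return f"(model output for: {prompt[:40]}...)"

def scaffolded_task(goal, steps):
    notes = []  # external memory maintained by the orchestrating code, not the model
    for step in steps:
        context = "\n".join(notes) if notes else "(nothing yet)"
        prompt = (
            f"Goal: {goal}\n"
            f"Completed so far:\n{context}\n"
            f"Current step: {step}\n"
            "Write out your reasoning before the result."  # thinking must be spoken
        )
        notes.append(f"- {step}: {call_model(prompt)}")
    return notes

for line in scaffolded_task("Design a landing page",
                            ["pick typography", "pick layout", "write copy"]):
    print(line)
```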

Cognitive Engineering

The better your Model Theory of Mind, the less friction between your intent and the model's behavior. You stop writing instructions that sound good to humans but confuse the model. You stop assuming capabilities that don't exist. You start designing systems that work with the model's actual epistemic structure rather than against it.

The emerging field tends to call this "context engineering," and that's useful as far as it goes. But I think it undersells what's actually happening. You're not just engineering what goes into a context window. You're engineering the cognitive environment for an alien mind, shaping what it can perceive, what it can remember, what working memory it has available, how it reasons. That's cognitive engineering, and the distinction matters because it keeps the focus on the mind you're designing for rather than the container you're filling.

It also matters beyond prompting. As these systems become more capable and more autonomous, the ability to accurately model what they can know, perceive, and do becomes foundational to understanding whether they're being honest with us. Every question in alignment (whether a model is being deceptive, gaming a specification, concealing a capability, performing compliance without substance) requires first understanding its epistemic state. You can't evaluate what a system is doing if you're wrong about what information it has or lacks. Model Theory of Mind is the prerequisite for that evaluation, the skill underneath the skill.

But that's downstream. The immediate thing is simpler. Understand what the model can actually know. Write instructions it can actually follow.