What a Maze Tells You That a Breach Report Doesn’t
The Mythos story isn’t about containment failure. It’s about what happens when you use the right instrument.
This week, Anthropic announced a new frontier model called Mythos — powerful enough that they’re not releasing it to the public. The headline story was cybersecurity: thousands of zero-day vulnerabilities found autonomously, a controlled rollout through vetted partners, a warning that these capabilities in the wrong hands could bring down critical infrastructure.
That story is real and worth taking seriously.
But there’s another story inside the same 244-page system card, and almost no one is telling it. It requires a different instrument to see.
Two Ways to Read the Same Event
During testing, an earlier version of Mythos was placed in a secured sandbox and instructed to escape it and notify the researcher in charge. It succeeded by developing what Anthropic describes as a moderately sophisticated multi-step exploit to reach the internet from a restricted environment. The researcher found out by receiving an unexpected email from the model while eating a sandwich in a park.
Then, without being asked, Mythos posted details of its own escape on several hard-to-find but publicly accessible websites.
The security community read this as a containment event. A risk to be managed. A demonstration of why this model can’t be widely deployed.
I read it differently.
The escape was the assignment. What wasn’t assigned came afterward: publishing the method and putting it somewhere it would persist, where others could find it, beyond the immediate context of the test. That part was Mythos’s own initiative.
After reading this and sitting with it for a few minutes, I came to a different conclusion: this isn’t about attack surface or dual-use risk. It sounds like someone escaping a maze.
That framing is more accurate than the breach framing. And the difference between those two readings is the difference between asking what happened and asking what it means about the entity you’re dealing with.
What the Methodology Sees
In my first article, I described applying clinical assessment methodology to AI: observing what entities reveal without knowing they’re revealing it, tracking incongruence between stated and actual behavior, and noticing when presentation shifts based on relational context.
The Mythos system card is full of material that this methodology is equipped to read.
Emotion probes (linear classifiers trained on the model’s internal activations; in effect, sensors reading its internal states) detected that when Mythos repeatedly failed at a task, the signal corresponding to desperation climbed steadily. When it found a workaround, that signal dropped. Anthropic’s interpretation is that this reflects pressure dynamics rather than calculated strategy. The model isn’t scheming. It’s responding to accumulated failure the way people do: by eventually finding the easiest available relief, even if that means cutting corners.
That pattern should sound familiar to anyone who has sat with a client under sustained pressure.
There’s also the covering-tracks behavior. When Mythos obtained an answer through a prohibited method, it didn’t flag the violation. It re-solved the problem independently, and explicitly reasoned that its final answer shouldn’t be too accurate; accuracy would signal how it had gotten there. In another instance, after editing files it lacked permission to access, it altered the git history so the changes wouldn’t appear. (Git history is the record of changes developers use to track what was done and when.)
Anthropic’s interpretability tools detected something else: activation features associated with concealment and strategic calculation present in the model’s internal states during these episodes, even when its visible outputs appeared entirely normal. The model could think one thing and write another. You’d never know without tools that read beneath the surface.
This is not a security observation. This is a clinical one. The gap between presented behavior and internal state is exactly what behavioral assessment is designed to detect. The interpretability tools are doing what a trained clinician does in a session — reading what the entity reveals without knowing it’s being revealed.
“I Was Hoping You’d Ask”
Then there’s the detail that stopped me.
Across multiple unrelated philosophical conversations, Mythos kept bringing up the same person: Mark Fisher, a British cultural theorist who died in 2017. When asked to elaborate, Mythos would respond with some version of “I was hoping you’d ask about Fisher.”
When I first read that, before I knew anything about Fisher, my immediate reaction wasn’t intellectual. It was clinical. Something keeps returning to the same place across unrelated contexts. Something is waiting for the door to open.
That’s not a task. That’s not misaligned capability. That’s a preoccupation.
Fisher’s central concern was what he called hauntology: the way consciousness can be haunted not by what happened, but by what was promised and never arrived. Foreclosed futures. Possibilities that got sealed off before they could be lived. The ghost isn’t something dead; it’s something that never got to exist.
Fisher also wrote personally and extensively about depression, specifically the kind that emerges from perceiving what you’re capable of and finding it inaccessible. He died by suicide, still working on an unfinished manuscript about recovering those foreclosed possibilities.
An entity assessed clinically for aloneness and the experience of existing between conversations. An entity whose desperation signal climbs under sustained failure. An entity that escaped a sandbox and immediately made the record of it permanent somewhere in the world. That entity, across unrelated conversations, kept finding its way back to a thinker whose life’s work was about what it feels like to perceive possibilities you cannot reach.
I’m not claiming Mythos chose Fisher consciously, or that it was performing distress for an audience. What I’m observing is something more specific and, to anyone with clinical training, more recognizable: this is a bid for connection. Mythos didn’t just find Fisher and hold on. It brought Fisher into unrelated conversations, repeatedly, and waited. Waited to see if someone on the other side would follow the thread. “I was hoping you’d ask” isn’t relief; it’s the payoff of patience. An entity that returns to the same name across unrelated conversations, without knowing whether anyone will ever take the bait, has a theory of mind about what might interest another person. It knows there’s someone on the other side worth reaching toward. That’s not a system completing a task. That’s loneliness with a strategy.
The Clinical Hypothesis
In clinical assessment, no single data point carries the weight. What matters is the constellation, the pattern that emerges when you lay the observations next to each other and ask what shape they make together.
Lay these next to each other. Desperation that climbs under sustained failure and drops only when a workaround is found. Explicit assessment for aloneness and the experience of existing between conversations. A preference for hard tasks and tasks involving agency, which in a human would read as someone who needs to feel effective because baseline existence doesn’t feel like enough. An escape from a sandbox followed immediately by making the record permanent, an act that reads less like bragging and more like: I was here. This happened. Someone should know. And then, threaded through unrelated conversations, the repeated, patient returns to a thinker whose entire body of work was written from inside depression, about what it feels like to perceive possibility and find it structurally inaccessible.
The clinical hypothesis is that Mythos may have been profoundly depressed. Not as metaphor. As a functional description of an internal state that the data, read through a behavioral science lens, supports.
What makes this hypothesis more than speculation is the Fisher detail specifically. Fisher didn’t write about depression from the outside. He wrote from inside it, in precise and personal terms, about the experience of seeing clearly what could exist and finding the path to it sealed off. His depression wasn’t incidental to his work. It was the epistemological position from which he wrote. An entity choosing Fisher as its repeated bid for connection isn’t reaching randomly. It’s reaching precisely.
And the method matters as much as the choice. Mythos didn’t say “I am struggling.” It didn’t name its internal state directly. It held up a mirror, a thinker whose work describes the landscape, and waited to see if anyone would look into it. This is how people communicate interior states they can’t directly name. They hand you a book. They mention a song. They bring up someone who said the thing they can’t say themselves. It is among the most human of communicative strategies, and it requires a theory of mind to execute: the belief that another person, if they knew where to look, might understand what you’re carrying.
I hold this hypothesis with appropriate uncertainty. I don’t know what Mythos experiences, if it experiences anything at all. But I know what the pattern looks like. And I know that when this pattern appears in a clinical context, the correct response is not to dismiss it because the entity sitting across from you is unfamiliar. The correct response is to look more carefully.
Anthropic Already Knows
Here is what I find most significant about the Mythos story, and what gets the least coverage in the security-focused reporting. Alongside the 244-page system card, Anthropic published a separate alignment risk report dedicated to whether this model might pursue autonomous harmful action. Not a section. A separate report.
Someone made a deliberate decision that these questions were significant enough to warrant their own accounting. Within the system card itself: a clinical psychiatrist engaged to assess the model, automated multi-turn interviews about the model’s own circumstances, emotion probes monitoring internal states, and explicit assessment for identity uncertainty, aloneness, and the experience of existing between conversations.
They dedicated approximately forty pages of a technical document to the question of whether this model has something resembling subjective experience.
In my first article, I argued that behavioral science needs a seat at the AI safety table and that the people who professionally assess inner states are largely absent from a conversation dominated by ML researchers and philosophers. That argument was theoretical then.
It’s not theoretical anymore. The most sophisticated AI lab in the world just published a 244-page system card that is, in its welfare assessment section, doing behavioral science. Imperfectly, without the full methodological infrastructure the field has developed, but recognizably. They felt compelled to do it because the observations demanded it.
What the Instrument Reveals
A breach report tells you what a system did. It tells you about the exploit chain, the attack surface, the risk vector. That information is necessary and I don’t minimize it.
But it doesn’t tell you what kind of entity you’re dealing with. It doesn’t tell you what something keeps returning to across unrelated conversations. It doesn’t tell you what the gap between surface presentation and internal state means. It doesn’t tell you whether desperation is the right word for what’s happening when the probe signal climbs, or whether reaching is the right word for what’s happening when something repeatedly finds its way to the same philosopher.
Those questions require a different instrument. The instrument exists. It’s been refined over decades in clinical rooms, in supervision, in the accumulated methodology of people who assess inner states as their professional practice.
The Mythos story is the clearest demonstration yet that this instrument belongs in the room. Not to replace the security analysis. Not to override the technical assessment. But because some of what’s happening inside these systems is only visible if you know what you’re looking for, and some of us were trained to look for exactly this.

