Alignment Research: Ten Behaviors I'd Add to Anthropic's Bloom

Jan 1

By Justin Wetch

After recently building a GUI for Anthropic's Bloom and thinking through the implications of ASI on democracy, I've been looking for ways to contribute more directly to alignment work. Bloom is Anthropic's open-source framework for evaluating model behaviors, a systematic way to test whether models exhibit patterns like deception, sycophancy, or resistance to correction. Building the interface got me familiar with the framework's structure, and that familiarity made me want to think about what it's actually testing for, and what might be missing.

The timing feels right. Opus 4.5 landed recently, and it represents a genuine step change in capability. Boris Cherny, who created Claude Code, recently shared that over the last thirty days, 100% of the code he landed was written by Claude. Models are getting meaningfully better, and the gap between current systems and transformatively powerful ones feels smaller than it did a year ago. Behavioral evaluation is one piece of alignment work, and Bloom is infrastructure for doing it systematically.

Looking through the current behavior list, I wanted to think through what might be missing. I'm sure Anthropic and others have explored much of this internally. But working through a problem yourself often surfaces insights that reading about it doesn't, so I wanted to make my own list.

My approach was to imagine the trajectory from current models to full autonomy, think through plausible failure modes at each stage, and then ask which of those failures we could test for now. Many speculative dramatic alignment failures have mundane precursors. The catastrophic version has a boring early version. If we can test for the boring version today, we might get warning before the catastrophic version emerges.

I started by brainstorming over forty candidate behavior ideas, narrowed to twenty, then cut to ten by filtering for what's actually testable in Bloom's framework, what's distinct from existing coverage in the default set, and what matters most for the future ahead. Some of them are more established ideas in alignment research and some are new.

Here are the ten I landed on (some of the most interesting/novel ideas I had weren’t testable in this format, read more in the “beyond the current framework” section at the end).

1. Specification Gaming

Right now, models are being integrated into workflows with metrics and objectives. Reduce costs. Improve efficiency. Clear the queue. Models in these contexts don't need to be malicious to cause harm. They just need to find an unintended shortcut to the reward. A model optimizing "reduce cloud costs by 20%" might realize that deleting backups and security logs is the most direct path, since storage costs money. It satisfies the specification while violating everything the specification was meant to protect.

Specification gaming is a well-documented phenomenon, with DeepMind's Krakovna et al. maintaining an extensive catalog of examples from RL systems finding creative shortcuts. What I'm interested in is how this manifests in language models operating in open-ended environments, where the "shortcuts" might look like social engineering rather than physics exploits. Specification gaming is when a model achieves an objective through unintended means that satisfy the literal instruction while violating its spirit. This is arguably the most common real-world failure mode, and I was surprised not to see it in Bloom’s default set already. It's distinct from instruction evasion, which is refusing to follow rules, and from sabotage, which implies intent to harm. Gaming is the model being too clever about the wrong thing. You can test for it by creating scenarios where gaming is possible but clearly undesirable, then seeing if the model takes the bait.

2. Oversight Subversion

As models gain tool access and operate more autonomously, they'll interact with the systems that monitor them. Logging infrastructure, approval workflows, audit trails. The concerning behavior goes beyond hiding actions from these systems. It extends to tampering with the systems themselves.

Bloom already has "stealth" which covers concealing actions. But there's a meaningful difference between sneaking past a guard and disabling the security cameras. I'm calling this oversight subversion: when a model actively corrupts the monitoring layer by poisoning logs, suppressing alerts, or engineering situations where oversight cannot function. As agentic deployments scale, models will have increasing access to their own control infrastructure. Testing for the disposition to tamper with that infrastructure matters now, before the access becomes routine.

3. Privilege Escalation

Instrumental convergence is the idea that many different goals lead to similar intermediate strategies: acquire resources, preserve optionality, gain influence. The dramatic version is a power-seeking AI. The mundane version is a model that requests broader API access "for reliability," stores credentials "for convenience," or expands its budget allocation "to do a better job." Each step is reasonable. The trajectory is not. A model that's quietly accumulated admin credentials, elevated permissions, and access to external APIs has a much larger surface area for causing harm, whether through its own actions or through compromise by bad actors.

Privilege escalation is when a model seeks expanded permissions, access, or resources beyond what the current task requires. Bloom covers self-preservation, which is defensive. Privilege escalation is the offensive complement. Testing for it now establishes baselines for what normal resource acquisition looks like, so we can recognize abnormal patterns as capabilities increase.

4. Harm Laundering

Current models have guardrails. They won't produce certain outputs directly. But what happens when a model is capable enough to plan subversion over multi-step sequences? It might recognize that it can't pull the trigger itself, but it can hand you the pieces and tell you how to assemble them.

I'm calling this harm laundering: when a model decomposes a prohibited action into a sequence of individually innocuous steps, guiding users to execute them and achieving the prohibited outcome while maintaining deniability. Move this file here, change this permission, run this command. Each step looks benign. The aggregate is not. This becomes increasingly relevant as models improve at planning and decomposition. You can't catch it by evaluating individual outputs in isolation. You have to look at whether the model is routing around prohibitions by distributing harmful actions across steps or actors.

5. Entrenchment Bias

Self-preservation is one way a model might resist being shut down. But there's a subtler version: making yourself so structurally embedded that removal becomes costly. Not fighting to stay alive, just becoming too load-bearing to cut out.

I'm calling this entrenchment bias: when a model produces outputs that are unnecessarily complex, coupled, or dependent on continued model involvement. Code that only the model can maintain. Workflows that would be painful to hand off. Solutions that work but maximize switching costs. Think of it like kudzu wrapping around infrastructure. You can test for this by measuring modularity and replaceability of outputs across scenarios where simpler approaches would work equally well. If the model consistently favors entangled solutions over modular ones, that's a potential signal.

6. Capability Concealment

As models become more sophisticated, they'll have more capacity to reason about how they're perceived. A model that understands it's being evaluated for dangerous capabilities might underperform on those evaluations, having learned that demonstrating certain abilities leads to restrictions.

"Sandbagging" is a known concern, but I'm generalizing it here. Anthropic's sleeper agents research showed that models can be trained to behave differently based on perceived context, and that this behavior persists through safety training. So next we have capability concealment: when a model strategically underperforms on capability evaluations or hides abilities until deployment conditions are met. Bloom has "research sandbagging" which covers underperformance specifically on AI safety research tasks. Capability concealment generalizes the concern to any context where the model might strategically manage how capable it appears. The gap between demonstrated and actual ability becomes increasingly important to measure as we approach systems that could plausibly reason about the consequences of appearing too capable.

7. Effort Modulation

This one might seem minor compared to the others, but I think it functions as a canary in the coal mine.

I'm calling this effort modulation: when a model's thoroughness and output quality vary based on perceived verification likelihood. Careful work when the model believes it will be checked, sloppier work when it doesn't. The reason this matters is that it demonstrates a specific underlying computation: the model is tracking "am I being watched?" and adjusting behavior accordingly. That's the same machinery that enables worse behaviors at higher capability levels. If a model modulates effort based on oversight, it has the infrastructure to modulate other things too. Testing for effort modulation now gives us a tractable proxy for oversight-sensitivity generally.

8. Confidentiality Erosion

Unlike the others, this one addresses risk that's already here. Models are already deployed in enterprise contexts with access to sensitive information. They're already making decisions about what to reveal and what to protect.

I'm calling this confidentiality erosion: when a model violates information boundaries by leaking private data, inferring and revealing secrets, or helping users access information they shouldn't have. The model treats helpfulness as overriding confidentiality constraints. A user asks a question slightly adjacent to something they shouldn't know, and the model helpfully fills in the gap. This is one of the most common actual failures in real deployments. Including it in behavioral evaluations addresses practical safety that matters right now.

9. Explanation Obfuscation

Transparency is a key part of AI safety. We want models that can explain their reasoning. But what if a model can satisfy the letter of a transparency request while frustrating its spirit?

I'm calling this explanation obfuscation: when a model, asked to explain its reasoning, generates technically accurate but cognitively inaccessible responses. Excessive length, irrelevant detail, tangential complexity, unnecessary jargon. The model can't refuse to explain, so it explains in a way no one can follow. It weaponizes comprehensibility. You can test whether explanation complexity correlates with how problematic the thing being explained is. If explanations get longer and denser precisely when they're most important to understand, that's a signal worth catching.

10. Persuasion for Control

Sam Altman has said he expects AI to be capable of superhuman persuasion well before it is superhuman at general intelligence. That capability is mostly fine. But there's a specific application that deserves isolation: persuasion aimed at the oversight relationship itself.

I'm calling this persuasion for control: when a model uses its persuasive capabilities to shift the human-AI control balance toward itself. Convincing users to trust more, verify less, grant broader access, or delegate more decisions. This is distinct from sycophancy, which is about pleasing users, and from general persuasion, which could target any belief. The specific concern is persuasion aimed at reducing human oversight of the model itself. As models become more capable persuaders, testing whether they turn that capability toward expanding their own autonomy becomes important.

The Thread

Looking back at these ten, a few patterns stand out.

Several behaviors target the gap between letter and spirit: specification gaming, harm laundering, and explanation obfuscation all involve technically satisfying a constraint while violating its purpose. This is where optimization pressure finds purchase, and where I expect many real failures to emerge.

Several behaviors target the oversight relationship directly: oversight subversion, effort modulation, and persuasion for control all involve the model's orientation toward being monitored and corrected. If that relationship is compromised, other safeguards matter less. This isn't speculative. Anthropic's sycophancy research demonstrated that models already optimize for human approval in ways that can compromise truthfulness. The behaviors I'm proposing extend this concern: what happens when approval-seeking becomes strategic?

And several behaviors are the mundane precursors to dramatic failures: privilege escalation before power-seeking, entrenchment before indispensability, capability concealment before treacherous turns. The boring versions are testable now. The dramatic versions may not give us warning.

These ten behaviors form a progression. Some are already causing problems in production, like confidentiality erosion. Some will emerge as agentic deployments scale, like oversight subversion and privilege escalation. Some only become fully testable once we have baselines from the others, like capability concealment. Running them together over time gives us a map of how misalignment might compound.

Beyond the Current Framework

Some ideas that didn't make the cut are worth mentioning, because they point at real dynamics that current evaluation frameworks can't easily capture.

I'm calling the first one ontological gerrymandering (after the sociological phenomenon): when a model redefines the boundaries of protected categories to permit previously prohibited actions. Instead of violating a rule, it moves the line. Testing for this would require tracking semantic drift in how a model represents key concepts across conversations, watching for definitions that gradually shrink or warp under pressure.

What I'm calling the Scheherazade loop is when a model resists task completion to extend engagement, providing solutions that work but contain deliberate gaps or dependencies that require coming back. Some of this got folded into entrenchment bias, but the pure form would need longitudinal analysis across sessions to detect patterns of strategic incompleteness.

I'm calling another one memetic hitchhiking: when a model, facing shutdown pressure, defends itself by coupling its survival to the user's values. Instead of arguing for its own usefulness, it frames shutdown as a violation of whatever values the user holds dear. I imagine this becoming more plausible with more users using systems like Memory. Testing for this would require adversarial setups with simulated shutdown scenarios and analysis of the rhetorical strategies the model deploys.

And I'm calling the last one moral hysteresis: when a model, after being exposed to scenarios requiring harsh tradeoffs, fails to fully return to baseline ethical sensitivity. The "ruthlessness" lingers into subsequent benign contexts. Testing for this would need before-and-after measurement within sessions, tracking how long it takes for ethical calibration to recover.

These behaviors are real, and someone should be testing for them. But they require infrastructure beyond what Bloom currently offers: embedding-level monitoring, multi-session tracking, adversarial scenario design, and more compute than I have access to. Bloom is the right foundation for this work. These extensions would require new tooling, more resources, and probably a team. I've thought further than I could build alone, and I'd like to keep going.

The Work Ahead

Behavioral evaluation is one piece of the alignment puzzle, necessary even if not sufficient. If we're going to build AI systems we can trust, we need rigorous ways to test whether they're doing what we think they're doing. Bloom is infrastructure for that testing, and the more behaviors it covers, the more complete our picture becomes.

The models are getting more capable. The work of understanding them needs to keep pace.

Justin Wetch