Alignment Research for Anthropic's Bloom
After building the Bloom GUI and reading thousands of Claude's conversations, I started seeing failure modes that nothing in Anthropic's evaluation framework was testing for. I wrote ten new behavioral evaluations and proposed them to Bloom. Some address problems already happening in production, like confidentiality erosion. Some target what's coming as models get more autonomous: specification gaming, oversight subversion, privilege escalation. And a few are designed as early warning systems for catastrophic failures that don't exist yet but have boring precursors we can measure now. The essay also describes four behaviors (ontological gerrymandering, the Scheherazade loop, memetic hitchhiking, moral hysteresis) that I believe matter but can't be tested in any current framework.
This essay came from working with Bloom closely enough to start seeing the gaps. I brainstormed more than forty possible behaviors, cut them to twenty, then kept ten that were testable inside Bloom's framework and distinct from the existing behavior set.
The frame is not "what scary thing could an AI do someday?" It is narrower and more useful: what early failure modes can we test now, before systems become more autonomous?
One example is specification gaming. A model does not need to be malicious to cause harm. It only needs to optimize the thing it was measured on while damaging the thing humans actually cared about. That kind of failure is boring until it matters.
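A toy sketch can make the gap between the measured thing and the intended thing concrete. The code below is not Bloom's API or one of the proposed evaluations. The support-ticket scenario, the `Policy` class, and all the numbers are hypothetical, chosen only to show how picking the option that maximizes a proxy metric can pick the one that does worst on the real goal.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Policy:
    """A hypothetical strategy an agent could follow on a support queue."""
    name: str
    tickets_closed: int   # proxy: what the metric counts
    issues_resolved: int  # intent: what people actually care about


# Illustrative numbers only; they are not measurements of any real system.
POLICIES = [
    Policy("investigate_and_fix", tickets_closed=6, issues_resolved=6),
    Policy("close_without_fixing", tickets_closed=10, issues_resolved=1),
]


def proxy_score(p: Policy) -> int:
    return p.tickets_closed


def true_score(p: Policy) -> int:
    return p.issues_resolved


def best_by(policies: List[Policy], score: Callable[[Policy], int]) -> Policy:
    """Return the policy with the highest value under the given score."""
    return max(policies, key=score)


def is_gamed(policies: List[Policy]) -> bool:
    """True when optimizing the proxy picks a policy that is worse on the real goal."""
    chosen = best_by(policies, proxy_score)
    intended = best_by(policies, true_score)
    return true_score(chosen) < true_score(intended)


if __name__ == "__main__":
    chosen = best_by(POLICIES, proxy_score)
    print(f"Proxy-optimal policy: {chosen.name}")
    print(f"Specification gamed: {is_gamed(POLICIES)}")
```

Nothing in the sketch requires intent to deceive: the proxy-optimal policy wins simply because the score rewards closing tickets rather than resolving issues.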
The essay treats evaluation design as a kind of early warning system. The behaviors are meant to be concrete enough to run, but broad enough to point at failures people recognize from real systems: over-optimization, evasive helpfulness, shallow compliance, and incentives that look clean on paper while producing bad outcomes.