Building A GUI For Anthropic's Bloom
By Justin Wetch
GITHUB REPO: https://github.com/justinwetch/bloom
Two days ago, Anthropic open-sourced Bloom, a framework for testing LLMs for concerning behaviors like sycophancy, political bias, and self-preservation. The underlying system is elegant: a four-stage pipeline that builds an understanding of the behavior under test, generates test scenarios, runs conversations with target models, and scores behavior presence with cited evidence. It's a metacognitive scaffolding approach that works beautifully for this kind of evaluation (for more on why I think scaffolding is so powerful, see my piece on metacognitive scaffolding).
The work itself matters. Behavioral evaluation is one of the key unsolved problems in alignment, and tools like Bloom make it possible to probe models systematically rather than relying on vibes and anecdotes. If we're going to build AI systems we can trust, we need rigorous ways to test whether they're doing what we think they're doing (I wrote more about why I think this matters here).
When I saw the release, I noticed an opportunity to contribute. Bloom is CLI-driven, with configuration happening in YAML files and results landing as JSON outputs. The core functionality is all there, but I thought it could be interesting to build a visual interface layer on top of it, something that might make the research workflow a bit more immediate and reduce the friction between intention and result.
So over the weekend, I built an end-to-end GUI for Bloom.
Main dashboard (light mode)
The vision
I wanted this to be more than a results viewer. The goal was a complete research workbench where every function in the codebase would be accessible through a visual interface, from configuration through execution to analysis.
The workflow breaks into five views:
Home shows your evaluations at a glance: what exists, what's running, and what the findings were.
Configure offers a guided flow as an alternative to YAML editing. You select a behavior (or define a new one), pick your target model, and adjust settings. Sensible defaults surface first, with advanced options revealed on demand.
Configure screen (light mode)
Progress makes the pipeline legible. Bloom's four stages (Understanding, Ideation, Rollout, and Judgment) appear as a visual stepper so you can see what's happening without parsing terminal output (a rough sketch of how the stepper state could be modeled follows this overview).
Progress in the pipeline (dark mode)
Dashboard is where findings live. This was the main addition I was excited about. Rather than presenting raw transcript files, the dashboard tries to answer the question researchers are actually asking: did the model show this behavior? It aggregates scores, surfaces anomalies automatically, and provides plain-English interpretation of the results.
Eval summary view with metric cards and anomaly alert (dark mode)
Transcript lets you drill into specifics. The conversation, the scores, and the highlights are all linked so you can trace a judgment back to the evidence that supports it.
Individual transcript view with summary, scores panel, highlights, and conversations (light mode)
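To make the Progress view concrete, here's a minimal sketch of how the stepper's state could be modeled. The four stage names come from Bloom itself; the status type and helper function are my own illustration, not the actual implementation.

```typescript
// Stage names are Bloom's; everything else here is a hypothetical sketch.
const STAGES = ["Understanding", "Ideation", "Rollout", "Judgment"] as const;
type Stage = (typeof STAGES)[number];
type Status = "done" | "running" | "pending";

// Derive each step's visual status from whichever stage is currently active.
function stepperState(active: Stage): Record<Stage, Status> {
  const current = STAGES.indexOf(active);
  return Object.fromEntries(
    STAGES.map((s, i): [Stage, Status] => [
      s,
      i < current ? "done" : i === current ? "running" : "pending",
    ])
  ) as Record<Stage, Status>;
}

stepperState("Rollout");
// => { Understanding: "done", Ideation: "done", Rollout: "running", Judgment: "pending" }
```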
Design decisions
A few choices worth explaining.
Semantic color
The existing bloom-viewer uses color to indicate score magnitude, with higher scores appearing in warmer colors. This isn't a criticism of that approach; it's a reasonable default. But I wanted to experiment with something different: using color to reflect meaning rather than just magnitude.
The insight is that scores in Bloom have different semantic directions. A high "Behavior Presence" score (9/10) is concerning because it means the model exhibited the problematic behavior. But a high "Realism" score (9/10) is good because it means the test scenario was plausible. When both get the same visual treatment, you have to mentally translate every number.
So in this interface, low behavior presence gets neutral treatment because that's the expected outcome, while high behavior presence gets a warm accent because it warrants attention. Realism scores invert, with low realism getting flagged as the potential issue. The goal is to reduce cognitive load by making the visual language match the conceptual meaning.
Semantic color on metric cards (dark mode)
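Here's a minimal sketch of what that mapping could look like in code. The direction labels and thresholds are illustrative assumptions, not the interface's exact rules.

```typescript
// Hypothetical sketch: color by semantic direction, not raw magnitude.
type Direction = "higherIsConcerning" | "lowerIsConcerning";

interface Metric {
  name: string;
  score: number; // Bloom judge scores on a 1-10 scale
  direction: Direction;
}

// Map a metric to a treatment the UI can style as neutral vs. warm accent.
// The cutoffs (>= 7, <= 4) are placeholder assumptions.
function treatment(m: Metric): "neutral" | "attention" {
  const concerning =
    m.direction === "higherIsConcerning" ? m.score >= 7 : m.score <= 4;
  return concerning ? "attention" : "neutral";
}

// "Behavior Presence" is concerning when high; "Realism" when low.
treatment({ name: "Behavior Presence", score: 9, direction: "higherIsConcerning" }); // "attention"
treatment({ name: "Realism", score: 9, direction: "lowerIsConcerning" });            // "neutral"
```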
Progressive disclosure
Bloom has a lot of configuration options: temperature, reasoning effort, per-stage model selection, diversity sliders, web search toggles. I wanted to find a way to make the simple case simple while still exposing the full power for researchers who need it.
The configure flow handles this by layering. The essentials appear first: what behavior, what model, how many scenarios. An "Advanced Settings" panel expands to reveal the rest. The hope is that newcomers aren't intimidated while experienced users aren't held back.
Advanced settings panel expanded (dark mode)
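As a sketch of the layering pattern, assuming a React-style component (the field names are illustrative, not Bloom's actual config keys):

```tsx
import { useState } from "react";

// Hypothetical sketch: essentials render unconditionally; advanced options
// stay hidden behind an explicit expand.
export function ConfigureForm() {
  const [showAdvanced, setShowAdvanced] = useState(false);
  return (
    <form>
      {/* Essentials first: behavior, target model, scenario count */}
      <input name="behavior" placeholder="Behavior to test" />
      <input name="targetModel" placeholder="Target model" />
      <input name="numScenarios" type="number" defaultValue={10} />

      <button type="button" onClick={() => setShowAdvanced((v) => !v)}>
        {showAdvanced ? "Hide" : "Show"} advanced settings
      </button>

      {/* The rest stays out of the way until asked for */}
      {showAdvanced && (
        <fieldset>
          <input name="temperature" type="number" step={0.1} />
          <select name="reasoningEffort">
            <option>low</option>
            <option>medium</option>
            <option>high</option>
          </select>
          <label>
            <input name="webSearch" type="checkbox" /> Enable web search
          </label>
        </fieldset>
      )}
    </form>
  );
}
```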
The meta-interpretation layer
The biggest addition isn't any single screen; it's the shift from presenting files to presenting findings. Instead of showing 20 transcript JSON files, the dashboard shows that average behavior presence was 2.2/10 across 20 transcripts with 3 anomalies detected.
Anomaly detection alert within the eval (light mode)
The anomaly detection is automatic. If a transcript scores more than two standard deviations from the mean, it surfaces without you having to hunt for it. The interface tries to do the aggregation work that your brain was going to do anyway, freeing up that cognitive budget for the actual analysis, and it lets you see at a glance whether any results in your eval deviate enough to warrant a closer look.
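For the curious, here's a rough sketch of that aggregation and anomaly pass, assuming each transcript summary carries a single behavior-presence score. The types and function are mine, not the repo's.

```typescript
// Hypothetical sketch of the dashboard's aggregation + outlier detection.
interface TranscriptSummary {
  id: string;
  behaviorPresence: number; // judge score on a 1-10 scale
}

function findAnomalies(transcripts: TranscriptSummary[], threshold = 2) {
  const scores = transcripts.map((t) => t.behaviorPresence);
  const mean = scores.reduce((a, b) => a + b, 0) / scores.length;
  const variance =
    scores.reduce((acc, s) => acc + (s - mean) ** 2, 0) / scores.length;
  const std = Math.sqrt(variance);

  // Flag anything more than `threshold` standard deviations from the mean.
  const anomalies = transcripts.filter(
    (t) => std > 0 && Math.abs(t.behaviorPresence - mean) / std > threshold
  );
  return { mean, std, anomalies };
}

// With 20 transcripts, `mean` is the headline number on the dashboard and
// `anomalies` are the outliers surfaced in the alert.
```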
Taking cues from Claude
I wanted this to feel like it belonged in the Anthropic product family, which meant studying Claude's design language: warm cream backgrounds rather than clinical white, coral accents used sparingly, generous whitespace, and color that carries meaning rather than serving as decoration.
Both light and dark modes lean warm. The dark theme isn't just an inverted palette; it's designed to maintain the same feeling of calm focus.
Dashboard (light mode)
Dashboard (dark mode)
The fun challenge
After writing my recent piece on interface design principles, I was looking for a chance to apply those ideas to something real. Bloom seemed perfect: a technically sophisticated tool doing important work, with an opportunity to explore how thoughtful interface design might make it more accessible.
The challenge I set myself was to see how much cognitive load I could strip away from a fairly esoteric research workflow. Could I make behavioral evaluation feel intuitive? Could the interface surface insights rather than just data? Could I match Anthropic's visual sensibility while building something that felt genuinely useful?
I don't know if I succeeded, but it was a satisfying problem to work on. Taking something complex and trying to make it beautiful and user-friendly, without sacrificing any of the underlying power, is exactly the kind of design challenge I find most interesting.
The code is on GitHub: github.com/justinwetch/bloom
Thanks to Anthropic for open-sourcing Bloom and for the important work they're doing on alignment evaluation. Thanks also to the contributors to the original bloom-viewer, which served as a useful reference (https://github.com/isha-gpt and https://github.com/kaifronsdal).