SkillEval
A Visual Workbench for A/B Testing AI Skills
Description
How do you prove one AI prompt is better than another? Not vibes. Blind evaluation. SkillEval runs side-by-side comparisons of model outputs, scores them against defined criteria with the evaluator unable to see which version produced which result, and gives you a statistically grounded answer. Open source. I used it to test my rewrite of Claude's frontend design skill, and the rewrite showed a 75% win rate over the original.
More context
SkillEval came out of a very practical annoyance: I had rewritten a Claude skill, but I needed a way to prove the new version was better than the old one. "Feels better" was not enough.
The workflow is deliberately direct. Upload two skills, define evaluation criteria, run them through the same prompts, and let an AI judge score the outputs without knowing which skill produced which answer. The result is not magic truth, but it is better than vibe-checking a prompt change in your head.
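To make the blind-judging idea concrete, here is a minimal Python sketch. It is not SkillEval's actual code: the names `run_blind_trial`, `win_rate`, and `sign_test_p_value` are hypothetical, the judge is assumed to be any callable that returns "first", "second", or "tie", and the sign test is one simple way to check whether a win rate is distinguishable from a coin flip.

```python
import random
from dataclasses import dataclass
from math import comb
from typing import Callable, List

# Hypothetical judge signature: (prompt, first_output, second_output) -> verdict,
# where verdict is "first", "second", or "tie". In practice this would wrap a model call.
Judge = Callable[[str, str, str], str]


@dataclass
class Trial:
    prompt: str
    winner: str  # "A", "B", or "tie"


def run_blind_trial(prompt: str, output_a: str, output_b: str,
                    judge: Judge, rng: random.Random) -> Trial:
    """Show the two outputs in random order so the judge cannot tell which skill is which."""
    a_first = rng.random() < 0.5
    first, second = (output_a, output_b) if a_first else (output_b, output_a)
    verdict = judge(prompt, first, second)
    if verdict == "tie":
        return Trial(prompt, "tie")
    first_won = verdict == "first"
    # Map the positional verdict back to the skill that produced it.
    return Trial(prompt, "A" if first_won == a_first else "B")


def win_rate(trials: List[Trial], skill: str) -> float:
    """Share of decisive (non-tie) trials won by the given skill."""
    decisive = [t for t in trials if t.winner != "tie"]
    if not decisive:
        raise ValueError("no decisive trials to score")
    return sum(t.winner == skill for t in decisive) / len(decisive)


def sign_test_p_value(wins: int, losses: int) -> float:
    """Two-sided exact sign test: how surprising is this split if both skills were equal?"""
    n = wins + losses
    if n == 0:
        return 1.0
    k = max(wins, losses)
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

Randomizing the order per trial also guards against a judge that tends to favor whichever answer it reads first. The p-value depends heavily on how many prompts were run, so a win rate on its own says little without the trial count next to it.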
Later versions added support for zipped Agent Skill packages. That matters because real skills often have references, scripts, and assets, not a single Markdown file. SkillEval treats both single-file and zipped uploads as packages internally, so the test reflects how the skill actually runs.
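One way to picture "everything is a package" is a loader that turns either upload type into the same in-memory shape. The sketch below is an assumption-laden illustration, not SkillEval's implementation: `load_skill_package` is a hypothetical name, and it assumes the package entry point is a file called `SKILL.md`.

```python
import io
import zipfile
from typing import Dict

ENTRY_FILE = "SKILL.md"  # assumed entry-point file name for a skill package


def load_skill_package(data: bytes) -> Dict[str, bytes]:
    """Normalize an upload into {relative_path: file_bytes}, whether it is a zip or a lone Markdown file."""
    buffer = io.BytesIO(data)
    if zipfile.is_zipfile(buffer):
        with zipfile.ZipFile(buffer) as archive:
            files = {
                info.filename: archive.read(info)
                for info in archive.infolist()
                if not info.is_dir()
            }
    else:
        # A single Markdown upload becomes a one-file package.
        files = {ENTRY_FILE: data}

    # Accept the entry file at the root or inside a top-level folder.
    if not any(path.split("/")[-1] == ENTRY_FILE for path in files):
        raise ValueError(f"package has no {ENTRY_FILE}")
    return files
```

With this shape, the rest of an evaluation pipeline only ever deals with one type, and a single-file skill is just the smallest possible package.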
Facts
- Year: 2026
- Published: 2026-01-13
- Last updated: 2026-06-17
- Categories: AI, Open Source, Developer Tools
- Tags: AI, Skills, Evals, OpenAI, Claude, Gemini, Open Source, Dev Tools, UI, R&D
Links
- Read the full essay: https://www.justinwetch.com/blog/skilleval
- GitHub repo: https://github.com/justinwetch/SkillEval
Media
- Primary image: https://www.justinwetch.com/skilleval/configure-1.png
- Gallery image: https://www.justinwetch.com/skilleval/dark/configure-1.png
- Gallery image: https://www.justinwetch.com/skilleval/configure-2.png
- Gallery image: https://www.justinwetch.com/skilleval/dark/configure-2.png
- Gallery image: https://www.justinwetch.com/skilleval/evaluate-1.png
- Gallery image: https://www.justinwetch.com/skilleval/dark/evaluate-1.png
- Gallery image: https://www.justinwetch.com/skilleval/evaluate-results.png
- Gallery image: https://www.justinwetch.com/skilleval/dark/evaluate-results.png
- Gallery image: https://www.justinwetch.com/skilleval/evaluate-results-2.png
- Gallery image: https://www.justinwetch.com/skilleval/dark/evaluate-results-2.png
- Gallery image: https://www.justinwetch.com/skilleval/settings.png
- Gallery image: https://www.justinwetch.com/skilleval/dark/settings.png
- Gallery image: https://www.justinwetch.com/skilleval/thumbnail.jpg