skival

A/B test your AI coding agents

Run the same tasks with and without your changes. See what actually helps.

Control vs. Variations

Define a baseline and any number of variations -- different skills, models, prompts, or configs -- and run them against the same tasks.
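To make the idea concrete, here is a minimal sketch of a control and one variation expanded into a run plan. This is not skival's configuration format: the `Variation` class, the `plan_runs` helper, and the model, skill, and task names are all hypothetical, chosen only to illustrate pairing every variation with the same tasks.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Variation:
    """Hypothetical description of one arm of the experiment."""
    name: str
    model: str
    skills: tuple[str, ...] = ()


def plan_runs(tasks, variations, repeats=1):
    """Pair every task with every variation, `repeats` times each."""
    return [
        (task, variation.name, attempt)
        for task, variation, attempt in product(tasks, variations, range(repeats))
    ]


# Illustrative names only; none of these come from the tool itself.
baseline = Variation(name="control", model="model-a")
with_skill = Variation(name="with-skill", model="model-a", skills=("refactoring",))
tasks = ["fix-failing-test", "add-endpoint"]

runs = plan_runs(tasks, [baseline, with_skill], repeats=3)
print(len(runs))  # 2 tasks x 2 variations x 3 repeats
```

Because both arms see the identical task list, any difference in outcomes can be attributed to the variation rather than to the tasks.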

Check results with exit codes, expected output, custom scripts, HTTP assertions, or an LLM judge. Chain them together.
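One way to picture chained checks is as small predicates combined into a single pass/fail decision. The sketch below covers only the exit-code and expected-output checks, and every name in it (`RunResult`, `exit_code_is`, `output_contains`, `all_of`) is an assumption made for illustration, not the tool's actual interface.

```python
import subprocess
import sys
from typing import Callable, NamedTuple


class RunResult(NamedTuple):
    exit_code: int
    stdout: str


Check = Callable[[RunResult], bool]


def exit_code_is(expected: int) -> Check:
    return lambda result: result.exit_code == expected


def output_contains(text: str) -> Check:
    return lambda result: text in result.stdout


def all_of(*checks: Check) -> Check:
    """Chain checks: pass only if every check passes, stopping at the first failure."""
    return lambda result: all(check(result) for check in checks)


def run_command(argv: list[str]) -> RunResult:
    proc = subprocess.run(argv, capture_output=True, text=True)
    return RunResult(proc.returncode, proc.stdout)


# Stand-in for an agent's test command: a Python one-liner that prints a summary.
result = run_command([sys.executable, "-c", "print('3 passed')"])
check = all_of(exit_code_is(0), output_contains("3 passed"))
passed = check(result)
print(passed)
```

A custom script, an HTTP assertion, or an LLM judge would fit the same shape: each is just another function from a run result to a boolean.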

One run is an anecdote. Run each variation multiple times to get medians, variance, and a confidence estimate that the difference is real.
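As a rough illustration of what repeated runs buy you, the sketch below summarizes two sets of timings and estimates how often a gap this large would appear by chance. The permutation test is one possible technique, swapped in here as an assumption; the source does not say which statistical method the tool uses, and the timings are made-up sample data.

```python
import random
import statistics


def summarize(samples):
    return {
        "median": statistics.median(samples),
        "variance": statistics.variance(samples),  # sample variance
    }


def permutation_p_value(control, variation, trials=10_000, seed=0):
    """Share of random relabelings whose median gap is at least the observed gap."""
    rng = random.Random(seed)
    observed = abs(statistics.median(variation) - statistics.median(control))
    pooled = list(control) + list(variation)
    hits = 0
    for _ in range(trials):
        rng.shuffle(pooled)
        a, b = pooled[:len(control)], pooled[len(control):]
        if abs(statistics.median(b) - statistics.median(a)) >= observed:
            hits += 1
    return hits / trials


# Made-up completion times in seconds, five runs per arm.
control_secs = [41.0, 39.5, 44.2, 40.1, 42.8]
variation_secs = [33.9, 35.2, 31.8, 36.0, 34.4]

control_summary = summarize(control_secs)
variation_summary = summarize(variation_secs)
p_value = permutation_p_value(control_secs, variation_secs)
print(control_summary, variation_summary, p_value)
```

A small p-value suggests the gap between medians is unlikely to be run-to-run noise. With only a handful of runs per arm the estimate is coarse, which is the argument for running more.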

Every run tracks dollar cost, token usage, and time to completion alongside pass/fail -- so you can find the cheapest correct answer.
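Finding the cheapest correct answer amounts to filtering on pass/fail and then minimizing cost. The record layout below is hypothetical: the field names and figures are illustrative assumptions, not the tool's output schema or real measurements.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RunRecord:
    """Hypothetical per-run record; field names are assumptions."""
    variation: str
    passed: bool
    cost_usd: float
    tokens: int
    seconds: float


def cheapest_passing(records: list[RunRecord]) -> Optional[RunRecord]:
    """Return the lowest-cost run that passed, or None if nothing passed."""
    passing = [r for r in records if r.passed]
    return min(passing, key=lambda r: r.cost_usd, default=None)


# Illustrative numbers only.
records = [
    RunRecord("large-model", passed=True, cost_usd=0.42, tokens=61_000, seconds=95.0),
    RunRecord("small-model", passed=False, cost_usd=0.05, tokens=18_000, seconds=40.0),
    RunRecord("small-model+skill", passed=True, cost_usd=0.09, tokens=24_000, seconds=52.0),
]

best = cheapest_passing(records)
print(best.variation if best else "no passing run")
```

The cheapest run overall fails here, so it is excluded; cost only matters among runs that actually solved the task.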