Control vs. Variations
Define a baseline and any number of variations -- different skills, models, prompts, or configs -- and run them against the same tasks.
Know If It's Correct
Check results with exit codes, expected output, custom scripts, HTTP assertions, or an LLM judge. Chain them together.
Run It Enough Times
One run is anecdotal. Run each variation multiple times and get medians, variance, and confidence that the difference is real.
Cost, Speed, Correctness
Every run tracks dollar cost, token usage, and time to completion alongside pass/fail -- so you can find the cheapest correct answer.
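
To make the idea of repeated runs and "cheapest correct answer" concrete, here is a minimal Python sketch of how per-run records might be aggregated. It is not the tool's actual API or data format: the `Run` record, the `summarize` and `cheapest_correct` helpers, the variation names, and every number below are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass
from statistics import median, variance


@dataclass
class Run:
    """One hypothetical run record: which variation ran, and what it cost."""
    variation: str
    passed: bool
    cost_usd: float
    seconds: float


def summarize(runs):
    """Group runs by variation and compute pass rate, medians, and variance."""
    by_variation = {}
    for run in runs:
        by_variation.setdefault(run.variation, []).append(run)

    summary = {}
    for name, group in by_variation.items():
        costs = [r.cost_usd for r in group]
        times = [r.seconds for r in group]
        summary[name] = {
            "runs": len(group),
            "pass_rate": sum(r.passed for r in group) / len(group),
            "median_cost_usd": median(costs),
            "median_seconds": median(times),
            # Sample variance needs at least two data points.
            "cost_variance": variance(costs) if len(costs) > 1 else 0.0,
        }
    return summary


def cheapest_correct(summary, min_pass_rate=1.0):
    """Return the variation with the lowest median cost among those that
    meet the pass-rate threshold, or None if none qualifies."""
    eligible = {n: s for n, s in summary.items() if s["pass_rate"] >= min_pass_rate}
    if not eligible:
        return None
    return min(eligible, key=lambda n: eligible[n]["median_cost_usd"])


# Made-up data: a baseline and two variations, three runs each.
runs = [
    Run("baseline", True, 0.10, 30.0),
    Run("baseline", True, 0.12, 34.0),
    Run("baseline", True, 0.11, 31.0),
    Run("small-model", True, 0.04, 20.0),
    Run("small-model", False, 0.03, 18.0),
    Run("small-model", True, 0.05, 22.0),
    Run("short-prompt", True, 0.08, 25.0),
    Run("short-prompt", True, 0.07, 27.0),
    Run("short-prompt", True, 0.09, 26.0),
]

summary = summarize(runs)
print(cheapest_correct(summary))  # short-prompt
```

In this made-up data, "small-model" has the lowest median cost but fails one of its three runs, so a strict pass-rate threshold excludes it. Lowering `min_pass_rate` trades correctness for cost. A fuller analysis would also test whether the difference between variations is statistically meaningful, which this sketch leaves out.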