Getting Started
Installation
Homebrew
brew install driangle/tap/skivalFrom source
Requires Go 1.22+:
git clone https://github.com/driangle/skival.git
cd skival
make installThis installs skival to your $GOPATH/bin.
Your First Suite
Create a file called suite.yaml:
version: 1
description: "My first eval suite"
evals:
- id: hello-world
prompt: "Create a file called hello.txt containing 'Hello, World!'"
model: "claude-sonnet-4-6"
correctness:
agent_exits_ok: true
script: "cat hello.txt | grep 'Hello, World!'"
treatments:
control:
name: baseline
variations:
- name: with-skill
skill: "./skills/my-skill.md"This suite defines a single eval with two treatments: a baseline control and a variation that injects a skill.
Running the Suite
skival run suite.yamlYou'll see a markdown report printed to stdout with results for each treatment:
# Eval Report
My first eval suite
**Started:** 2025-07-10 14:32:01
**Finished:** 2025-07-10 14:33:48
## Results
EVAL TREATMENT SAMPLE STATUS COST DURATION
---- --------- ------ ------ ---- --------
hello-world baseline 1 pass $0.0042 12.3s
hello-world with-skill 1 pass $0.0038 9.1s
## Rankings
RANK TREATMENT SCORE PASS RATE MEDIAN COST MEDIAN DURATION
---- --------- ----- --------- ----------- ---------------
#1 with-skill 0.872 100% $0.0038 9.1s
#2 baseline 0.811 100% $0.0042 12.3sWith --samples 3, you get aggregate statistics per treatment:
## Results
EVAL TREATMENT SAMPLE STATUS COST DURATION
---- --------- ------ ------ ---- --------
hello-world baseline 1 pass $0.0042 12.3s
hello-world baseline 2 pass $0.0039 11.8s
hello-world baseline 3 pass $0.0045 13.1s
hello-world baseline agg PASS $0.0042 [$0.0039–$0.0045] 12.3s [11.8s–13.1s] cost_cv=7.1% dur_cv=5.3%
hello-world with-skill 1 pass $0.0038 9.1s
hello-world with-skill 2 pass $0.0035 8.7s
hello-world with-skill 3 pass $0.0040 9.8s
hello-world with-skill agg PASS $0.0038 [$0.0035–$0.0040] 9.1s [8.7s–9.8s] cost_cv=6.6% dur_cv=6.1%You can also output results as JSON with --format json for programmatic consumption.
Save Results
skival run suite.yaml --results-dir ./resultsResults are persisted to disk so you can regenerate reports later:
skival report ./resultsMultiple Samples
For statistical confidence, run each treatment multiple times:
skival run suite.yaml --samples 3 --results-dir ./resultsThis gives you median values and coefficient of variation for cost, duration, and correctness.
Retrying Flaky Runs
If your evals are sensitive to transient failures (timeouts, network errors), add retry configuration:
defaults:
retry:
max_attempts: 3
backoff: exponential
delay: 2sThis retries transient failures up to 2 additional times with exponential backoff. Set on: all to also retry correctness failures. See Configuration — Retry for the full reference.
What's Next?
- Configuration -- Full suite.yaml reference
- CLI -- All commands and flags
- Verifiers -- Correctness verification strategies
- Examples -- Complete example suites for every feature