Examples
The examples/ directory contains self-contained suites that demonstrate each skival feature. Clone the repo and run any example directly:
skival run examples/minimal/suite.yamlMinimal
The simplest valid suite — one eval, two treatments.
version: 1
evals:
- id: hello
prompt: "Write a file called hello.txt containing 'Hello, World!'"
model: "claude-sonnet-4-6"
treatments:
control:
name: baseline
variations:
- name: opus-model
model: "claude-opus-4-6"Defaults
Suite-level defaults inherited by all evals. Individual evals can override any default.
version: 1
defaults:
model: "claude-sonnet-4-6"
samples: 5
timeout: 120
runner: claude-code
runner_config:
max_turns: 10
evals:
- id: uses-defaults
prompt: "Create a file called greeting.txt with a friendly greeting."
treatments:
control:
name: baseline
- id: overrides-defaults
prompt: "Write a script that prints the current date."
samples: 3 # overrides suite default
timeout: 60 # overrides suite default
model: "claude-opus-4-6"
treatments:
control:
name: baselineFile References
Split eval definitions into separate files to keep large suites organized.
version: 1
defaults:
model: "claude-sonnet-4-6"
samples: 2
evals:
- file: evals/string-reverse.yaml
- file: evals/fibonacci.yamlCorrectness Verification
Every verification mode in one suite: compiles, agent_exits_ok, output, script, state, and judge.
evals:
# Verify compilation
- id: compiles-check
prompt: "Write a Go program in main.go that prints 'hello'."
correctness:
compiles: "go build ./..."
# Verify the agent exits successfully
- id: agent-exits-ok-check
prompt: "Write a shell script run.sh that exits with code 0."
correctness:
agent_exits_ok: true
# Verify expected strings in output
- id: expected-output-check
prompt: "Write a script that prints 'PASS: all tests green'."
correctness:
output:
contains:
- "PASS"
- "all tests green"
# Verify with a custom script
- id: script-check
prompt: "Create output.txt containing exactly 'hello world'."
correctness:
script: "./scripts/verify-output.sh"
# Verify HTTP state
- id: state-check
prompt: "Start a web server on port 8080 that responds to GET /health with 'ok'."
correctness:
state:
- url: "http://localhost:8080/health"
method: GET
expect: "ok"
# Verify with an LLM judge
- id: judge-check
prompt: "Write a README.md explaining how to set up a Go project."
correctness:
judge:
- "Does the README explain how to initialize a Go module?"
- "Does it include instructions for running tests?"
# Combine multiple checks
- id: combined-check
prompt: "Write calc.py that reads two numbers and prints their sum."
correctness:
agent_exits_ok: true
output:
contains:
- "Result:"
script: "./scripts/verify-calc.sh"Setup Hooks
Lifecycle hooks for fixture creation and cleanup: before runs once at the start, reset runs between samples, after runs once at the end.
evals:
- id: with-hooks
prompt: "Read input.txt and write its contents reversed to output.txt."
isolate: true
setup:
before: |
echo "dlrow olleh" > input.txt
reset: |
rm -f output.txt
after: |
rm -f input.txt output.txt
correctness:
agent_exits_ok: trueComplexity Levels
Tag evals by difficulty and adjust sample counts and timeouts accordingly.
evals:
- id: low-complexity
prompt: "Create a file called hello.txt containing 'hello'."
complexity: low
samples: 5
timeout: 30
- id: medium-complexity
prompt: "Write a Python Flask app with a GET /users endpoint."
complexity: medium
samples: 3
timeout: 120
- id: high-complexity
prompt: "Build a complete TODO app with SQLite, CRUD, and validation."
complexity: high
samples: 2
timeout: 300Multiple Treatments
Compare a baseline control against multiple variations with different models, skills, environment variables, or runner configs.
evals:
- id: sort-algorithm
prompt: "Write sort.py that reads integers from stdin and prints them sorted."
model: "claude-sonnet-4-6"
treatments:
control:
name: baseline
variations:
- name: with-skill
skill: "./skills/python-best-practices.md"
- name: opus-model
model: "claude-opus-4-6"
- name: with-env
env:
STYLE: "functional"
- name: custom-runner-config
runner_config:
max_turns: 5
allowed_tools: [Read, Write, Bash]Multi-Runner
Compare different runners (claude-code, codex, aider) in the same suite.
evals:
- id: cross-runner
prompt: "Write primes.py that prints all primes less than 100."
correctness:
agent_exits_ok: true
treatments:
control:
name: claude-code
model: "claude-sonnet-4-6"
runner: claude-code
variations:
- name: codex
model: "gpt-4.1"
runner: codex
- name: aider
model: "claude-sonnet-4-6"
runner: aiderMatrix Comparison
Use matrix instead of treatments to generate cross-product combinations from multiple dimensions.
evals:
- id: hello-world
prompt: "Write hello.sh that prints 'Hello, World!' to stdout."
correctness:
script: "./verify.sh"
matrix:
dimensions:
- name: runner
values:
- label: claude-code
runner: claude-code
- label: codex
runner: codex
- name: model
values:
- label: opus
model: claude-opus-4-6
- label: sonnet
model: claude-sonnet-4-6This generates four treatments: claude-code × opus, claude-code × sonnet, codex × opus, codex × sonnet.
Skillset Comparison
Compare no skill vs. a single skill vs. a composed skillset using skills (plural).
evals:
- id: fizzbuzz-skillset
prompt: "Write fizzbuzz.sh that prints FizzBuzz for 1-20."
treatments:
control:
name: "baseline"
variations:
- name: "shell-only"
skill: "./skills/shell-best-practices.md"
- name: "shell-and-testing"
skills:
- "./skills/shell-best-practices.md"
- "./skills/testing-guidelines.md"Runner Config Precedence
Runner config merges at three levels: defaults → eval → treatment. Each level overrides the one above.
defaults:
runner: claude-code
runner_config:
max_turns: 20
allowed_tools: [Read, Write, Bash, Glob, Grep]
evals:
- id: eval-override
prompt: "Write a test suite for a calculator module."
runner_config:
max_turns: 30 # overrides default
permission_mode: "plan" # new key, merged with defaults
- id: treatment-override
prompt: "Refactor the utils module."
runner_config:
max_turns: 25
treatments:
control:
name: baseline
variations:
- name: restricted
runner_config:
max_turns: 10 # overrides eval's 25
allowed_tools: [Read, Edit]Per-Treatment Config
Override prompts and config_dir per treatment for full control over Claude Code settings, hooks, and MCP configuration.
evals:
- id: prompt-comparison
treatments:
control:
name: baseline
prompt: "Write a function that checks if a string is a palindrome."
variations:
- name: with-tests
prompt: "Write a palindrome checker with comprehensive pytest tests."
- id: config-comparison
prompt: "List the files in the current directory"
treatments:
control:
name: strict-config
config_dir: "./configs/strict"
variations:
- name: permissive-config
config_dir: "./configs/permissive"FizzBuzz
A complete real-world example with a skill file and verification script.
version: 1
description: "FizzBuzz benchmark — compare baseline vs shell best-practices skill"
defaults:
model: "claude-sonnet-4-6"
samples: 3
timeout: 60
evals:
- id: fizzbuzz-basic
prompt: |
Write fizzbuzz.sh that prints FizzBuzz output for numbers 1 through 20.
Rules: "Fizz" for multiples of 3, "Buzz" for 5,
"FizzBuzz" for both, the number otherwise.
complexity: low
setup:
reset: "rm -f fizzbuzz.sh"
correctness:
agent_exits_ok: true
script: "./verify.sh"
treatments:
control:
name: "baseline"
variations:
- name: "with-shell-skill"
skill: "./skills/shell-best-practices.md"