Configuration
Suites are defined in YAML files. This page covers the full configuration schema.
Suite Structure
version: 1
description: "Suite description"
defaults:
model: "claude-sonnet-4-6"
runner: "claude-code"
samples: 3
timeout: 300
evals:
- id: my-eval
# ...Top-Level Fields
| Field | Required | Description |
|---|---|---|
version | Yes | Schema version (currently 1) |
description | No | Human-readable suite description |
defaults | No | Default values inherited by all evals |
ranking | No | Ranking weight configuration (see Ranking) |
evals | Yes | List of evaluations |
Defaults
Defaults are inherited by all evals and can be overridden at the eval or treatment level.
defaults:
model: "claude-sonnet-4-6"
runner: "claude-code"
runner_config:
allowed_tools:
- "Read"
- "Write"
samples: 3
timeout: 300
parallel: 4| Field | Description |
|---|---|
model | Model identifier |
runner | Runner to use (claude-code, ollama) |
runner_config | Runner-specific configuration (deep-merged) |
samples | Number of runs per treatment |
timeout | Timeout in seconds |
parallel | Max concurrent samples per treatment (default: sequential) |
retry | Retry configuration for failed runs (see Retry) |
judge_model | Default model for the judge verifier (default: claude-haiku-4-5-20251001) |
Evals
Each eval defines a task to evaluate.
evals:
- id: fizzbuzz
prompt: "Write a FizzBuzz program in Go"
dir: ./workspace
isolate: true
complexity: medium
timeout: 120
samples: 5
model: "claude-sonnet-4-6"
correctness:
agent_exits_ok: true
setup:
before: "mkdir -p workspace"
reset: "rm -rf workspace/*"
after: "rm -rf workspace"
treatments:
control:
name: baseline
variations:
- name: with-skill
skill: "./skills/go-expert.md"Eval Fields
| Field | Required | Description |
|---|---|---|
id | Yes | Unique identifier for this eval |
name | No | Human-readable display name for this eval |
prompt | Yes | The task prompt sent to the AI agent |
dir | No | Working directory for execution |
isolate | No | Create a temporary copy of dir for each sample |
complexity | No | Metadata: low, medium, or high |
timeout | No | Override default timeout (seconds) |
samples | No | Override default sample count |
parallel | No | Override default max concurrency |
model | No | Override default model |
runner | No | Override default runner |
runner_config | No | Runner-specific config (deep-merged with defaults) |
correctness | No | Verification configuration (see Verifiers) |
setup | No | Lifecycle hooks |
treatments | Yes* | Treatment definitions (*or use matrix) |
matrix | Yes* | Matrix dimensions (*or use treatments) |
Prompt from File
For long prompts, reference an external file:
evals:
- id: complex-task
prompt:
file: ./prompts/complex-task.mdSetup Hooks
setup:
before: "npm install" # Run once before all samples
reset: "git checkout ." # Run before each sample
after: "rm -rf node_modules" # Run once after all samplesTreatments
Treatments define the variants being compared. Every eval needs a control and zero or more variations.
treatments:
control:
name: baseline
variations:
- name: with-skill
skill: "./skills/my-skill.md"
- name: with-skillset
skills:
- "./skills/skill-a.md"
- "./skills/skill-b.md"Treatment Fields
| Field | Required | Description |
|---|---|---|
name | Yes | Unique name for this treatment |
prompt | No | Override the eval prompt |
skill | No | Path to a single skill file |
skills | No | List of skill file paths (concatenated) |
dir | No | Override the eval working directory |
config_dir | No | Sets CLAUDE_CONFIG_DIR environment variable |
model | No | Override the model |
runner | No | Override the runner |
runner_config | No | Runner-specific config (deep-merged with defaults) |
env | No | Environment variables for this treatment |
retry | No | Override retry configuration |
Override Precedence
Treatment > Eval > Defaults
For runner_config, values are deep-merged (treatment keys override, but unset keys are inherited).
Matrix
Use matrix instead of treatments to generate a cartesian product of dimensions. This is useful for cross-cutting comparisons (e.g., model x skill).
evals:
- id: model-comparison
prompt: "Solve the coding challenge"
matrix:
dimensions:
- name: model
values:
- label: sonnet
model: "claude-sonnet-4-6"
- label: haiku
model: "claude-haiku-4-5-20251001"
- name: approach
values:
- label: baseline
- label: with-skill
skill: "./skills/expert.md"This generates four treatments: sonnet-baseline, sonnet-with-skill, haiku-baseline, haiku-with-skill. The first combination becomes the control.
Matrix Value Fields
Each value in a dimension can override any treatment-level field:
| Field | Description |
|---|---|
label | Required. Used to generate the treatment name |
prompt | Override prompt |
model | Override model |
runner | Override runner |
skill / skills | Skill injection |
env | Environment variables |
runner_config | Runner-specific config |
WARNING
matrix and treatments are mutually exclusive within the same eval.
Ranking
Configure how treatments are scored and ranked. The composite score is a weighted sum of normalized correctness, cost, and duration metrics.
ranking:
weights:
correctness: 0.60 # default: 0.60
cost: 0.28 # default: 0.28
duration: 0.12 # default: 0.12| Field | Default | Description |
|---|---|---|
weights.correctness | 0.60 | Weight for pass rate (higher is better) |
weights.cost | 0.28 | Weight for median cost (lower is better) |
weights.duration | 0.12 | Weight for median duration (lower is better) |
All weights must be >= 0 and must sum to 1.0. When the ranking section is omitted, the default weights apply.
Retry
Configure retry behavior for failed sample runs. By default, each sample runs once with no retries.
defaults:
retry:
max_attempts: 3
backoff: exponential
delay: 2s
on: transient| Field | Default | Description |
|---|---|---|
max_attempts | 1 | Total attempts including the first. 1 means no retries |
backoff | fixed | fixed or exponential. Exponential doubles the delay each attempt |
delay | 2s | Base delay between retries (Go duration: 500ms, 2s, 1m) |
on | transient | transient or all. Controls which failures trigger retries |
Retry Modes
transient— Retry on runner errors, timeouts, and network failures. Don't retry if the agent ran successfully but produced incorrect output.all— Retry on any non-pass outcome, including correctness failures. Useful for flaky evals or giving the agent another chance.
Backoff
fixed— Waits the samedelaybetween each attempt (with ±25% jitter).exponential— Doubles the delay each attempt:delay,2×delay,4×delay, etc. (with ±25% jitter).
Result Selection
When multiple attempts are made, the best result is kept (not the last). Pass beats fail, fail beats error, and lower cost breaks ties.
Override Precedence
Retry config inherits via the standard precedence: treatment > eval > defaults.
defaults:
retry:
max_attempts: 2
evals:
- id: flaky-eval
retry:
max_attempts: 5 # overrides defaults for this eval
on: all
treatments:
control:
name: baseline
variations:
- name: experimental
retry:
max_attempts: 3 # overrides eval for this treatmentRunner Configuration
claude-code
runner_config:
allowed_tools:
- "Read"
- "Write"
- "Bash"
disallowed_tools:
- "WebSearch"
mcp_config: "./mcp.json"
max_budget_usd: 1.0ollama
runner: ollama
runner_config:
temperature: 0.7
num_ctx: 4096
num_predict: 2048
top_p: 0.9
top_k: 40
seed: 42
stop: ["\n\n"]
think: trueFile References
Evals can be split into separate files:
evals:
- file: ./evals/fizzbuzz.yaml
- file: ./evals/sorting.yaml
- id: inline-eval
prompt: "..."The referenced YAML files contain the eval definition without the evals wrapper.