Skip to content

Configuration

Suites are defined in YAML files. This page covers the full configuration schema.

Suite Structure

yaml
version: 1
description: "Suite description"
defaults:
  model: "claude-sonnet-4-6"
  runner: "claude-code"
  samples: 3
  timeout: 300
evals:
  - id: my-eval
    # ...

Top-Level Fields

FieldRequiredDescription
versionYesSchema version (currently 1)
descriptionNoHuman-readable suite description
defaultsNoDefault values inherited by all evals
rankingNoRanking weight configuration (see Ranking)
evalsYesList of evaluations

Defaults

Defaults are inherited by all evals and can be overridden at the eval or treatment level.

yaml
defaults:
  model: "claude-sonnet-4-6"
  runner: "claude-code"
  runner_config:
    allowed_tools:
      - "Read"
      - "Write"
  samples: 3
  timeout: 300
  parallel: 4
FieldDescription
modelModel identifier
runnerRunner to use (claude-code, ollama)
runner_configRunner-specific configuration (deep-merged)
samplesNumber of runs per treatment
timeoutTimeout in seconds
parallelMax concurrent samples per treatment (default: sequential)
retryRetry configuration for failed runs (see Retry)
judge_modelDefault model for the judge verifier (default: claude-haiku-4-5-20251001)

Evals

Each eval defines a task to evaluate.

yaml
evals:
  - id: fizzbuzz
    prompt: "Write a FizzBuzz program in Go"
    dir: ./workspace
    isolate: true
    complexity: medium
    timeout: 120
    samples: 5
    model: "claude-sonnet-4-6"
    correctness:
      agent_exits_ok: true
    setup:
      before: "mkdir -p workspace"
      reset: "rm -rf workspace/*"
      after: "rm -rf workspace"
    treatments:
      control:
        name: baseline
      variations:
        - name: with-skill
          skill: "./skills/go-expert.md"

Eval Fields

FieldRequiredDescription
idYesUnique identifier for this eval
nameNoHuman-readable display name for this eval
promptYesThe task prompt sent to the AI agent
dirNoWorking directory for execution
isolateNoCreate a temporary copy of dir for each sample
complexityNoMetadata: low, medium, or high
timeoutNoOverride default timeout (seconds)
samplesNoOverride default sample count
parallelNoOverride default max concurrency
modelNoOverride default model
runnerNoOverride default runner
runner_configNoRunner-specific config (deep-merged with defaults)
correctnessNoVerification configuration (see Verifiers)
setupNoLifecycle hooks
treatmentsYes*Treatment definitions (*or use matrix)
matrixYes*Matrix dimensions (*or use treatments)

Prompt from File

For long prompts, reference an external file:

yaml
evals:
  - id: complex-task
    prompt:
      file: ./prompts/complex-task.md

Setup Hooks

yaml
setup:
  before: "npm install"     # Run once before all samples
  reset: "git checkout ."   # Run before each sample
  after: "rm -rf node_modules"  # Run once after all samples

Treatments

Treatments define the variants being compared. Every eval needs a control and zero or more variations.

yaml
treatments:
  control:
    name: baseline
  variations:
    - name: with-skill
      skill: "./skills/my-skill.md"
    - name: with-skillset
      skills:
        - "./skills/skill-a.md"
        - "./skills/skill-b.md"

Treatment Fields

FieldRequiredDescription
nameYesUnique name for this treatment
promptNoOverride the eval prompt
skillNoPath to a single skill file
skillsNoList of skill file paths (concatenated)
dirNoOverride the eval working directory
config_dirNoSets CLAUDE_CONFIG_DIR environment variable
modelNoOverride the model
runnerNoOverride the runner
runner_configNoRunner-specific config (deep-merged with defaults)
envNoEnvironment variables for this treatment
retryNoOverride retry configuration

Override Precedence

Treatment > Eval > Defaults

For runner_config, values are deep-merged (treatment keys override, but unset keys are inherited).

Matrix

Use matrix instead of treatments to generate a cartesian product of dimensions. This is useful for cross-cutting comparisons (e.g., model x skill).

yaml
evals:
  - id: model-comparison
    prompt: "Solve the coding challenge"
    matrix:
      dimensions:
        - name: model
          values:
            - label: sonnet
              model: "claude-sonnet-4-6"
            - label: haiku
              model: "claude-haiku-4-5-20251001"
        - name: approach
          values:
            - label: baseline
            - label: with-skill
              skill: "./skills/expert.md"

This generates four treatments: sonnet-baseline, sonnet-with-skill, haiku-baseline, haiku-with-skill. The first combination becomes the control.

Matrix Value Fields

Each value in a dimension can override any treatment-level field:

FieldDescription
labelRequired. Used to generate the treatment name
promptOverride prompt
modelOverride model
runnerOverride runner
skill / skillsSkill injection
envEnvironment variables
runner_configRunner-specific config

WARNING

matrix and treatments are mutually exclusive within the same eval.

Ranking

Configure how treatments are scored and ranked. The composite score is a weighted sum of normalized correctness, cost, and duration metrics.

yaml
ranking:
  weights:
    correctness: 0.60    # default: 0.60
    cost: 0.28           # default: 0.28
    duration: 0.12       # default: 0.12
FieldDefaultDescription
weights.correctness0.60Weight for pass rate (higher is better)
weights.cost0.28Weight for median cost (lower is better)
weights.duration0.12Weight for median duration (lower is better)

All weights must be >= 0 and must sum to 1.0. When the ranking section is omitted, the default weights apply.

Retry

Configure retry behavior for failed sample runs. By default, each sample runs once with no retries.

yaml
defaults:
  retry:
    max_attempts: 3
    backoff: exponential
    delay: 2s
    on: transient
FieldDefaultDescription
max_attempts1Total attempts including the first. 1 means no retries
backofffixedfixed or exponential. Exponential doubles the delay each attempt
delay2sBase delay between retries (Go duration: 500ms, 2s, 1m)
ontransienttransient or all. Controls which failures trigger retries

Retry Modes

  • transient — Retry on runner errors, timeouts, and network failures. Don't retry if the agent ran successfully but produced incorrect output.
  • all — Retry on any non-pass outcome, including correctness failures. Useful for flaky evals or giving the agent another chance.

Backoff

  • fixed — Waits the same delay between each attempt (with ±25% jitter).
  • exponential — Doubles the delay each attempt: delay, 2×delay, 4×delay, etc. (with ±25% jitter).

Result Selection

When multiple attempts are made, the best result is kept (not the last). Pass beats fail, fail beats error, and lower cost breaks ties.

Override Precedence

Retry config inherits via the standard precedence: treatment > eval > defaults.

yaml
defaults:
  retry:
    max_attempts: 2
evals:
  - id: flaky-eval
    retry:
      max_attempts: 5        # overrides defaults for this eval
      on: all
    treatments:
      control:
        name: baseline
      variations:
        - name: experimental
          retry:
            max_attempts: 3   # overrides eval for this treatment

Runner Configuration

claude-code

yaml
runner_config:
  allowed_tools:
    - "Read"
    - "Write"
    - "Bash"
  disallowed_tools:
    - "WebSearch"
  mcp_config: "./mcp.json"
  max_budget_usd: 1.0

ollama

yaml
runner: ollama
runner_config:
  temperature: 0.7
  num_ctx: 4096
  num_predict: 2048
  top_p: 0.9
  top_k: 40
  seed: 42
  stop: ["\n\n"]
  think: true

File References

Evals can be split into separate files:

yaml
evals:
  - file: ./evals/fizzbuzz.yaml
  - file: ./evals/sorting.yaml
  - id: inline-eval
    prompt: "..."

The referenced YAML files contain the eval definition without the evals wrapper.