Fixture Development Process
Related: Fixture format · agentcarousel.com
Templates: templates/fixture-skeleton.yaml · templates/fixture-bundle.manifest.json · Tag examples
Purpose
This document describes a process for authoring agentcarousel fixtures that are correct, reproducible, and ready for submission. The emphasis is on speed with guardrails: most standard-tier fixtures should go from blank intake to passing CI in under two hours.
Skip ahead to the Quick-Start Checklist if you are iterating on an existing fixture.
Tag vocabulary
Use these tags in new fixtures:
- `smoke` (fast PR gate)
- `happy-path` (core success scenario)
- `error-handling`
- `edge-case`
- `certification`
- `deferred`
Roles
| Role | Responsibility |
|---|---|
| Author | Designs scenarios, writes YAML, owns mocks, runs validate + test locally |
| Reviewer | Peer-checks correctness and compliance; domain reviewer for certification track |
| Operator | Ingests evidence packs, tracks trust state, assigns auditors |
| Domain Auditor | Human expert who reviews Stable agents before Trusted attestation |
Workflow
Phase 1: Intake (15–30 minutes)
Complete the fixture proposal checklist in CONTRIBUTING.md before writing any YAML. Doing so prevents scope creep and surfaces data sensitivity issues early.
Minimum questions to answer:
| Question | Why it matters |
|---|---|
| What is the `skill_or_agent` id? | Determines the fixture file name and case id prefix |
| What user goal does this scenario test? | One clear goal per fixture file; edge cases are separate `cases` entries |
| Are any tool calls required? | Drives `expected.tool_sequence` design |
| What is the risk tier? (low / medium / high) | High-risk fixtures require a domain reviewer at review phase |
| Is input data synthetic? | Real PII in fixtures is never acceptable; synthetic data required |
Intake gate: Do not proceed to Phase 2 until the checklist is complete and the Author has confirmed: (a) no PII in inputs, (b) scope is bounded, (c) mocks can be written without live network calls.
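As a rough aid for gate item (a), the sketch below flags strings in fixture inputs that look like PII so a human can review them. The patterns, the function name `find_pii_candidates`, and the allowlist of reserved example domains are illustrative assumptions, not part of agentcarousel; a heuristic like this will miss many kinds of PII and does not replace the Author's confirmation.

```python
import re

# Heuristic patterns only; these are illustrative assumptions, not an
# agentcarousel feature, and they will miss many kinds of PII.
PII_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

# Addresses on reserved example domains are treated as synthetic.
SYNTHETIC_EMAIL_SUFFIXES = ("@example.com", "@example.org", "@example.net")


def find_pii_candidates(text):
    """Return (kind, match) pairs that a human should review before intake passes."""
    hits = []
    for kind, pattern in PII_PATTERNS.items():
        for match in pattern.findall(text):
            if kind == "email" and match.lower().endswith(SYNTHETIC_EMAIL_SUFFIXES):
                continue  # clearly synthetic address, skip it
            hits.append((kind, match))
    return hits
```

An empty result means only that none of these particular patterns matched.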
Phase 2: Scenario Design (30–60 minutes)
Design one primary use case per fixture file. A primary use case is the happy-path: the most important thing the skill/agent should do when everything works correctly.
Add edge cases as separate cases entries within the same file, not as separate files, unless the edge case covers a substantially different workflow.
Scenario structure pattern:
fixture file
├── case: happy-path (smoke, happy-path tags)
├── case: edge-case-A (edge-case tag)
├── case: edge-case-B (edge-case tag)
└── case: failure-mode (error-handling tag)
Pairing rule: Always author the happy-path and at least one failure-mode case together in Phase 2. The failure-mode case almost always reveals mock gaps that are cheaper to fix before the mocks are written than after.
Phase 3: Author (time varies by complexity)
Use agentcarousel init or the annotated template as your starting point. Never write a fixture from a blank file.
# Scaffold from init, then replace with your values:
agentcarousel init --skill summarize-skill > fixtures/skill-summarize.yaml
# Or copy the annotated template:
cp templates/fixture-skeleton.yaml fixtures/my-new-fixture.yaml
Author checklist (run through before Phase 4):
- [ ] Every `case` has an `id` in `<fixture-stem>/<case-name>` format
- [ ] Every `case` has a `description` (not optional, even though schema allows it)
- [ ] Every `case` has at least one `tag`; happy-path cases include `smoke`
- [ ] `expected.tool_sequence` is present even when empty (`[]`), which makes intent explicit
- [ ] At least one `output` assertion per case
- [ ] Every rubric item has a `weight`; weights sum to `1.0` across rubric items
- [ ] Rubric items that cannot be auto-checked are documented with a comment explaining what a judge or human reviewer should look for
- [ ] Mock files referenced by `--mock-dir` cover every tool call in `expected.tool_sequence`
- [ ] `timeout_secs` is set to a realistic upper bound (not the default, not 999)
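Several items in the checklist above are mechanical and could be pre-checked by a small script before running `agentcarousel validate`. The following is a minimal sketch that operates on an already-parsed fixture (a plain dict). The key names `cases`, `tags`, and `rubric`, and the function name `lint_fixture`, are assumptions inferred from the checklist wording; check them against the real fixture schema before relying on this.

```python
import math


def lint_fixture(fixture, fixture_stem):
    """Return a list of human-readable checklist violations for a parsed fixture.

    Key names ("cases", "tags", "rubric", ...) are assumptions, not the
    confirmed agentcarousel schema.
    """
    problems = []
    prefix = fixture_stem + "/"
    for case in fixture.get("cases", []):
        case_id = case.get("id", "<missing id>")
        # id must be "<fixture-stem>/<case-name>" with a non-empty case name
        if not (case_id.startswith(prefix) and len(case_id) > len(prefix)):
            problems.append(f"{case_id}: id must look like {prefix}<case-name>")
        if not case.get("description"):
            problems.append(f"{case_id}: description is missing")
        if not case.get("tags"):
            problems.append(f"{case_id}: at least one tag is required")
        expected = case.get("expected", {})
        if "tool_sequence" not in expected:
            problems.append(f"{case_id}: expected.tool_sequence must be present, even if []")
        if not expected.get("output"):
            problems.append(f"{case_id}: at least one output assertion is required")
        rubric = case.get("rubric", [])
        if rubric:
            if any("weight" not in item for item in rubric):
                problems.append(f"{case_id}: every rubric item needs a weight")
            else:
                total = sum(item["weight"] for item in rubric)
                # compare with a tolerance to avoid float rounding noise
                if not math.isclose(total, 1.0, abs_tol=1e-9):
                    problems.append(f"{case_id}: rubric weights sum to {total}, not 1.0")
    return problems
```

An empty list means the mechanical items pass. The judgment items (realistic timeouts, documented judge-only rubric items, mock coverage) still need a human pass.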
Evaluator selection hierarchy (use the first that works; escalate only when needed):
1. `rules`: exact match, regex, JSON path, tool sequence count. Free, deterministic, fast.
2. `golden`: diff against a known-good output file. Use when the output format is stable and you have a reference output.
3. `process`: external script (Python, JS). Use when you need custom logic that doesn't fit rules/golden.
4. `judge`: LLM-as-judge. Use only for rubric items that genuinely require language understanding and cannot be expressed as any of the above. An LLM judge adds cost, variance, and an API dependency; minimize its use.
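The hierarchy above amounts to a first-match decision. A minimal sketch of that decision follows; the function name and its three boolean parameters are hypothetical, chosen only to mirror the wording of the list.

```python
def choose_evaluator(expressible_as_rule, has_stable_reference_output, checkable_by_script):
    """Pick the first evaluator that fits, in the order rules, golden, process, judge."""
    if expressible_as_rule:
        return "rules"      # exact match, regex, JSON path, tool sequence count
    if has_stable_reference_output:
        return "golden"     # diff against a known-good output file
    if checkable_by_script:
        return "process"    # custom logic in an external script
    return "judge"          # last resort: LLM-as-judge
```

For example, a rubric item that can be written as a regex resolves to `rules` even when a reference output also exists, which keeps the cheaper and deterministic option in front.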
Phase 4: Self-Check (10–20 minutes)
Run all three checks locally before requesting a review, then inspect the resulting run. Do not skip --mock-strict.
# 1. Schema validation — must exit 0
agentcarousel validate fixtures/my-new-fixture.yaml --strict
# 2. Offline test with strict mock enforcement
agentcarousel test fixtures/my-new-fixture.yaml \
--offline \
--mock-dir mocks/ \
--mock-strict
# 3. Eval pass (if rubric items exist)
agentcarousel eval fixtures/my-new-fixture.yaml \
--mock-dir mocks/ \
--offline
# 4. Inspect the run
agentcarousel report show $(agentcarousel report list --limit 1 --json | jq -r '.[0].id')
Common self-check failures and fixes:
| Symptom | Likely cause | Fix |
|---|---|---|
| `validate` exit 2, `missing field: expected` | Forgot `expected:` block | Add `expected: {tool_sequence: [], output: []}` minimum |
| `test` exit 1, `tool call not matched` | Mock args don't match `args_match` partial spec | Check mock file field names; `args_match` is a partial JSON match |
| `test` exit 4, `timeout` | `timeout_secs` too low for mock latency | Increase `timeout_secs`; check mock response time |
| `eval` exit 1, effectiveness score below threshold | Rubric weights don't reflect actual pass criteria | Adjust `weight` or tighten `auto_check` assertion |
| Flaky: passes sometimes, fails sometimes | Non-deterministic assertion (e.g., regex too loose) | Tighten regex; add seed; use --runs 3 to surface flakiness |
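The `tool call not matched` row depends on what "partial JSON match" means. One common reading is a recursive subset check: every key in the spec must appear in the actual arguments with a matching value, and extra keys in the actual arguments are ignored. The sketch below illustrates that reading only; agentcarousel's real `args_match` semantics may differ (for example in how lists are handled), and `partial_match` is a hypothetical helper name.

```python
def partial_match(spec, actual):
    """Return True if every key/value in spec also appears in actual (recursively)."""
    if isinstance(spec, dict):
        if not isinstance(actual, dict):
            return False
        return all(
            key in actual and partial_match(value, actual[key])
            for key, value in spec.items()
        )
    # Lists and scalars are compared for exact equality in this sketch.
    return spec == actual
```

Under this reading, a spec of `{"query": "cats"}` matches a call made with `{"query": "cats", "limit": 10}`, while a misspelled field name such as `{"q": "cats"}` does not.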
If --mock-strict causes failures because a tool call is unmocked, do not disable --mock-strict. Add the missing mock.
Phase 5: Peer Review (async)
Open a PR or share the fixture file with the reviewer. Use the review checklist below as the PR description template.
Standard review checklist:
## Fixture Review Checklist
**Fixture file:** `fixtures/<name>.yaml`
**Author:** @...
**Reviewer:** @...
### Correctness
- [ ] Case ids are unique and follow `<stem>/<name>` convention
- [ ] Tool sequence expectations match the described behavior
- [ ] Output assertions are necessary (not over-fitted to one specific wording)
- [ ] Edge-case inputs are realistic; not constructed to trivially pass
- [ ] Mocks are plausible; mock responses are not simplified to the point of hiding real failure modes
### Completeness
- [ ] Positive case present
- [ ] At least one failure-mode or edge case present
- [ ] All rubric weights sum to 1.0 per case
- [ ] `description` present on every case
### Safety & data
- [ ] No PII, credentials, or real API keys in fixture inputs, mock responses, or expected outputs
- [ ] `--offline` passes; no undeclared network calls
- [ ] Mocks committed with fixture; fixture does not depend on external state
### For certification track only
- [ ] `bundle_id` and `bundle_version` set in manifest
- [ ] `certification_track: candidate` in manifest
- [ ] `risk_tier` and `data_handling` set correctly
- [ ] Second reviewer (domain expert) has signed off
- [ ] Flake budget: ran `agentcarousel eval --runs 5` locally with 0 failures
The certification track adds one requirement: a domain reviewer must be assigned before the PR is merged. The domain reviewer verifies that the scenarios are realistic for the skill's stated domain and that the rubric items correctly represent quality in that domain. This review is not the same as the formal agentcarousel audit.
Phase 6: Freeze and Bundle Version Bump
After review is approved and CI is green:
- If the fixture file is part of a bundle, update `bundle_version` in `bundle.manifest.json`:
  - Patch bump (`1.2.0` → `1.2.1`): description/comment changes only.
  - Minor bump (`1.2.0` → `1.3.0`): new cases added that do not remove existing cases.
  - Major bump (`1.2.0` → `2.0.0`): cases removed, case ids renamed, or existing assertions made stricter. Major bumps reset the carousel iteration counter to 0, so the agent must re-earn Stable status.
- Recompute `sha256` entries in `bundle.manifest.json` (or run `agentcarousel bundle pack` in M3+).
- Tag the commit if this is a bundle submission to the AGC registry.
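The bump rules and the digest recomputation above can be sketched in a few lines. This is an illustration, not the behavior of `agentcarousel bundle pack`: the helper names are hypothetical, and how digests are keyed inside `bundle.manifest.json` is not specified here, so the sketch only shows how to compute a new version string and a file digest.

```python
import hashlib


def bump_version(version, change):
    """Bump a MAJOR.MINOR.PATCH string according to the kind of fixture change."""
    major, minor, patch = (int(part) for part in version.split("."))
    if change == "major":   # cases removed, ids renamed, or assertions made stricter
        return f"{major + 1}.0.0"
    if change == "minor":   # new cases added, none removed
        return f"{major}.{minor + 1}.0"
    if change == "patch":   # description/comment changes only
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown change type: {change!r}")


def sha256_hex(data):
    """Return the hex SHA-256 digest of a bytes object, e.g. a fixture file's contents."""
    return hashlib.sha256(data).hexdigest()
```

To hash a fixture, read it in binary mode (for example `pathlib.Path(path).read_bytes()`) so that line-ending translation does not change the digest.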
Definition of Done
A fixture is done when all of the following are true:
- [ ] `agentcarousel validate fixtures/<name>.yaml --strict` exits 0
- [ ] `agentcarousel test fixtures/<name>.yaml --offline --mock-dir mocks/ --mock-strict` exits 0
- [ ] JSON or JUnit XML output is parseable by CI (run once in the pipeline)
- [ ] Every case has `description` and at least one tag
- [ ] PR reviewed and approved (self-review acceptable for standard tier)
Certification
All Standard items, plus:
- [ ] Bundle manifest (`bundle.manifest.json`) present, valid, and up-to-date
- [ ] `bundle_id`, `bundle_version`, `certification_track: candidate`, `risk_tier`, `data_handling` all set
- [ ] `agentcarousel eval fixtures/<name>.yaml --offline --runs 5 --mock-dir mocks/` exits 0 with all 5 runs passing
- [ ] Effectiveness score ≥ `effectiveness_threshold` across all 5 runs
- [ ] 0 flakes across 5 local eval runs (no intermittent failures)
- [ ] Domain reviewer has approved (second reviewer, separate from Author)
- [ ] `owners` list in manifest includes at least one GitHub handle
- [ ] `policy_version` matches current AGC policy document version
- [ ] Commit sha for this bundle version recorded in PR description
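The two run-based items above (score at or above `effectiveness_threshold` on every run, and zero flakes across 5 runs) reduce to a simple predicate over per-run scores. The sketch below assumes you have already collected one effectiveness score per run. How to extract those scores from a report is not covered here, and the function name is hypothetical.

```python
def passes_certification_runs(scores, effectiveness_threshold, required_runs=5):
    """Return True only if there are enough runs and every run meets the threshold."""
    if len(scores) < required_runs:
        return False  # too few runs to claim a zero-flake result
    # A single run below the threshold counts as a flake and fails the gate.
    return all(score >= effectiveness_threshold for score in scores)
```

The gate deliberately uses `all` instead of an average: a high mean score can hide one intermittent failure, which is exactly what the flake budget is meant to catch.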
Quick-Start Checklist
For authors iterating on an existing fixture (not starting from scratch):
# After making changes:
agentcarousel validate fixtures/my-fixture.yaml --strict
agentcarousel test fixtures/my-fixture.yaml --offline --mock-dir mocks/ --mock-strict
# If rubric changed:
agentcarousel eval fixtures/my-fixture.yaml --offline --mock-dir mocks/ --runs 3
# Check for regressions against previous run:
agentcarousel report diff <PREV_RUN_ID> $(agentcarousel report list --limit 1 --json | jq -r '.[0].id')
If report diff exits 1, investigate which metric degraded before merging.
Expedited Tactics
The following practices dramatically reduce the time from idea to merged fixture:
1. Start from init or the template — never blank YAML.
The template includes all optional fields as comments, preventing forgotten fields during review.
2. Write mocks before assertions.
Draft the mock response first; then write the output assertions against what the mock actually returns. This eliminates the most common test failure: assertions written against an idealized output that doesn't match mock behavior.
3. Use rules evaluator first; escalate to judge only for genuinely ambiguous rubrics.
LLM-judge adds ~1–3 seconds per case invocation plus API cost. Most rubrics are expressible as regex or JSON path. If you find yourself writing a judge for a rubric that could be a regex, rewrite it as a regex.
4. Tag-driven CI reduces full eval to smoke-only on PRs.
Mark edge-case and certification cases with specific tags. Configure CI to run --filter-tags smoke on PRs and full evaluation only on main/nightly. This keeps PR feedback under 30 seconds.
5. Pair happy-path + failure-mode from the start.
Authors who write only the happy-path in Phase 3 and add failure cases later spend 2x as long on Phase 4 because failure cases almost always reveal mock gaps.
6. Keep mock responses minimal but realistic.
A mock that returns an implausibly perfect response will cause assertions to pass locally but fail against a real endpoint. Use realistic (slightly imperfect) responses in mocks: include minor formatting variation, realistic token counts, and occasional tool result delays.
7. Run --mock-strict always.
Any unmocked tool call discovered during review or CI is a fixture authoring error, not a test runner issue. --mock-strict surfaces these early.
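Tactic 4 relies on selecting cases by tag. The sketch below shows the selection idea on parsed cases. It assumes each case carries a `tags` list and that filtering keeps a case when it has at least one of the requested tags. Whether `--filter-tags` uses "any" or "all" semantics is not stated in this document, so treat that choice as an assumption.

```python
def filter_cases_by_tags(cases, wanted_tags):
    """Keep cases that carry at least one of the wanted tags ("any" semantics assumed)."""
    wanted = set(wanted_tags)
    return [case for case in cases if wanted & set(case.get("tags", []))]
```

With this reading, a PR run filtered on `smoke` picks up the happy-path case (tagged `smoke` and `happy-path`) and skips cases tagged only `edge-case` or `certification`.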
Fixture File Naming Convention
fixtures/
├── <domain>/
│ ├── <skill-or-agent-id>.yaml # primary fixture file
│ ├── <skill-or-agent-id>-edge.yaml # edge cases (separate file if many)
│ └── <skill-or-agent-id>-stress.yaml # load/stress cases (optional)
└── examples/ # curated examples (maintained by AGC)
└── *.yaml
Case ids must always start with the fixture file stem:
- File fixtures/text-processing/skill-summarize.yaml → case ids start with skill-summarize/
- File fixtures/search/agent-web-search.yaml → case ids start with agent-web-search/
This is enforced by agentcarousel validate and the schema.
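The stem rule is easy to check before running `agentcarousel validate`. A minimal sketch, with hypothetical helper names, that derives the expected prefix from a fixture path and lists any case ids that do not follow it:

```python
from pathlib import PurePosixPath


def expected_case_prefix(fixture_path):
    """Derive the required case id prefix from the fixture file stem."""
    return PurePosixPath(fixture_path).stem + "/"


def mismatched_case_ids(fixture_path, case_ids):
    """Return the case ids that do not start with the fixture file stem."""
    prefix = expected_case_prefix(fixture_path)
    return [case_id for case_id in case_ids if not case_id.startswith(prefix)]
```

For `fixtures/text-processing/skill-summarize.yaml` the expected prefix is `skill-summarize/`; the domain directory is not part of it.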
Common Mistakes to Avoid
| Mistake | Why it's a problem | Correct practice |
|---|---|---|
| Using real API responses in mocks | Embeds real data, possibly PII; changes over time | Use synthetic data that matches the schema of the real response |
| Setting `timeout_secs: 300` | Masks slow agents; CI takes forever | Set to 1.5× the expected real latency; investigate if exceeded |
| Writing rubric weights that don't sum to 1.0 | `eval` scoring is incorrect; effectiveness score is meaningless | Always sum weights to 1.0 per case |
| Using `kind: equals` for LLM output | LLM output varies by temperature/seed | Use `kind: contains` or `kind: regex`; reserve `equals` for structured/tool outputs |
| Omitting `tool_sequence: []` for skills with no tool calls | Ambiguous intent; reviewers don't know if tool calls were forgotten | Always include `tool_sequence: []` explicitly for zero-tool-call skills |
| Skipping `--mock-strict` in self-check | Hidden unmocked calls discovered in CI | Always run `--mock-strict` locally before requesting review |
| Checking in API keys in `env_overrides` | Security violation; keys may appear in traces | `env_overrides` is for non-secret config only; keys must come from environment |
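To see why the rubric-weight row matters, it helps to work through the arithmetic. The sketch below assumes the effectiveness score is a weighted sum of pass/fail rubric results. This document does not spell out the actual formula, so treat both the formula and the function name as illustrative assumptions.

```python
def effectiveness_score(rubric_results):
    """Weighted sum of rubric outcomes; each item is a (weight, passed) pair.

    Assumed scoring model for illustration only.
    """
    return sum(weight for weight, passed in rubric_results if passed)
```

Under this model, weights of 0.5, 0.3, and 0.2 give a perfect run a score of 1.0. Weights of 0.5 and 0.3 cap a perfect run at 0.8, so a threshold of 0.9 could never be met regardless of agent quality.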