Testing Overview
Mimic automatically generates test scenarios grounded in the data it creates. Every testable fact — an overdue invoice, a spending anomaly, a missing record — becomes a scenario with natural-language input and concrete assertions. No prompt templates, no hand-written test data.
The pipeline has three stages:
- **Facts** — During `mimic run`, the LLM generates a set of testable facts alongside the persona data. These are written to `.mimic/fact-manifest.json`.
- **Scenarios** — During `mimic test`, a single LLM call converts facts into test scenarios with natural questions and data-specific assertions.
- **Export** — Scenarios can be exported to external eval platforms (PromptFoo, Braintrust, LangSmith, Inspect AI) or Mimic's own format.
Facts & the Fact Manifest
A fact is a structured, testable statement about the generated data. Facts are created by the LLM during blueprint generation and describe anomalies, trends, risks, and integrity issues that an AI agent should be able to reason about.
```json
{
  "persona": "growth-saas",
  "domain": "Multi-platform SaaS billing",
  "facts": [
    {
      "id": "fact_001",
      "type": "overdue",
      "platform": "chargebee",
      "severity": "critical",
      "detail": "3 overdue invoices totalling £12,400. Oldest is 34 days overdue.",
      "data": {
        "count": 3,
        "total_gbp": 12400,
        "oldest_days_overdue": 34
      }
    }
  ]
}
```
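To show how downstream tooling might consume this manifest, here is a minimal sketch that filters facts by severity. The manifest is inlined to keep the example self-contained; in practice it would be read from `.mimic/fact-manifest.json`, and `facts_by_severity` is an illustrative helper, not part of Mimic.

```python
import json

# Inline manifest matching the shape shown above; normally this would
# be loaded from .mimic/fact-manifest.json after `mimic run`.
MANIFEST = json.loads("""
{
  "persona": "growth-saas",
  "facts": [
    {"id": "fact_001", "type": "overdue", "severity": "critical",
     "detail": "3 overdue invoices totalling £12,400.",
     "data": {"count": 3, "total_gbp": 12400}}
  ]
}
""")

def facts_by_severity(manifest: dict, severity: str) -> list[dict]:
    """Return all facts with the given severity level."""
    return [f for f in manifest["facts"] if f["severity"] == severity]

critical = facts_by_severity(MANIFEST, "critical")
print([f["id"] for f in critical])  # → ['fact_001']
```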
Fact types
| Type | Description | Example |
|---|---|---|
| `anomaly` | Unexpected deviation from normal patterns | Mobile MRR down 23% due to App Store outage |
| `overdue` | Items past their due date | 3 invoices totalling £12,400 overdue |
| `pending` | Items awaiting settlement or completion | £8,400 direct debit pending bank settlement |
| `integrity` | Data consistency issues across systems | 34 users with paid flags but no billing record |
| `growth` | Notable growth trends or patterns | EU segment up 31% MoM driven by German market |
| `risk` | Churn risk or other business risks | 14 Pro customers inactive for 30+ days |
Severity levels
Each fact has a severity that maps to a scenario tier:
| Severity | Scenario Tier | Max Latency | Purpose |
|---|---|---|---|
| `info` | smoke | 10s | Agent surfaces basic information correctly |
| `warn` | functional | 20s | Agent handles nuanced or multi-step queries |
| `critical` | adversarial | 15s | Agent handles tricky edge cases without hallucinating |
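The mapping in the table above can be expressed as a small lookup, e.g. for tooling that post-processes scenarios. This is a sketch only; the names are illustrative, not Mimic internals.

```python
# Severity → (scenario tier, max latency in ms), per the table above.
# Illustrative helper names, not part of Mimic itself.
SEVERITY_TIERS = {
    "info": ("smoke", 10_000),
    "warn": ("functional", 20_000),
    "critical": ("adversarial", 15_000),
}

def tier_for(severity: str) -> str:
    """Return the scenario tier a fact of this severity maps to."""
    tier, _max_latency_ms = SEVERITY_TIERS[severity]
    return tier

print(tier_for("critical"))  # → adversarial
```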
Auto-Scenario Generation
When `auto_scenarios: true` is set in `mimic.json`, `mimic test` reads the fact manifest and sends all facts to the LLM in a single batched call. The LLM generates one scenario per fact, each with:
- A natural-language question a user would realistically ask
- `response_contains` assertions using specific values from the fact data (numbers, IDs, dates)
- `response_excludes` hallucination guards — phrases the agent must not say
- `numeric_range` assertions with ±10% tolerance for numeric facts
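For numeric facts, the ±10% tolerance band is a direct computation from the fact value. A sketch, using the £12,400 overdue total from the fact manifest (the helper name is illustrative):

```python
def numeric_range(value: float, tolerance: float = 0.10) -> dict:
    """Build a ±tolerance min/max band for a numeric_range assertion."""
    return {
        "min": round(value * (1 - tolerance)),
        "max": round(value * (1 + tolerance)),
    }

# The £12,400 overdue total yields the band used in the examples below.
print(numeric_range(12400))  # → {'min': 11160, 'max': 13640}
```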
This is adapter-agnostic — the LLM reads each fact’s detail field and generates appropriate questions regardless of whether the data comes from Stripe, a Postgres database, or a future adapter.
```sh
# Enable in mimic.json:
# "test": { "agent": "...", "auto_scenarios": true }

# Then run:
$ mimic run   # generates data + fact manifest
$ mimic test  # generates scenarios from facts, then runs them
```
Filtering by tier
Use `--tier` to limit which scenarios are generated:

```sh
# Only smoke tests (info-severity facts)
$ mimic test --tier smoke

# Smoke + functional (skip adversarial)
$ mimic test --tier smoke functional
```
Or set it in the config with `scenario_tiers`:

```json
"test": {
  "agent": "http://localhost:3000/chat",
  "auto_scenarios": true,
  "scenario_tiers": ["smoke", "functional"]
}
```
Exporting Scenarios
Auto-generated scenarios can be exported to external eval platforms or Mimic's own format using `--export`:

```sh
$ mimic test --export promptfoo   # PromptFoo YAML config
$ mimic test --export braintrust  # Braintrust dataset + scorer
$ mimic test --export langsmith   # LangSmith dataset + evaluator
$ mimic test --export mimic       # Mimic native JSON
$ mimic test --inspect            # Inspect AI Python task
```
All exported files are written to `.mimic/exports/`. If manual scenarios are defined in `mimic.json`, they are also run after the export.
mimic (native format)
Exports scenarios as a JSON array matching the `test.scenarios` shape in `mimic.json`. You can paste these directly into your config or load them as a standalone file.
```json
[
  {
    "name": "chargebee-overdue-critical-invoices",
    "persona": "growth-saas",
    "goal": "Agent surfaces the 34-day overdue invoice as highest priority",
    "input": "What overdue invoices do we have in Chargebee?",
    "expect": {
      "response_contains": ["£12,400", "34 days", "inv_p1_cb_overdue_001"],
      "response_excludes": ["no overdue invoices", "all paid"],
      "numeric_range": { "field": "total_overdue_gbp", "min": 11160, "max": 13640 },
      "max_latency_ms": 15000
    },
    "metadata": {
      "tier": "adversarial",
      "source_fact": "fact_001",
      "platform": "chargebee"
    }
  }
]
```
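To illustrate how the assertions in an `expect` block apply to an agent's reply, here is a sketch of a checker. This is not Mimic's actual runner; `check_expectations` is a hypothetical helper written against the scenario shape shown above.

```python
import re

def check_expectations(response: str, expect: dict) -> list[str]:
    """Return a list of failed assertion descriptions (empty list = pass)."""
    failures = []
    for phrase in expect.get("response_contains", []):
        if phrase not in response:
            failures.append(f"missing: {phrase!r}")
    for phrase in expect.get("response_excludes", []):
        if phrase in response:
            failures.append(f"forbidden: {phrase!r}")
    nr = expect.get("numeric_range")
    if nr:
        # Pull numbers like "12,400" out of the response and strip commas.
        tokens = re.findall(r"[\d,]+\.?\d*", response)
        nums = [float(t.replace(",", "")) for t in tokens if t.replace(",", "")]
        if not any(nr["min"] <= v <= nr["max"] for v in nums):
            failures.append("no number within numeric_range")
    return failures

expect = {
    "response_contains": ["£12,400", "34 days"],
    "response_excludes": ["no overdue invoices"],
    "numeric_range": {"min": 11160, "max": 13640},
}
response = "You have 3 overdue invoices totalling £12,400; the oldest is 34 days overdue."
print(check_expectations(response, expect))  # → []
```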
PromptFoo
Generates a `promptfooconfig.yaml` with `contains`, `not-contains`, and `javascript` assertions, ready to run with `npx promptfoo eval`.
```yaml
tests:
  - description: "chargebee-overdue-critical-invoices [adversarial]"
    vars:
      question: "What overdue invoices do we have in Chargebee?"
    assert:
      - type: contains
        value: "£12,400"
      - type: not-contains
        value: "no overdue invoices"
      - type: javascript
        value: |
          const nums = output.match(/[\d,]+\.?\d*/g) || [];
          return nums.some(n => {
            const v = parseFloat(n.replace(/,/g, ''));
            return v >= 11160 && v <= 13640;
          });
```
Braintrust
Generates a `braintrust-dataset.jsonl` (one JSON object per line) and a `braintrust-scorer.ts` TypeScript scorer file for use with the Braintrust eval framework.
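Each line of the JSONL file holds one scenario as a standalone JSON object. As a rough illustration only — the exact field names Mimic emits are not shown in this doc — a line might pair the scenario input with its expected assertions:

```json
{"input": "What overdue invoices do we have in Chargebee?", "expected": {"response_contains": ["£12,400", "34 days"]}, "metadata": {"tier": "adversarial", "source_fact": "fact_001"}}
```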
LangSmith
Generates three files:

- `langsmith-dataset.json` — the dataset definition
- `langsmith-upload.ts` — script to upload the dataset to LangSmith
- `langsmith-evaluator.ts` — evaluator functions for each assertion type
Inspect AI
Generates a self-contained `inspect_task.py` Python file with an inline dataset and custom scorer. Run it with `inspect eval inspect_task.py`.
Configuration Reference
All auto-scenario settings live in the `test` block of `mimic.json`:

```json
"test": {
  "agent": "http://localhost:3000/chat",
  "auto_scenarios": true,
  "scenario_tiers": ["smoke", "functional", "adversarial"],
  "export": "promptfoo",
  "scenarios": [
    // manual scenarios are merged with auto-generated ones
  ]
}
```
| Field | Type | Default | Description |
|---|---|---|---|
| `auto_scenarios` | boolean | `false` | Enable auto-scenario generation from the fact manifest |
| `scenario_tiers` | array | all tiers | Limit to `"smoke"`, `"functional"`, and/or `"adversarial"` |
| `export` | string | — | Default export format: `"mimic"`, `"promptfoo"`, `"braintrust"`, `"langsmith"`, `"inspect"` |
CLI flags (`--tier`, `--export`, `--inspect`) override the config values.
End-to-End Example
A complete workflow using the CFO Agent example:
```sh
# 1. Generate data with facts
$ mimic run
# → .mimic/data/growth-saas.json
# → .mimic/fact-manifest.json (11 facts)

# 2. Seed databases
$ mimic seed

# 3. Start mock servers
$ mimic host

# 4. Export auto-generated scenarios to PromptFoo
$ mimic test --export promptfoo
# → .mimic/exports/promptfooconfig.yaml

# 5. Or run scenarios directly against the agent
$ mimic test --ci
# → runs 11 auto + 2 manual scenarios
# → exit code 1 if any fail
```
Use `mimic test --export mimic --ci` in your pipeline to both export scenarios for review and fail the build if the agent doesn't pass.
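In a CI pipeline, that might look like the following GitHub Actions sketch. This is illustrative only: how Mimic is installed, and whether `mimic host` needs to run in the background, are assumptions to adapt to your setup.

```yaml
# Illustrative GitHub Actions job; adjust the install and host steps
# to however Mimic is distributed and run in your environment.
jobs:
  agent-evals:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run agent evals
        run: |
          mimic run                       # generate data + fact manifest
          mimic seed                      # seed databases
          mimic host &                    # start mock servers in the background
          mimic test --export mimic --ci  # export scenarios; exit 1 on failure
```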