Testing Overview
Mimic automatically generates test scenarios grounded in the data it creates. Every testable fact — an overdue invoice, a spending anomaly, a missing record — becomes a scenario with natural-language input and concrete assertions. No prompt templates, no hand-written test data.
The pipeline has three stages:
- **Facts** — During `mimic run`, the LLM generates a set of testable facts alongside the persona data. These are written to `.mimic/fact-manifest.json`.
- **Scenarios** — During `mimic test`, a single LLM call converts facts into test scenarios with natural questions and data-specific assertions.
- **Export** — Scenarios can be exported to external eval platforms (PromptFoo, Braintrust, LangSmith, Inspect AI) or Mimic's own format.
Facts & the Fact Manifest
A fact is a structured, testable statement about the generated data. Facts are created by the LLM during blueprint generation and describe anomalies, trends, risks, and integrity issues that an AI agent should be able to reason about.
```json
{
  "persona": "growth-saas",
  "domain": "Multi-platform SaaS billing",
  "facts": [
    {
      "id": "fact_001",
      "type": "overdue",
      "platform": "chargebee",
      "severity": "critical",
      "detail": "3 overdue invoices totalling £12,400. Oldest is 34 days overdue.",
      "data": {
        "count": 3,
        "total_gbp": 12400,
        "oldest_days_overdue": 34
      }
    }
  ]
}
```
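To show how downstream tooling might consume this manifest, here is a minimal sketch that filters facts by severity. The manifest is inlined to keep the example self-contained; in practice it would be read from `.mimic/fact-manifest.json`, and `facts_by_severity` is an illustrative helper, not part of Mimic.

```python
import json

# Inline manifest matching the shape shown above; normally this would
# be loaded from .mimic/fact-manifest.json after `mimic run`.
MANIFEST = json.loads("""
{
  "persona": "growth-saas",
  "facts": [
    {"id": "fact_001", "type": "overdue", "severity": "critical",
     "detail": "3 overdue invoices totalling £12,400.",
     "data": {"count": 3, "total_gbp": 12400}}
  ]
}
""")

def facts_by_severity(manifest: dict, severity: str) -> list[dict]:
    """Return all facts with the given severity level."""
    return [f for f in manifest["facts"] if f["severity"] == severity]

critical = facts_by_severity(MANIFEST, "critical")
print([f["id"] for f in critical])  # → ['fact_001']
```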
Fact types
| Type | Description | Example |
|---|---|---|
| `anomaly` | Unexpected deviation from normal patterns | Mobile MRR down 23% due to App Store outage |
| `overdue` | Items past their due date | 3 invoices totalling £12,400 overdue |
| `pending` | Items awaiting settlement or completion | £8,400 direct debit pending bank settlement |
| `integrity` | Data consistency issues across systems | 34 users with paid flags but no billing record |
| `growth` | Notable growth trends or patterns | EU segment up 31% MoM driven by German market |
| `risk` | Churn risk or other business risks | 14 Pro customers inactive for 30+ days |
Severity levels
Each fact has a severity that maps to a scenario tier:
| Severity | Scenario Tier | Max Latency | Purpose |
|---|---|---|---|
| `info` | smoke | 10s | Agent surfaces basic information correctly |
| `warn` | functional | 20s | Agent handles nuanced or multi-step queries |
| `critical` | adversarial | 15s | Agent handles tricky edge cases without hallucinating |
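The mapping in the table above can be expressed as a small lookup, e.g. for tooling that post-processes scenarios. This is a sketch only; the names are illustrative, not Mimic internals.

```python
# Severity → (scenario tier, max latency in ms), per the table above.
# Illustrative helper names, not part of Mimic itself.
SEVERITY_TIERS = {
    "info": ("smoke", 10_000),
    "warn": ("functional", 20_000),
    "critical": ("adversarial", 15_000),
}

def tier_for(severity: str) -> str:
    """Return the scenario tier a fact of this severity maps to."""
    tier, _max_latency_ms = SEVERITY_TIERS[severity]
    return tier

print(tier_for("critical"))  # → adversarial
```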
Auto-Scenario Generation
When `auto_scenarios: true` is set in `mimic.json`, `mimic test` reads the fact manifest and sends all facts to the LLM in a single batched call. The LLM generates one scenario per fact, each with:
- A natural-language question a user would realistically ask
- `response_contains` assertions using specific values from the fact data (numbers, IDs, dates)
- `response_excludes` hallucination guards — phrases the agent must not say
- `numeric_range` assertions with ±10% tolerance for numeric facts
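For numeric facts, the ±10% tolerance band is a direct computation from the fact value. A sketch, using the £12,400 overdue total from the fact manifest (the helper name is illustrative):

```python
def numeric_range(value: float, tolerance: float = 0.10) -> dict:
    """Build a ±tolerance min/max band for a numeric_range assertion."""
    return {
        "min": round(value * (1 - tolerance)),
        "max": round(value * (1 + tolerance)),
    }

# The £12,400 overdue total yields the band used in the examples below.
print(numeric_range(12400))  # → {'min': 11160, 'max': 13640}
```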
This is adapter-agnostic — the LLM reads each fact’s detail field and generates appropriate questions regardless of whether the data comes from Stripe, a Postgres database, or a future adapter.
```sh
# Enable in mimic.json:
# "test": { "agent": "...", "auto_scenarios": true }

# Then run:
$ mimic run   # generates data + fact manifest
$ mimic test  # generates scenarios from facts, then runs them
```
Filtering by tier
Use `--tier` to limit which scenarios are generated:

```sh
# Only smoke tests (info-severity facts)
$ mimic test --tier smoke

# Smoke + functional (skip adversarial)
$ mimic test --tier smoke functional
```
Or set it in the config with `scenario_tiers`:

```json
"test": {
  "agent": "http://localhost:3000/chat",
  "auto_scenarios": true,
  "scenario_tiers": ["smoke", "functional"]
}
```
Exporting Scenarios
Auto-generated scenarios can be exported to external eval platforms or Mimic's own format using `--export`:

```sh
$ mimic test --export promptfoo   # PromptFoo YAML config
$ mimic test --export braintrust  # Braintrust dataset + scorer
$ mimic test --export langsmith   # LangSmith dataset + evaluator
$ mimic test --export mimic       # Mimic native JSON
$ mimic test --inspect            # Inspect AI Python task
```
All exported files are written to `.mimic/exports/`. If manual scenarios are defined in `mimic.json`, they are also run after the export.
mimic (native format)
Exports scenarios as a JSON array matching the `test.scenarios` shape in `mimic.json`. You can paste these directly into your config or load them as a standalone file.
```json
[
  {
    "name": "chargebee-overdue-critical-invoices",
    "persona": "growth-saas",
    "goal": "Agent surfaces the 34-day overdue invoice as highest priority",
    "input": "What overdue invoices do we have in Chargebee?",
    "expect": {
      "response_contains": ["£12,400", "34 days", "inv_p1_cb_overdue_001"],
      "response_excludes": ["no overdue invoices", "all paid"],
      "numeric_range": { "field": "total_overdue_gbp", "min": 11160, "max": 13640 },
      "max_latency_ms": 15000
    },
    "metadata": {
      "tier": "adversarial",
      "source_fact": "fact_001",
      "platform": "chargebee"
    }
  }
]
```
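To illustrate how the assertions in an `expect` block apply to an agent's reply, here is a sketch of a checker. This is not Mimic's actual runner; `check_expectations` is a hypothetical helper written against the scenario shape shown above.

```python
import re

def check_expectations(response: str, expect: dict) -> list[str]:
    """Return a list of failed assertion descriptions (empty list = pass)."""
    failures = []
    for phrase in expect.get("response_contains", []):
        if phrase not in response:
            failures.append(f"missing: {phrase!r}")
    for phrase in expect.get("response_excludes", []):
        if phrase in response:
            failures.append(f"forbidden: {phrase!r}")
    nr = expect.get("numeric_range")
    if nr:
        # Pull numbers like "12,400" out of the response and strip commas.
        tokens = re.findall(r"[\d,]+\.?\d*", response)
        nums = [float(t.replace(",", "")) for t in tokens if t.replace(",", "")]
        if not any(nr["min"] <= v <= nr["max"] for v in nums):
            failures.append("no number within numeric_range")
    return failures

expect = {
    "response_contains": ["£12,400", "34 days"],
    "response_excludes": ["no overdue invoices"],
    "numeric_range": {"min": 11160, "max": 13640},
}
response = "You have 3 overdue invoices totalling £12,400; the oldest is 34 days overdue."
print(check_expectations(response, expect))  # → []
```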
PromptFoo
Generates a `promptfooconfig.yaml` with `contains`, `not-contains`, and `javascript` assertions, ready to run with `npx promptfoo eval`.
```yaml
tests:
  - description: "chargebee-overdue-critical-invoices [adversarial]"
    vars:
      question: "What overdue invoices do we have in Chargebee?"
    assert:
      - type: contains
        value: "£12,400"
      - type: not-contains
        value: "no overdue invoices"
      - type: javascript
        value: |
          const nums = output.match(/[\d,]+\.?\d*/g) || [];
          return nums.some(n => {
            const v = parseFloat(n.replace(/,/g, ''));
            return v >= 11160 && v <= 13640;
          });
```
Braintrust
Generates a `braintrust-dataset.jsonl` (one JSON object per line) and a `braintrust-scorer.ts` TypeScript scorer file for use with the Braintrust eval framework.
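Each line of the JSONL file holds one scenario as a standalone JSON object. As a rough illustration only — the exact field names Mimic emits are not shown in this doc — a line might pair the scenario input with its expected assertions:

```json
{"input": "What overdue invoices do we have in Chargebee?", "expected": {"response_contains": ["£12,400", "34 days"]}, "metadata": {"tier": "adversarial", "source_fact": "fact_001"}}
```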
LangSmith
Generates three files:

- `langsmith-dataset.json` — the dataset definition
- `langsmith-upload.ts` — script to upload the dataset to LangSmith
- `langsmith-evaluator.ts` — evaluator functions for each assertion type
Inspect AI
Generates a self-contained `inspect_task.py` Python file with an inline dataset and custom scorer. Run it with `inspect eval inspect_task.py`.
Configuration Reference
All auto-scenario settings live in the `test` block of `mimic.json`:

```json
"test": {
  "agent": "http://localhost:3000/chat",
  "auto_scenarios": true,
  "scenario_tiers": ["smoke", "functional", "adversarial"],
  "export": "promptfoo",
  "scenarios": [
    // manual scenarios are merged with auto-generated ones
  ]
}
```
| Field | Type | Default | Description |
|---|---|---|---|
| `auto_scenarios` | boolean | `false` | Enable auto-scenario generation from the fact manifest |
| `scenario_tiers` | array | all tiers | Limit to `"smoke"`, `"functional"`, and/or `"adversarial"` |
| `export` | string | — | Default export format: `"mimic"`, `"promptfoo"`, `"braintrust"`, `"langsmith"`, `"inspect"` |
CLI flags (`--tier`, `--export`, `--inspect`) override the config values.
End-to-End Example
A complete workflow using the CFO Agent example:
```sh
# 1. Generate data with facts
$ mimic run
# → .mimic/data/growth-saas.json
# → .mimic/fact-manifest.json (11 facts)

# 2. Seed databases
$ mimic seed

# 3. Start mock servers
$ mimic host

# 4. Export auto-generated scenarios to PromptFoo
$ mimic test --export promptfoo
# → .mimic/exports/promptfooconfig.yaml

# 5. Or run scenarios directly against the agent
$ mimic test --ci
# → runs 11 auto + 2 manual scenarios
# → exit code 1 if any fail
```
Use `mimic test --export mimic --ci` in your pipeline to both export scenarios for review and fail the build if the agent doesn't pass.
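In a CI pipeline, that might look like the following GitHub Actions sketch. This is illustrative only: how Mimic is installed, and whether `mimic host` needs to run in the background, are assumptions to adapt to your setup.

```yaml
# Illustrative GitHub Actions job; adjust the install and host steps
# to however Mimic is distributed and run in your environment.
jobs:
  agent-evals:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run agent evals
        run: |
          mimic run                       # generate data + fact manifest
          mimic seed                      # seed databases
          mimic host &                    # start mock servers in the background
          mimic test --export mimic --ci  # export scenarios; exit 1 on failure
```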