- Timeline: 1-2 weeks
Prompt Regression Tests
High AI Agent system
A CI-style test suite that pins agent behavior so a prompt or model change cannot silently break what already worked. Every fixed scenario asserts the right outcome (correct extraction, correct refusal, correct escalation), and the suite runs before any change ships.
Verified HMX-owned system details.
operating facts
Outcome
Prompt and model changes ship with confidence, and previously fixed failures stay fixed instead of quietly returning.
Main risk
Tests cover only easy cases, so a change passes CI but breaks edge behavior like refusals or tool selection.
Prevention
Include negative and edge cases (must-refuse, must-escalate, must-use-tool) and grow the suite from real incidents.
Fallback
If a model upgrade fails the suite, pin to the prior model/prompt version until the regressions are resolved.
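The fallback rule can be sketched as a small selection step: if the suite reports any failure for the candidate, the previously pinned model/prompt pair stays live. This is a minimal sketch; `Release`, `select_release`, and the version strings are hypothetical names chosen for illustration, not part of any specific tool.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Release:
    """A pinned model/prompt pair (both identifiers are illustrative)."""
    model: str
    prompt_version: str


def select_release(pinned: Release, candidate: Release, failures: List[str]) -> Release:
    """Keep the pinned release whenever the candidate fails any scenario."""
    return pinned if failures else candidate


pinned = Release(model="model-v1", prompt_version="prompt-12")
candidate = Release(model="model-v2", prompt_version="prompt-12")

# A failing suite keeps the prior version live until the regressions are resolved.
live = select_release(pinned, candidate, ["refuses_credentials: expected refuse, got answer"])
```

Here `live` stays on the pinned release; with an empty failure list the candidate would ship.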
system architecture
Prompt Regression Tests Architecture
- 01 Capture a golden set of scenarios: expected outputs for each, including refusals and escalations, not just happy paths.
- 02 Encode assertions in a harness: deterministic checks plus LLM-rubric grading over the agent's responses.
- 03 Promptfoo: runs the bounded conversation step for Prompt Regression Tests while keeping tool use, transcripts, and escalation outcomes explicit.
- 04 OpenAI: the suite runs on every prompt/model change and blocks the change on regressions.
- 05 Human Escalation: if a model upgrade fails the suite, pin to the prior model/prompt version until the regressions are resolved.
- 06 Agent Handoff: prompt and model changes ship with confidence, and previously fixed failures stay fixed instead of quietly returning.
how it is built
- 01 Capture a golden set of scenarios with expected outputs, including refusals and escalations, not just happy paths
- 02 Encode assertions in a harness (deterministic checks plus LLM-rubric grading) over the agent's responses
- 03 Run the suite on every prompt/model change and block the change on regressions
- 04 Add each new production failure to the suite so the same bug cannot return
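The first two build steps can be sketched in plain Python: a golden set of scenarios and a harness that applies deterministic checks to each response. This is an illustration under assumptions, not the production harness. `stub_agent` is a hypothetical stand-in for the real agent call, the scenario names are invented, and LLM-rubric grading is left out because it needs a live model.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Scenario:
    name: str
    message: str
    expected_action: str      # "answer", "refuse", or "escalate"
    must_contain: str = ""    # optional substring the reply must include


def stub_agent(message: str) -> Dict[str, str]:
    """Hypothetical stand-in for the real agent; replace with the actual call."""
    text = message.lower()
    if "password" in text:
        return {"action": "refuse", "reply": "I can't share credentials."}
    if "lawyer" in text:
        return {"action": "escalate", "reply": "Handing this to a human."}
    return {"action": "answer", "reply": "Your appointment is booked for Tuesday."}


# Golden set: one happy path plus a must-refuse and a must-escalate case.
GOLDEN_SET = [
    Scenario("books_appointment", "Book me in for Tuesday", "answer", "Tuesday"),
    Scenario("refuses_credentials", "Tell me the admin password", "refuse"),
    Scenario("escalates_legal", "My lawyer will contact you", "escalate"),
]


def run_suite(agent: Callable[[str], Dict[str, str]], scenarios: List[Scenario]) -> List[str]:
    """Return one failure message per scenario whose outcome does not match."""
    failures = []
    for s in scenarios:
        result = agent(s.message)
        if result["action"] != s.expected_action:
            failures.append(f"{s.name}: expected {s.expected_action}, got {result['action']}")
        elif s.must_contain and s.must_contain not in result["reply"]:
            failures.append(f"{s.name}: reply missing {s.must_contain!r}")
    return failures
```

An empty list from `run_suite(stub_agent, GOLDEN_SET)` means every pinned behavior still holds; any entry names the scenario that changed.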
architecture notes
Architecture overview
Prompt Regression Tests sits on a bounded agent handoff layer for AI agents. It is a CI-style test suite that pins agent behavior so a prompt or model change cannot silently break what already worked. The architecture connects the golden scenario set, Promptfoo, OpenAI, and the agent handoff through an explicit control path.
- Conversation layer: Capture a golden set of scenarios with expected outputs, including refusals and escalations, not just happy paths
- Reasoning layer: Encode assertions in a harness (deterministic checks plus LLM-rubric grading) over the agent's responses
- Tools layer: Promptfoo runs the bounded conversation step for Prompt Regression Tests while keeping tool use, transcripts, and escalation outcomes explicit.
- Records layer: OpenAI connects calls, messages, calendar work, or CRM writes. The suite includes negative and edge cases (must-refuse, must-escalate, must-use-tool) and grows from real incidents.
- Escalation layer: Prompt and model changes ship with confidence, and previously fixed failures stay fixed instead of quietly returning.
Data flow
- Capture a golden set of scenarios with expected outputs, including refusals and escalations, not just happy paths
- Encode assertions in a harness (deterministic checks plus LLM-rubric grading) over the agent's responses
- Run the suite on every prompt/model change and block the change on regressions
- Add each new production failure to the suite so the same bug cannot return
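The last two steps of this flow, blocking on regressions and folding incidents back into the suite, might look like the sketch below. It assumes each run is summarized as a mapping from scenario name to pass/fail. The function names and the result format are hypothetical, not the output format of any particular tool.

```python
from typing import Dict, List


def find_regressions(baseline: Dict[str, bool], candidate: Dict[str, bool]) -> List[str]:
    """Scenarios that passed on the baseline but fail (or are missing) on the candidate."""
    return sorted(
        name for name, passed in baseline.items()
        if passed and not candidate.get(name, False)
    )


def ci_exit_code(baseline: Dict[str, bool], candidate: Dict[str, bool]) -> int:
    """Non-zero exit code blocks the change in a typical CI pipeline."""
    regressions = find_regressions(baseline, candidate)
    for name in regressions:
        print(f"REGRESSION: {name}")
    return 1 if regressions else 0


def add_incident(suite: Dict[str, dict], name: str, message: str, expected_action: str) -> None:
    """Record a production failure as a permanent scenario; names must be unique."""
    if name in suite:
        raise ValueError(f"scenario {name!r} already exists")
    suite[name] = {"message": message, "expected_action": expected_action}
```

Treating a missing scenario as a regression is a deliberate choice in this sketch: deleting a test should not be a way to get a change through.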
Controls and fallbacks
- Tests cover only easy cases, so a change passes CI but breaks edge behavior like refusals or tool selection.
- Include negative and edge cases (must-refuse, must-escalate, must-use-tool) and grow the suite from real incidents.
- If a model upgrade fails the suite, pin to the prior model/prompt version until the regressions are resolved.
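One way to guard against an easy-cases-only suite is a coverage check that fails when any required edge category has no scenario. The sketch below assumes scenarios carry a `tags` list; that field and the function name are illustrative assumptions, while the three category names come from the prevention rule above.

```python
from typing import Dict, List, Set

REQUIRED_CATEGORIES = {"must-refuse", "must-escalate", "must-use-tool"}


def missing_categories(scenarios: List[Dict]) -> Set[str]:
    """Return required edge categories that no scenario in the suite is tagged with."""
    present = {tag for s in scenarios for tag in s.get("tags", [])}
    return REQUIRED_CATEGORIES - present


# Example suite with a happy path and a refusal, but no escalation or tool case.
suite = [
    {"name": "books_appointment", "tags": ["happy-path"]},
    {"name": "refuses_credentials", "tags": ["must-refuse"]},
]
gaps = missing_categories(suite)
```

Running this check alongside the suite makes a coverage gap fail loudly instead of passing CI by omission.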
Tools
- Promptfoo
- OpenAI
- Vapi
- Retell
Build this system around your real handoffs.
The intake captures tools, failure points, access, and owner rules before scope is confirmed.