Test Upgrade: Build your AI agent test suite from real conversations

Test cases built from synthetic inputs drift from what customers actually do, and writing them one by one is slow. Once written, they can only run against the environment they were built for, so validating across development, staging, and production means duplicating work. When a test fails, a pass/fail result tells you something broke but not where to look. And as test sets grow, they become harder to manage without a way to tag, group, or run individual cases.

The mismatch between your test sets and production reality is where trust in test results erodes.

Test Upgrade closes the gap between the test cases you maintain and the agent behavior you actually care about — with bulk creation from real conversations, cross-environment portability, and failure analysis that gives you a starting point, not just a result.

How it works

  • Bulk test case creation from conversations: Select multiple conversations directly from the Conversation list and convert them to test cases in a single action.
  • AI-generated test cases from conversation IDs: When creating a new test case, choose “User conversation” as the source, enter a Conversation ID, and click Generate. The AI analyzes the conversation and pre-fills all test materials, so you're refining a concrete starting point rather than building from scratch.
  • Labels, bulk selection, and single run: Tag test cases with labels for grouping and filtering. Multi-select cases for batch operations. Run any single case with one click directly from the list, without setting up a full test run.
  • Cross-environment test set sharing: A test set now runs against any environment (development, staging, or production) without duplication. Create once, validate everywhere. Scheduled runs and results display which environment was used.
  • Failure analysis for test sets: When a run contains failures, an AI-generated summary explains what went wrong across failed cases, giving you a starting point for debugging without inspecting each result individually.
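
To make the label and cross-environment ideas concrete, here is a minimal Python sketch. It is not the product's API: the AgentTestCase class, the filter_by_label and plan_runs helpers, and the conversation IDs are all hypothetical names invented for illustration. It only shows how a single set of labeled cases can be filtered and then paired with several environments without copying the cases.

```python
from dataclasses import dataclass, field

# Hypothetical data model: illustrative names only, not the product's API.
ENVIRONMENTS = ("development", "staging", "production")


@dataclass(frozen=True)
class AgentTestCase:
    name: str
    conversation_id: str  # the real conversation the case was built from
    labels: frozenset = field(default_factory=frozenset)


def filter_by_label(cases, label):
    """Return the cases tagged with the given label, preserving order."""
    return [case for case in cases if label in case.labels]


def plan_runs(cases, environments=ENVIRONMENTS):
    """Pair every case with every environment; the cases are never copied."""
    return [(env, case.name) for env in environments for case in cases]


# Made-up example data.
cases = [
    AgentTestCase("refund-request", "conv-001", frozenset({"billing"})),
    AgentTestCase("password-reset", "conv-002", frozenset({"account"})),
    AgentTestCase("double-charge", "conv-003", frozenset({"billing", "urgent"})),
]

billing = filter_by_label(cases, "billing")
runs = plan_runs(billing, ("staging", "production"))
```

In this sketch the environment is a parameter of the run rather than a property of the test case, which is the design that lets one test set be validated everywhere.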

Important notes

  • Message Match removed: The Message Match test type is no longer available. Existing Message Match test cases will not be accessible. All testing now uses Conversation Check.
  • Re-run to establish baseline: The evaluation logic has been upgraded. We recommend re-running your test sets after the update to establish a new baseline. Tests that previously passed may now produce different results if the earlier results were unreliable.
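
If you keep your own record of outcomes from before the update, a comparison like the following can show which cases changed once the new baseline is in place. This is a sketch under stated assumptions: the dictionaries of case names and "pass"/"fail" outcomes are made-up data, and changed_results is a hypothetical helper, not a feature of Test Upgrade.

```python
def changed_results(previous, current):
    """Return {case: (old, new)} for cases whose outcome differs between runs.

    Only cases present in both runs are compared.
    """
    shared = sorted(previous.keys() & current.keys())
    return {
        name: (previous[name], current[name])
        for name in shared
        if previous[name] != current[name]
    }


# Made-up outcomes recorded before and after the evaluation upgrade.
previous_run = {
    "refund-request": "pass",
    "password-reset": "pass",
    "double-charge": "fail",
}
new_baseline = {
    "refund-request": "pass",
    "password-reset": "fail",
    "double-charge": "fail",
}

changes = changed_results(previous_run, new_baseline)
```

Cases that appear in the result are the ones worth reviewing first, since a changed outcome may reflect the upgraded evaluation logic rather than a regression in the agent.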

Test Upgrade is part of Trust OS, giving teams a test suite that stays grounded in real customer behavior, runs across any environment, and surfaces failures with enough context to act on.