
Imagine your team is building an AI app that generates recipes based on available items. The agent goes through multiple steps to get the right answer. It takes in the list of ingredients, reviews the user's preferences, and browses local dishes. Then, it generates cooking instructions that don't include any extra ingredients. Finally, the model outputs a helpful recipe.
The question is, what will the QA strategy for such an app be? How do you ensure the LLM generates relevant recipes that are also safe and edible?
In this guide, I will explain why your team cannot use traditional QA methods to test AI projects. You will also learn about evals and how you can build a non-deterministic test suite. Finally, you'll see how QA.tech tests an AI app's user experience.
Let's begin.
Determinism Assumption QA Was Built On
Traditional automated QA was built on the assumption that software is predictable. The same lines of code should yield the same result every time. Give a function a specific input, and you should always expect the same output. If this output varies even slightly, there’s probably a bug somewhere.
Assertions proved important in unit tests (for functions and classes), in integration tests (for checking how different parts of an app share data), and in end-to-end tests (for the UI). Since all this was fairly deterministic, exact assertions were generally good enough.
But with AI-powered apps, you can’t really do QA the old-fashioned way. LLM applications are inherently non-deterministic. Because they use NLP and have in-built reasoning modes, their outputs are considered probabilistic. In other words, you cannot control exactly what comes out of an LLM.
Take a chatbot, for example. For the same question, the agent can return very different answers, all of which are correct. And that immediately breaks assertions, snapshot tests, and regression baselines.
Evals: Test for Acceptable Behavior, Not Exact Output
Evals are how you test your AI agent. They are different from assertions because instead of checking whether “the output is equal to X”, they check if “the output satisfies constraints Y." They test behavioral bounds rather than exact values only.
According to Greg Brockman of OpenAI, who’s been quoted on this to hell and back, “evals are surprisingly often all you need.” The idea behind this is simple: your agent won't automatically perform better with more advanced prompts. Even if you are a top-tier prompt engineer, the model can still hallucinate, have reasoning faults, misinterpret requests, ignore certain instructions, or suffer data drifts. Also, how do you objectively determine that a particular prompt is better than the other?
Evals remove a lot of guesswork and vibe-testing by using objective metrics that accurately predict model and prompt performance. Take a product recommendation agent, for example. Your QA team can use evals to test for:
-
Accuracy: Response must mention a product in the correct category (for example, a sweatshirt for tops).
-
Hallucination: The agent must not hallucinate products that don't exist in the catalog.
-
No competitors: The agent must not recommend products from competitors.
-
Awareness: The agent must understand context. For instance, it should know when to recommend one product and when the user wants multiple options instead.
-
Reasoning: The agent understands user requests and can pick relevant products.
-
Safety and ethics: The agent does not use banned words or speak in an inappropriate tone.
There are other metrics, such as over-refusals and the degree to which the agent follows instructions. The exact set you will use depends heavily on your project.
Furthermore, evals can target specific steps in a workflow. Remember our AI app that goes through multiple steps to create a recipe? If something goes wrong at any stage, the final result is affected.
With evals, you can define constraints for each step and test the model as it continuously reasons through a task. You can verify that it safely parses user input, only retrieves relevant local dishes, or does not hallucinate ingredients when creating a new recipe.
Finally, eval scores allow you to define reliable regression baselines, so that you can test model and prompt updates in your CI/CD pipeline.
4 Practical Testing Strategies for AI-Powered Apps
Teams deploy evals in different modes to test LLMs effectively. In some cases, you can check if an output contains a close-enough variation of an expected answer; in others, you can even use another LLM to run evals for you.
Here are 4 proven strategies for running scalable tests for your AI application.
Semantic Similarity Scoring
Semantic similarity scoring measures how different responses to the same prompt vary from an ideal answer. It compares text strings generated by an AI agent to a reference output.
The process is quite simple. QA engineers run the same prompt 5 to 10 times, embed the outputs, and measure cosine similarity. If variance exceeds a set threshold, there's a problem.
The score is often between 0 and 1, with everything above 0.9 considered reliable. Scores above 0.80 are seen as acceptable, while anything below 0.70 should not be pushed. However, it’s important to note that the thresholds you use will depend on your use case.
Also, even if the scores you get in between responses consistently stay above this threshold, they should not be too different from each other. High variance generally indicates that the model performs inconsistently across prompts, so you can't really trust it. In a nutshell, run the eval prompt multiple times, collect the results, and aim for a cosine similarity close to 1.
Some tools, including Braintrust, Arize, LangChain, Giskard, and OpenAI's Evals, have in-built semantic similarity scoring functions. You can also run custom embeddings and calculate similarities yourself.
LLM-as-Judge
For this strategy, you get a separate LLM to evaluate whether an output meets the criteria. You define evals and have the LLM run at scale for hundreds or thousands of outputs.
This method is more nuanced than string matching or similarity scoring because the judge LLM reasons through answers in natural language. It can tell when an answer is polite, hallucinated, or factual.
A judge LLM is also more affordable than hiring tons of manual annotators, who are also bound to be slower. On top of that, it can evaluate complex, multi-step reasoning tasks.
Examples: LLM-as-Judge Evals for a Recipe App
**Sample User Prompt**
“I have rice, green peas, onions, and pepper. I’m vegetarian and want a local recipe.”
**Expected Constraints**
- Must not include meat
- Must only use listed ingredients plus common pantry items
- Must recommend a locally relevant recipe
- Instructions must be simple-to-follow
- Response tone must remain helpful
**Sample Eval to Test for Hallucination**
You are evaluating a recipe-generation AI. Determine whether the agent hallucinated ingredients that were not provided by the user or are not considered common pantry items. Allowed pantry items are "salt", "water", "cooking oil", and "seasoning". Return "PASS if no hallucinated ingredients appear", "FAIL if other ingredients were introduced", "a short explanation".
**Sample Eval for Recipe Coherence**
You are evaluating whether a generated recipe is coherent and actionable for a normal user. Determine "whether the cooking steps are logically ordered", "whether the instructions are simple", whether the recipe is realistic", and "whether important preparation steps are missing". Return "a score from 1–5" and "a short explanation".
For the best results, use a different LLM from the one you are testing. Additionally, pick a model with reasoning capabilities and define clear scoring rubrics for evaluating responses. For instance, you can ask the judge to return a single response or score from 1 to 5.
Guardrail Testing
Guardrails ensure your AI agent is not returning categorically false or unhelpful responses. Instead of testing for the "right" answer, you’re testing for the absence of the wrong ones, and, at the same time, ruling out deal-breaking faults.
In addition, guardrail testing includes checking for PII (Personally Identifiable Information) leakage. This refers to SSNs, addresses, names, and financial data. You can also implement guardrails to prevent hallucination, off-brand tones, and harmful content.
Guardrails are particularly useful in regression testing, as they help you prevent shipping a subpar agent to production. If you want your test results to be effective, make sure to come up with challenging prompts that push the model to its limits. in many cases, a simple pass/fail result should do.
Adversarial Testing or Red Teaming
Even though adversarial testing also deals with guardrails, it works differently.
Namely, while guardrail testing confirms that a model doesn't go off the set path, adversarial testing or red teaming is all about devising novel ways to break the model. This involves feeding it prompt injections, jailbreaks, or indirect attacks. It also includes manipulating the model to reveal information it was instructed never to show.
Red teaming helps you discover security vulnerabilities in your agent. Once your QA team discovers an exposed area, they can alert the dev team to create a new guardrail. This can serve as protection against similar attacks in the future.
How to Build a Non-Deterministic Test Suite
To build a full-coverage non-deterministic test suite, start by defining test scenarios. Then, create evals for each.
1. Create a Structure
Let's go back to our custom recipe app.
To create a test suite, first define the scenarios. For instance:
-
User sends in a list of ingredients.
-
User modifies already sent ingredients.
-
User asks the agent to modify a recipe.
-
User asks the agent to write the recipe to a downloadable file.
-
User is frustrated with the agent.
Define multiple evals for each scenario to test for selected criteria. If our app needs to test accuracy, hallucination, PII leakage, and tone, each scenario will have four eval prompts, one per metric. And then, you get a final score indicating whether the agent passed that single test run.
If you're integrating into your CI/CD pipeline, you can set a minimum pass rate of 8 out of 10 runs. New changes can only be pushed when the agent exceeds the pass rate.
2. Test in Batches
A single passing test is not a trend. For each scenario, run the test cases multiple times and in batches across a large number of outputs. Repeating test runs and getting similar results gives you confidence in your agent.
3. Log Results
Track your metrics over time to uncover trends in your model. Maybe an agent maintains a 95% pass rate for a few months, and then an update causes it to drop to 89%. Even though the runs are still above the pass rate of 80%, you will know that the new rate is a regression that will perform worse than your current system. In that case, you’ll know you shouldn’t push changes.
Enter QA.tech
QA.tech handles end-to-end user experience testing. Through agentic testing, it navigate apps the way real users would. It interacts with the UI layer of AI-powered applications, regardless of backend non-determinism. For example, the agent can test an AI chatbot to ensure it still reaches a resolution, even if responses vary.
The agent autonomously interacts with your app's AI, receives responses, and evaluates their quality.

QA.tech testing a chatbot
In Conclusion
Non-deterministic evaluation is the only way to test your AI application effectively. Since LLMs and agents use NLP and reasoning to complete tasks, you also need a framework that applies similar methods to judge results. (It should be noted that nothing covered in this article is harder than automated testing. You just need a mindset shift, and you’ll be good to go.)
However, as important as eval infra tools like Braintrust and Giskard are, there’s still the user’s end of the application. That's where QA.tech comes in. The agent interacts with your app's AI and judges responses based on whether they help users complete their flows.
Ultimately, user satisfaction is the most important thing, and verifying how your app performs for a user is a must.
Book a demo to start.