Introduction

The development of autonomous agents poses a unique challenge that other types of applications don’t typically grapple with: heavy reliance on inherently non-deterministic dependencies at multiple points within the system. 

The challenges of a third-party remote dependency aside (“Is it just me or did gpt-4o suddenly get worse/slower/different this week? What changed?”), getting variable outputs from an LLM is kind of a “working as designed” situation. How do you wrangle something so fluid into a reliable and useful real-world application? Developers working with generative AI often find themselves having to tread the very fine line between:

  1. Micromanaging prompts and implementing checks and balances to force the LLM output into something more deterministic, or…
  2. Just letting the LLM “do what it does best” and dealing with the nondeterminism somehow.

What’s more, debugging LLM output often feels like poking at a black box. Even if you host a custom model yourself and have access to all the relevant innards, it’s not exactly a matter of simply attaching your IDE debugger and stepping through exactly what’s going on.

What’s a developer to do? For prompts specifically, prompt evaluation can be a good safety rail to add to the testing toolbox.

Prompt evaluation

Tweaking LLM prompts can feel like a guessing game. Would the model perform better with more detail, or less? Should you add more examples, or remove a few? Should you bribe it with $100 or threaten it with Rick Astley? Prompt evaluation tests can be useful to answer exactly these questions. 

Prompt evaluation lets you gauge the impact your prompt changes have on LLM output and track prompt improvement or degradation over time.

At QA.tech, we picked promptfoo as our prompt evaluation framework. We’ve created a generic prompteval package in our monorepo, which all of our services (including our test detection service, testing agent, and others) can consume to evaluate their LLM prompts. 

promptfoo can be used as a CLI, but our current primary use case is to run evaluations from test scripts. Let’s go through an example of how we evaluate tweaks to our agent’s test step update prompt.
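
For orientation, the programmatic API boils down to a single evaluate() call. Here is a minimal, self-contained sketch; the prompt, provider ID, and assert values are invented for illustration, and our real configuration appears later in this post.

import promptfoo from 'promptfoo'

async function minimalEval() {
  const result = await promptfoo.evaluate(
    {
      // A nunjucks-templated prompt; {{step}} is filled in from each test's vars
      prompts: ['Rewrite this test step as a single imperative sentence: {{step}}'],
      providers: ['openai:gpt-4o-mini'],
      tests: [
        {
          vars: { step: 'the user clicks the login button' },
          // Case-insensitive substring check on the model output
          assert: [{ type: 'icontains', value: 'login' }],
        },
      ],
    },
    { maxConcurrency: 1, cache: true },
  )
  console.log(result.stats)
}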

QA.tech agent structure

Test stages

As a brief overview, our primary agent goes through a series of steps when running each test: 

  • Reason
  • Act
  • Evaluate
  • Reflect
  • Update Steps

Each of the above stages makes its own LLM calls, each with its own prompts to evaluate. The agent itself decides which of the above actions to take next. If an Act stage fails, for example, the agent may decide to go back into Reason to figure out the best course of action.

Each test consists of an optional series of steps. If steps don’t exist yet, the agent will figure them out on its own. It will then attempt to codify and refine the steps as needed, for more efficient execution going forward. So, when a test run completes, the agent may go into the Reflect and Update Steps stages.
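
To make that flow concrete, here is a hypothetical sketch of the stage loop. The stage names mirror the list above, but the types and helpers (TestCase, runStage, decideNextStage) are illustrative rather than our actual agent code.

type AgentStage = 'REASON' | 'ACT' | 'EVALUATE' | 'REFLECT' | 'UPDATE_STEPS'

interface TestCase {
  goal: string
  steps?: string[] // optional; the agent figures steps out if they don't exist yet
}

// Illustrative stand-ins for the real stage implementations
declare function runStage(stage: AgentStage, test: TestCase): Promise<unknown>
declare function decideNextStage(
  stage: AgentStage,
  result: unknown,
): Promise<AgentStage | 'DONE'>

async function runTest(test: TestCase) {
  // Each stage makes its own LLM call(s) with its own prompt
  let stage: AgentStage | 'DONE' = 'REASON'
  while (stage !== 'DONE') {
    const result = await runStage(stage, test)
    // The agent decides what to do next, e.g. back to REASON after a failed ACT
    stage = await decideNextStage(stage, result)
  }
}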

Data points

Each stage above has a data point associated with it, which is stored for future review and refinement. These data points are what we use to test prompt tweaks.

Each data point contains a few key pieces of information:

  • The input that went into the stage (which the LLM prompt is then constructed from)
  • The result of each stage (based on the LLM output)
  • An optional expected result (what we wanted the stage to return)

The expected result is set when we want to give the agent feedback about its performance during a certain stage. If it did well, the expected result matches the actual output. If it did badly, we specify what the output should’ve been. 
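
In TypeScript terms, a data point looks roughly like this. It’s a simplified sketch: dataType, input, and expectedOutput line up with the test code you’ll see below, while the remaining field names are illustrative.

// Simplified sketch of a stored data point
type AgentDatapointDataType = 'REASON' | 'ACT' | 'EVALUATE' | 'REFLECT' | 'UPDATE_STEPS'

interface AgentDatapoint<TInput, TOutput> {
  dataType: AgentDatapointDataType // which stage produced it, e.g. 'UPDATE_STEPS'
  input: TInput                    // what went into the stage; the LLM prompt is built from this
  output: TOutput                  // what the stage returned, based on the LLM output
  expectedOutput?: TOutput         // optional: what we wanted the stage to return
  tags: string[]                   // e.g. ['prompteval'] to mark it for evaluation
}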

Now that we’ve covered the basics, let’s focus on the step update stage specifically.

Evaluating step update prompts

Tagging a data point for testing

If we want to test a specific data point for a stage, we assign a prompteval tag to it in our backend. 

Screenshot: the "prompteval" tag with a count of 26

Running the tests

Currently, our prompt evaluation tests run locally and are structured much like you would structure a unit or integration test in code. Here’s a closer look at the step update prompt evaluation test:

describe('@prompteval reasonActEvaluate - updateSteps - prompt evaluation - datapoints', async function () {
  it('evaluates datapoints', async function () {
    // Retrieve data points tagged for testing from the database
    const taggedDatapointsToEval = await getTypedAgentDatapointsByTags(
      toEvalDatapointTag,
      'UPDATE_STEPS',
    )
    // Test all tagged data points, one at a time
    for (const dp of taggedDatapointsToEval) {
      await testDataPoint(dp, () => {
        return getExpectedOutputAsserts(dp.expectedOutput, 'UPDATE_STEPS')
      })
    }
  })
})

Above, we retrieve all data points with the relevant tag and data point type (UPDATE_STEPS in this case). We then run prompt evaluation on each of them; concurrency is on the roadmap, but right now we just run one data point at a time.
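
When we do parallelize, it will likely be simple batching over the tagged data points. Here is a rough sketch of one possible approach; the helper name and batch size are hypothetical, not our current code.

// Hypothetical batching helper for when we parallelize evaluation
async function testDataPointsInBatches(
  datapoints: Awaited<ReturnType<typeof getTypedAgentDatapointsByTags>>,
  batchSize = 4,
) {
  for (let i = 0; i < datapoints.length; i += batchSize) {
    const batch = datapoints.slice(i, i + batchSize)
    // Evaluate a handful of data points in parallel, then move on to the next batch
    await Promise.all(
      batch.map((dp) =>
        testDataPoint(dp, () =>
          getExpectedOutputAsserts(dp.expectedOutput, 'UPDATE_STEPS'),
        ),
      ),
    )
  }
}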

The testDataPoint() function is where our actual prompteval package is invoked:

async function testDataPoint(
  dp: {
    dataType: AgentDatapointDataType
    input: UpdateStepsInput
    expectedOutput: UpdateStepsOutput
  },
  assertFunc: AssertFunction | null,
) {
  // Build the prompt messages the same way the agent does in production
  const { lockedSteps, unlockedSteps } = splitSteps(dp.input.steps)
  const messages = await getMessages(dp.input, lockedSteps, unlockedSteps)
  // Round-trip through JSON to get a plain, serializable copy for promptfoo
  const messagesJson = JSON.stringify(messages)
  const obj = JSON.parse(messagesJson)

  // Prefer caller-provided asserts, then asserts derived from the expected output,
  // and fall back to the default assert set
  const asserts = assertFunc
    ? assertFunc()
    : dp.expectedOutput
      ? getExpectedOutputAsserts(dp.expectedOutput, 'UPDATE_STEPS')
      : defaultAsserts

  const evaluationRes = await evaluatePrompt(
    'Evaluate step update prompt',
    obj,
    [Provider.PortkeyAzure],
    asserts,
  )

  assert(evaluationRes)

  // Surface individual evaluation errors before failing the test on any failure
  for (const res of evaluationRes.results || []) {
    if (res.error) {
      console.error(
        `Prompt eval error: ${res.error}. Output: ${res.response?.output}`,
      )
    }
  }
  expect(evaluationRes.stats.failures).to.equal(0)
}

Above, we prepare our UpdateStepsInput by splitting locked from unlocked steps (we might cover this in a future blog post) and decide which prompt evaluation asserts to test against. If the caller provides a set of asserts, we use those; otherwise we derive asserts from the data point’s expected output, falling back to a default set when no expected output exists.

Here is an example of one of our asserts for the step update LLM prompt evaluation:

const wantSteps = expectedResult as UpdateStepsOutput
const newSteps = wantSteps.newSteps
return [
  {
    type: 'llm-rubric',
    value: newSteps
      ? `The new steps are contextually identical to '${JSON.stringify(newSteps)}'`
      : 'There are no new steps',
  },
]

The above assert uses an LLM for evaluation, but promptfoo also enables local testing for things like JSON validity, string inclusions, etc. 
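
For example, a default assert set that skips the LLM entirely could look something like this. It’s a sketch rather than our actual defaultAsserts: is-json and icontains are promptfoo’s built-in assertion types, but the specific values are illustrative.

// Sketch of a non-LLM assert set using promptfoo built-ins
const sketchDefaultAsserts: Assert[] = [
  // The model output must parse as valid JSON
  { type: 'is-json' },
  // The output should mention the newSteps field (case-insensitive substring check)
  { type: 'icontains', value: 'newSteps' },
]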

Now, let’s take a closer look at evaluatePrompt() itself – where we finally start interacting with promptfoo. I’ve left guiding comments in-line:

export async function evaluatePrompt(
  testName: string,
  prompt: string,
  providers: Provider[],
  asserts: Assert[],
) {
  // Map our provider enum values to full promptfoo provider configurations
  const providerConfigs: ProviderOptions[] = providers.map(
    (provider) => ProviderConfig[provider],
  )

  // If an assert specifies a provider by enum value, swap in its full provider config
  for (const assert of asserts) {
    const provider = assert.provider
    if (provider && typeof provider === 'number') {
      assert.provider = ProviderConfig[provider as Provider]
    }
  }

  // For now, one test per invocation, but we'll be refining this.
  const tests = [
    {
      description: testName,
      assert: asserts,
      providers: providerConfigs,
    },
  ]

  try {
    // Invoke promptfoo with our chosen tests, provider options, and defaults
    const res = await promptfoo.evaluate(
      {
        prompts: [prompt],
        providers: [
          ProviderConfig[Provider.PortkeyOpenAI],
          ProviderConfig[Provider.PortkeyAzure],
        ],
        writeLatestResults: true,
        defaultTest: {
          options: {
            provider: {
              text: ProviderConfig[Provider.PortkeyOpenAI],
              embedding: ProviderConfig[Provider.PortkeyOpenAI],
            },
          },
        },
        tests: tests,
      },
      {
        maxConcurrency: 1,
        cache: true,
      },
    )

    return res
  } catch (e) {
    console.error('Failed to evaluate:', e)
  }
}


And that’s it! We can now consume promptfoo’s response from the test and fail it, produce debug output, and so on if there were any errors.
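
For reference, this is roughly the subset of promptfoo’s evaluation summary that the test consumes (a simplified sketch; promptfoo’s exported result types contain much more):

// Simplified view of the result fields used in testDataPoint() above
interface EvalSummarySubset {
  stats: {
    successes: number
    failures: number
  }
  results: Array<{
    error?: string
    response?: { output?: string }
  }>
}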

Running these tests against existing data points after any prompt update helps us ensure that our tweaks aren’t degrading agent quality over time. Here is an example of a failing test after a prompt tweak resulted in an undesirable output format:

And a passing test after fixing the issue:

Conclusion

This has been a pretty quick first iteration of prompt evaluation so far, but it’s already caught a few prompt issues in our workflows. We’ll continue optimizing the implementation as we go along.

Contact us if you want to talk more about your own experiences with prompt evaluation for agent development or have any questions about what we’re building.