PMs Building AI Now Have to Manage 2 Products
Credit: the Pragmatic Engineer

PMs working on AI products now have to manage two products.

This week, I found myself writing detailed requirements for our evals platform.

Take a scenario like: “Add a slide summarizing the feedback I got from customers this week in my inbox.”

For that eval to work well:

  • The model executing the eval needs an inbox with emails that realistically match the scenario (and other emails to create a real relevance challenge for AI)
  • The model judging the output needs to determine whether the agent pulled the right context from that inbox
  • It also needs to assess whether the key points were extracted and summarized in a useful way
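A minimal sketch of what one such eval case could look like in code. All names here (`EvalCase`, `judge_retrieval`) are hypothetical, and the judge is stood in by a simple retrieval score rather than a model; the point is the shape: a simulated inbox with relevant emails plus distractors, and a check on whether the agent pulled the right context.

```python
from dataclasses import dataclass, field

@dataclass
class EvalCase:
    # The scenario the agent must handle.
    task: str
    # Simulated inbox: emails that match the scenario...
    relevant_emails: list = field(default_factory=list)
    # ...plus distractors that create a realistic relevance challenge.
    distractor_emails: list = field(default_factory=list)

def judge_retrieval(case: EvalCase, retrieved: list) -> float:
    """Score how well the agent pulled the right context:
    recall over relevant emails, penalized for distractors pulled in."""
    relevant = set(case.relevant_emails)
    retrieved_set = set(retrieved)
    if not relevant:
        return 0.0
    recall = len(relevant & retrieved_set) / len(relevant)
    noise = len(retrieved_set - relevant) / max(len(retrieved_set), 1)
    return recall * (1 - noise)

case = EvalCase(
    task="Summarize this week's customer feedback from my inbox",
    relevant_emails=["e1", "e2", "e3"],
    distractor_emails=["d1", "d2"],
)
# Agent found two of three relevant emails and one distractor.
print(judge_retrieval(case, ["e1", "e2", "d1"]))
```

In practice the judge would itself be a model grading summary quality, not a set comparison, but even then it needs ground truth like `relevant_emails` to grade against.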

At scale, this gets complicated fast.

If you want to approximate real customer contexts, you need to simulate evolving workstreams across email, chat, files, and more. All of that has to be designed, maintained, and measured.

It’s the product behind the product.

And beyond that, high-quality evals need to be treated like a product themselves. They need clear success metrics to show:

  • Whether they cover enough of the scenario space that matters to customers
  • Whether scoring actually aligns with human judgment, especially customer judgment
  • Whether the insights they generate can be turned into product improvements efficiently
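The second bullet, judge-human alignment, is the most mechanical to measure. A sketch, with a hypothetical helper name: compare the automated judge's pass/fail verdicts against human raters' verdicts on the same cases.

```python
def judge_human_agreement(judge_labels: list, human_labels: list) -> float:
    """Fraction of eval cases where the automated judge's pass/fail
    verdict matches the human rater's -- a proxy for whether scoring
    actually aligns with human judgment."""
    assert len(judge_labels) == len(human_labels), "need paired verdicts"
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(judge_labels)

# Example: the judge agrees with humans on 4 of 5 cases.
print(judge_human_agreement([1, 1, 0, 1, 0], [1, 0, 0, 1, 0]))  # 0.8
```

A raw agreement rate like this is only a starting point; if most cases pass, it can look high by chance, which is why teams often also track chance-corrected measures on the disagreements.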

For PMs building AI products, the customer-facing experience is only part of the job.

You also have to build the system that tells you whether the product is actually working.

A product behind the product.

It felt like you were holding back… is there a framework you follow? “Whether they cover enough of the scenario space that matters to customers” — how do you decide which scenarios matter?

This is the framework I’ve been following (give me feedback).

Playbook for Evals:

  • Define problem, scenarios & metrics
  • Write the evals
  • Operationalize the evals

Pre-prod:

  • Golden dataset
  • (If needed) simulate usage
  • Define launch plan

Post-prod:

  • Expert human evals
  • Online evals
  • Iterate and hill-climb to improve the evals

Inspired by:
https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents
https://www.ddmckinnon.com/2025/03/30/show-dont-tell-a-llama-pms-guide-to-writing-genai-evals/

Okay, I'll bite. Running effective production-grade tests has always been challenging in my realm, even before AI (in-app purchases on partner platforms, international forms of payment, navigating fraud rules designed to block the exact test you're trying to run…), but I can see how this ups the ante. How do you approach it? Anonymized [real] user data? Synthetic data?


More articles by Georges Krinker