William Demoraes

AI quality / Content design

2026 / Case study

Foundational evals for Ads Manager AI

I helped turn Ads Manager AI quality into something reviewers, prompt partners, and engineers could evaluate: grounded answers, useful caveats, clear next steps, and observable failure modes for campaign-performance guidance.

Visual mockups

Foundational evals made AI quality inspectable.

Recreated artifacts showing how content design turned Ads Manager AI quality into foundational evals, ad analysis rubrics, failure modes, calibration examples, and engineering-ready criteria.

Foundational eval console

Scenario

Why did performance drop this week?

Ad analysis quality

Input signals

7-day comparison
Spend pacing: normal
Click-through rate: -18%
Creative changed: 2 days ago

Candidate answer

Spend is pacing normally, but click-through rate dropped after the creative update. Review the new asset before changing budget.

Signal named / Caveated / Next step

Failure mode library

The team needed names for bad answers.

Generic optimization

Suggests budget or creative changes without a supporting signal.

False certainty

Turns a likely driver into a guaranteed cause.

Data echo

Repeats metrics without explaining what the user can do next.

Calibration packet

Content design became the review protocol.

01 Prompt
02 Ideal answer
03 Scored answer
04 Reviewer note

Reviewers could score the same answer against the same standard, then feed that signal back into prompts, product behavior, and engineering conversations.

Overview

What changed

This case positions evaluation as frontier content design work. As part of the foundational evals team on the content design side, I helped define what good Ads Manager AI should do: name the signal, respect uncertainty, use the advertiser's context, and give a next step specific enough to act on.

Role

My part

  • Stood up high-priority foundational evals for Ads Manager AI, translating content quality into observable response criteria.
  • Created 48 ad-analysis eval scenarios across budget pacing, delivery drops, creative changes, audience shifts, and performance-diagnostic questions.
  • Wrote rubrics, failure modes, ideal answers, and reviewer guidance for AI-generated campaign analysis.
  • Partnered with Marketing Science, conversation design, Product, and Engineering to calibrate response-quality standards and feed findings back into prompts and product behavior.

Problem

The product moment was unclear

Ads analysis is only useful when the answer explains what changed, why it may have happened, and what the advertiser can do next. Early AI responses could sound plausible while missing the signal, overclaiming causality, or giving generic advice. The team needed foundational evals that made quality concrete before the experience could scale.

Constraints

Rules of the work

  • The evals had to judge usefulness without pretending the system knew more than the data supported.
  • Marketing science expertise needed to become criteria that reviewers, prompts, and engineering workflows could all use.
  • Foundational evals had to be broad enough for core Ads Manager AI quality, while ad analysis evals had to be specific enough for campaign-performance answers.
  • Rubrics had to work across different campaign objectives, data availability, and advertiser sophistication levels.

Users

Who needed clarity

  • Advertisers asking why performance changed and what to inspect next.
  • Reviewers and content designers scoring AI answers for clarity, grounding, and actionability.
  • Engineers and product partners tuning prompts, data inputs, and response behavior.

Before / after

Examples in context, with the reason for each change

Ads Manager AI quality rubric

Foundational eval criterion for an AI answer explaining campaign performance.

Before

The answer is helpful and clear.

After

The answer names the performance change, explains likely drivers, gives one next diagnostic step, and avoids unsupported certainty.

Ads analysis response

AI answer in Ads Manager after an advertiser asks why results changed.

Before

Performance dropped this week. Consider increasing budget or changing creative.

After

Spend is pacing normally, but click-through rate dropped after the creative update. Review the new asset before changing budget.

Content decisions

The writing system underneath

Define good answers as behaviors

The foundational evals needed to describe what the answer does, not just how it feels. That made quality review less dependent on taste.

Names the changed metric, explains the likely driver, and gives one diagnostic next step.

Design for uncertainty

Ads analysis often works with partial evidence. The content standard had to help AI responses stay useful without overclaiming causality.

Use may or likely when the data suggests a driver but does not prove one.

Connect evals to product behavior

The evals became a bridge between content design, marketing science, conversation design, prompt behavior, and engineering implementation.

Fail answers that give advice without naming the campaign signal that supports it.
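The three decisions above can be read as a small scoring structure. What follows is a minimal sketch, not the team's actual tooling: the class name, field names, and the one-to-one mapping from a missed criterion to a named failure mode are assumptions made for illustration. The judgments themselves come from a human reviewer, not from keyword matching.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class ReviewerScore:
    """One reviewer's judgment of one AI answer (hypothetical field names)."""

    names_signal: bool            # advice is tied to a named campaign signal
    avoids_false_certainty: bool  # likely drivers are not stated as guaranteed causes
    gives_next_step: bool         # answer offers a diagnostic next step


def failure_modes(score: ReviewerScore) -> list[str]:
    """Map each missed criterion to a failure mode name from the library.

    Assumption: each failure mode corresponds to exactly one criterion.
    """
    modes = []
    if not score.names_signal:
        modes.append("Generic optimization")
    if not score.avoids_false_certainty:
        modes.append("False certainty")
    if not score.gives_next_step:
        modes.append("Data echo")
    return modes


def passes(score: ReviewerScore) -> bool:
    """An answer passes only when no failure mode applies."""
    return not failure_modes(score)


# Illustrative reviewer judgments for the two "Ads analysis response" examples.
before = ReviewerScore(names_signal=False, avoids_false_certainty=True, gives_next_step=True)
after = ReviewerScore(names_signal=True, avoids_false_certainty=True, gives_next_step=True)

print(failure_modes(before))
print(passes(after))
```

Keeping the criteria as reviewer-set booleans reflects the point of the rubric: quality is described as observable behavior, and the named failure modes give every partner the same vocabulary for a miss.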

Process

How I got there

  • Mapped high-value ads analysis questions across performance, budget, creative, audience, and delivery scenarios.
  • Translated marketing science guidance into response criteria, ideal answer examples, and failure modes.
  • Stood up high-priority foundational evals for clarity, grounding, specificity, transparency, and next-step usefulness.
  • Created ad analysis quality evals that tested whether answers explained signals, caveated uncertainty, and gave useful next diagnostics.
  • Calibrated examples with content, product, engineering, and subject-matter partners so scores meant the same thing across reviewers.
  • Fed patterns back into prompts, response guidance, and quality conversations with engineering.
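One way to check whether scores "meant the same thing across reviewers" is to measure how often reviewers agree on the same answers. The sketch below is a hypothetical illustration, not the method used on the project: the function name, the pass/fail scoring, and the sample data are all assumptions.

```python
from __future__ import annotations

from itertools import combinations


def pairwise_agreement(scores: dict[str, list[bool]]) -> float:
    """Share of (reviewer pair, answer) comparisons with matching pass/fail scores.

    `scores` maps a reviewer name to that reviewer's pass/fail judgments,
    listed in the same answer order for every reviewer.
    """
    agree = 0
    total = 0
    for first, second in combinations(scores, 2):
        for a, b in zip(scores[first], scores[second]):
            agree += int(a == b)
            total += 1
    return agree / total if total else 0.0


# Made-up calibration round: three reviewers, four answers each.
sample = {
    "reviewer_a": [True, True, False, True],
    "reviewer_b": [True, False, False, True],
    "reviewer_c": [True, True, False, False],
}

print(f"{pairwise_agreement(sample):.2f}")
```

A low number on a calibration round would suggest the rubric wording or the example set needs another pass before scores are compared across reviewers.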

Outcome

Impact signals

  • Created 48 ad-analysis eval scenarios across budget pacing, delivery drops, creative changes, audience shifts, and performance-diagnostic questions.
  • Improved sampled human-review acceptance for AI-generated advertiser guidance from 74% to 88% after rubric, prompt-content, and reviewer-calibration updates.
  • Reduced reviewer-flagged issue rate from 26% to 12% by standardizing how responses handled grounding, specificity, uncertainty, and next-step usefulness.
  • Gave prompt and engineering partners repeatable criteria for identifying generic optimization, false certainty, data echo, missing caveats, and ungrounded next steps.

Measurement note: Sampled human review compared AI-generated advertiser guidance before and after rubric, prompt-content, and reviewer-calibration updates. Acceptance meant the response passed review for grounding, specificity, uncertainty handling, and next-step usefulness without content-related revision. The reviewer-flagged issue rate is the complement of the acceptance rate in the same sample, so the drop from 26% to 12% and the rise from 74% to 88% describe the same change, not two independent results.
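The two headline figures can be checked with quick arithmetic. This is a minimal sketch, assuming every sampled response was either accepted or flagged (so the two rates sum to 100%); the function and variable names are illustrative, and no sample sizes are implied.

```python
def flagged_rate(acceptance_pct: float) -> float:
    """Flagged rate as the complement of acceptance (assumes accept-or-flag only)."""
    return 100.0 - acceptance_pct


acceptance_before = 74.0
acceptance_after = 88.0

flagged_before = flagged_rate(acceptance_before)  # 26.0
flagged_after = flagged_rate(acceptance_after)    # 12.0

# Absolute change in acceptance, in percentage points.
point_gain = acceptance_after - acceptance_before  # 14.0

# Relative reduction in flagged responses, as a fraction of the starting rate.
relative_drop = (flagged_before - flagged_after) / flagged_before

print(flagged_before, flagged_after, point_gain, round(relative_drop, 3))
```

Under that assumption, the 14-point gain in acceptance corresponds to roughly a 54% relative reduction in flagged responses.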

Learnings

What I would carry forward

  • In AI work, content design moves upstream into the definition of quality.
  • A rubric is product language with consequences.
  • The best AI answer is not the most confident one. It is the one that helps the user make the next better decision.
