AI quality / Content design
2026 / Case study
Foundational evals for Ads Manager AI
I helped turn Ads Manager AI quality into something reviewers, prompt partners, and engineers could evaluate: grounded answers, useful caveats, clear next steps, and observable failure modes for campaign-performance guidance.
Visual mockups
Foundational evals made AI quality inspectable.
Recreated artifacts showing how content design turned Ads Manager AI quality into foundational evals, ad analysis rubrics, failure modes, calibration examples, and engineering-ready criteria.
Foundational eval console
Scenario
Why did performance drop this week?
Input signals
7-day comparison
Candidate answer
Spend is pacing normally, but click-through rate dropped after the creative update. Review the new asset before changing budget.
Failure mode library
The team needed names for bad answers.
Generic optimization
Suggests budget or creative changes without a supporting signal.
False certainty
Turns a likely driver into a guaranteed cause.
Data echo
Repeats metrics without explaining what the user can do next.
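One way to make these names usable in review tooling is to encode them as a shared vocabulary that reviewers attach to answers. The sketch below is a minimal illustration, not the team's actual tooling: the `FailureMode` and `ReviewTag` names are hypothetical, and tagging the example answer with generic optimization is an illustrative judgment rather than a recorded review result.

```python
from dataclasses import dataclass, field
from enum import Enum


class FailureMode(Enum):
    """Named failure modes, with descriptions taken from the library above."""

    GENERIC_OPTIMIZATION = "Suggests budget or creative changes without a supporting signal."
    FALSE_CERTAINTY = "Turns a likely driver into a guaranteed cause."
    DATA_ECHO = "Repeats metrics without explaining what the user can do next."


@dataclass
class ReviewTag:
    """Hypothetical record of one reviewer's tags on one AI answer."""

    scenario: str
    answer: str
    failure_modes: list[FailureMode] = field(default_factory=list)

    @property
    def passed(self) -> bool:
        # An answer passes only when the reviewer attached no failure mode.
        return not self.failure_modes


# Illustrative tagging of a generic answer (an assumption, not real review data).
generic_answer = ReviewTag(
    scenario="Why did performance drop this week?",
    answer="Performance dropped this week. Consider increasing budget or changing creative.",
    failure_modes=[FailureMode.GENERIC_OPTIMIZATION],
)

for mode in generic_answer.failure_modes:
    print(f"{mode.name}: {mode.value}")
```

Keeping the description on the enum value means the label a reviewer picks and the definition they are agreeing to travel together.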
Calibration packet
Content design became the review protocol.
01
Prompt
02
Ideal answer
03
Scored answer
04
Reviewer note
Reviewers could score the same answer against the same standard, then feed that signal back into prompts, product behavior, and engineering conversations.
Overview
What changed
This case positions evaluation as frontier content design work. As part of the foundational evals team on the content design side, I helped define what good Ads Manager AI should do: name the signal, respect uncertainty, use the advertiser's context, and give a next step specific enough to act on.
Role
My part
- Stood up high-priority foundational evals for Ads Manager AI, translating content quality into observable response criteria.
- Created 48 ad-analysis eval scenarios across budget pacing, delivery drops, creative changes, audience shifts, and performance-diagnostic questions.
- Wrote rubrics, failure modes, ideal answers, and reviewer guidance for AI-generated campaign analysis.
- Partnered with Marketing Science, conversation design, Product, and Engineering to calibrate response-quality standards and feed findings back into prompts and product behavior.
Problem
The product moment was unclear
Ads analysis is only useful when the answer can explain what changed, why it may have happened, and what an advertiser can do next. Early AI responses could sound plausible while missing the signal, over-claiming causality, or giving generic advice. The team needed foundational evals that made quality concrete before the experience could scale.
Constraints
Rules of the work
- The evals had to judge usefulness without pretending the system knew more than the data supported.
- Marketing science expertise needed to become criteria that reviewers, prompts, and engineering workflows could all use.
- Foundational evals had to be broad enough for core Ads Manager AI quality, while ad analysis evals had to be specific enough for campaign-performance answers.
- Rubrics had to work across different campaign objectives, data availability, and advertiser sophistication levels.
Users
Who needed clarity
- Advertisers asking why performance changed and what to inspect next.
- Reviewers and content designers scoring AI answers for clarity, grounding, and actionability.
- Engineers and product partners tuning prompts, data inputs, and response behavior.
Before / after
Examples in context, with the reason for each change
Ads Manager AI quality rubric
Foundational eval criterion for an AI answer explaining campaign performance.
Before
The answer is helpful and clear.
After
The answer names the performance change, explains likely drivers, gives one next diagnostic step, and avoids unsupported certainty.
Ads analysis response
AI answer in Ads Manager after an advertiser asks why results changed.
Before
Performance dropped this week. Consider increasing budget or changing creative.
After
Spend is pacing normally, but click-through rate dropped after the creative update. Review the new asset before changing budget.
Content decisions
The writing system underneath
Define good answers as behaviors
The foundational evals needed to describe what the answer does, not just how it feels. That made quality review less dependent on taste.
Names the changed metric, explains the likely driver, and gives one diagnostic next step.
Design for uncertainty
Ads analysis often works with partial evidence. The content standard had to help AI responses stay useful without overclaiming causality.
Use may or likely when the data suggests a driver but does not prove one.
Connect evals to product behavior
The evals became a bridge between content design, marketing science, conversation design, prompt behavior, and engineering implementation.
Fail answers that give advice without naming the campaign signal that supports it.
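The three content decisions above can be read as a scoring sheet in which each behavior is a yes/no judgment a reviewer records. The following is a minimal sketch under stated assumptions: the `RubricScore` class and its field names are hypothetical, the all-criteria-must-pass rule is an assumption about how scores combine, and the two example scores are illustrative readings of the before and after answers, not actual review results.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RubricScore:
    """Hypothetical reviewer scoring sheet: one boolean per observable behavior."""

    names_changed_metric: bool
    explains_likely_driver: bool
    gives_one_next_step: bool
    avoids_unsupported_certainty: bool
    cites_supporting_signal: bool  # advice must name the campaign signal behind it

    def failed_criteria(self) -> list[str]:
        # Field order is preserved, so failures list in rubric order.
        return [name for name, met in asdict(self).items() if not met]

    @property
    def passed(self) -> bool:
        # Assumption: an answer passes only if every criterion is met.
        return not self.failed_criteria()


# Illustrative score for: "Performance dropped this week. Consider increasing
# budget or changing creative."
before = RubricScore(
    names_changed_metric=False,
    explains_likely_driver=False,
    gives_one_next_step=False,
    avoids_unsupported_certainty=True,
    cites_supporting_signal=False,
)

# Illustrative score for: "Spend is pacing normally, but click-through rate
# dropped after the creative update. Review the new asset before changing budget."
after = RubricScore(True, True, True, True, True)

print(before.passed, before.failed_criteria())
print(after.passed, after.failed_criteria())
```

Recording which criteria failed, rather than a single pass/fail, is what lets patterns feed back into prompts and engineering conversations.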
Process
How I got there
- Mapped high-value ads analysis questions across performance, budget, creative, audience, and delivery scenarios.
- Translated marketing science guidance into response criteria, ideal answer examples, and failure modes.
- Stood up high-priority foundational evals for clarity, grounding, specificity, transparency, and next-step usefulness.
- Created ad analysis quality evals that tested whether answers explained signals, caveated uncertainty, and gave useful next diagnostics.
- Calibrated examples with content, product, engineering, and subject-matter partners so scores meant the same thing across reviewers.
- Fed patterns back into prompts, response guidance, and quality conversations with engineering.
Outcome
Impact signals
- Created 48 ad-analysis eval scenarios across budget pacing, delivery drops, creative changes, audience shifts, and performance-diagnostic questions.
- Improved sampled human-review acceptance for AI-generated advertiser guidance from 74% to 88% after rubric, prompt-content, and reviewer-calibration updates.
- Reduced reviewer-flagged issue rate from 26% to 12% by standardizing how responses handled grounding, specificity, uncertainty, and next-step usefulness.
- Gave prompt and Engineering partners repeatable criteria for identifying generic optimization, false certainty, data echo, missing caveats, and ungrounded next steps.
Measurement note: Sampled human review compared AI-generated advertiser guidance before and after rubric, prompt-content, and reviewer-calibration updates. Acceptance meant the response passed review for grounding, specificity, uncertainty handling, and next-step usefulness without content-related revision. The reviewer-flagged issue rate is the complement of the acceptance rate in the same sample, so it fell from 26% to 12% as acceptance rose from 74% to 88%.
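The relationship between the two reported rates can be checked with simple arithmetic. This sketch assumes every sampled response was either accepted or flagged, which is consistent with the reported figures summing to 100%. The function and variable names are hypothetical, and the relative reduction is derived from the reported percentages rather than taken from the review data.

```python
def flagged_rate(acceptance_pct: int) -> int:
    """Flagged rate as the complement of acceptance (assumes two outcomes only)."""
    return 100 - acceptance_pct


acceptance_before, acceptance_after = 74, 88

flagged_before = flagged_rate(acceptance_before)  # 26
flagged_after = flagged_rate(acceptance_after)    # 12

# Change in percentage points, identical in size for both metrics.
point_change = acceptance_after - acceptance_before  # 14

# Relative reduction in flagged issues, derived from the reported rates.
relative_reduction_pct = round(100 * (flagged_before - flagged_after) / flagged_before, 1)

print(f"Flagged rate: {flagged_before}% -> {flagged_after}%")
print(f"Change: {point_change} percentage points")
print(f"Relative reduction in flagged issues: {relative_reduction_pct}%")
```

Because the two rates are complements, they describe a single improvement measured two ways, not two independent results.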
Learnings
What I would carry forward
- In AI work, content design moves upstream into the definition of quality.
- A rubric is product language with consequences.
- The best AI answer is not the most confident one. It is the one that helps the user make the next better decision.