AI quality / Content design
2026 / Case study draft
Foundational evals for Ads Manager AI
As part of the foundational evals team on the content design side, I stood up high-priority evals for Ads Manager AI and ad analysis quality.
Visual mockups
Foundational evals made AI quality inspectable.
Recreated artifacts showing how content design turned Ads Manager AI quality into foundational evals, ad analysis rubrics, failure modes, calibration examples, and engineering-ready criteria.
Foundational eval console
Scenario
Why did performance drop this week?
Input signals
7-day comparison
Candidate answer
Spend is pacing normally, but click-through rate dropped after the creative update. Review the new asset before changing budget.
Failure mode library
The team needed names for bad answers.
Generic optimization
Suggests budget or creative changes without a supporting signal.
False certainty
Turns a likely driver into a guaranteed cause.
Data echo
Repeats metrics without explaining what the user can do next.
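One way to make these names usable in tooling is to store them as a fixed set of tags that reviewers attach to answers. The following is a minimal sketch, not the team's actual implementation: the `FailureMode` enum and the `tally_failure_modes` helper are hypothetical names, and only the three labels and their definitions come from the library above.

```python
from collections import Counter
from enum import Enum


class FailureMode(Enum):
    """The three named failure modes, with the definitions from the library."""

    GENERIC_OPTIMIZATION = "Suggests budget or creative changes without a supporting signal."
    FALSE_CERTAINTY = "Turns a likely driver into a guaranteed cause."
    DATA_ECHO = "Repeats metrics without explaining what the user can do next."


def tally_failure_modes(reviews):
    """Count how often each failure mode was tagged across reviewed answers.

    `reviews` is a list with one entry per answer; each entry is the list of
    FailureMode tags a reviewer attached to that answer (hypothetical format).
    """
    counts = Counter(tag for tags in reviews for tag in tags)
    # Report every mode, including those with zero hits, so the shape is stable.
    return {mode.name: counts.get(mode, 0) for mode in FailureMode}


if __name__ == "__main__":
    sample_reviews = [
        [FailureMode.GENERIC_OPTIMIZATION],
        [],  # an answer with no failure modes tagged
        [FailureMode.FALSE_CERTAINTY, FailureMode.GENERIC_OPTIMIZATION],
    ]
    print(tally_failure_modes(sample_reviews))
```

A shared tag set like this lets a pattern such as "generic optimization keeps showing up" be counted rather than described from memory.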
Calibration packet
Content design became the review protocol.
01
Prompt
02
Ideal answer
03
Scored answer
04
Reviewer note
Reviewers could score the same answer against the same standard, then feed that signal back into prompts, product behavior, and engineering conversations.
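The four parts of the packet map naturally onto a small record, and "scoring against the same standard" can be checked with a simple agreement rate. This is a sketch under stated assumptions: the `CalibrationItem` fields mirror the packet's four steps, but the per-reviewer `scores` mapping, the numeric scale, and the `exact_agreement` measure are illustrative choices, not a description of how calibration was actually measured.

```python
from dataclasses import dataclass, field


@dataclass
class CalibrationItem:
    """One calibration example: prompt, ideal answer, scored answer, reviewer note."""

    prompt: str
    ideal_answer: str
    scored_answer: str
    reviewer_note: str
    # Hypothetical: reviewer name -> score on whatever scale the rubric uses.
    scores: dict = field(default_factory=dict)


def exact_agreement(items):
    """Return the share of items on which every reviewer gave the same score.

    Items scored by fewer than two reviewers are skipped. Returns None when
    no item has at least two scores.
    """
    comparable = [item for item in items if len(item.scores) >= 2]
    if not comparable:
        return None
    agreed = sum(1 for item in comparable if len(set(item.scores.values())) == 1)
    return agreed / len(comparable)


if __name__ == "__main__":
    packet = [
        CalibrationItem(
            prompt="Why did performance drop this week?",
            ideal_answer="(ideal answer text)",
            scored_answer="(candidate answer text)",
            reviewer_note="(why the score was given)",
            scores={"reviewer_a": 3, "reviewer_b": 3},
        ),
    ]
    print(exact_agreement(packet))
```

Exact agreement is the bluntest possible measure. It is used here only because it is easy to read, and a team might prefer a chance-corrected statistic.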
Overview
What changed
This case study treats evaluation as frontier content design work. As part of the foundational evals team on the content design side, I helped define what a good Ads Manager AI answer should do: name the signal, respect uncertainty, use the advertiser's context, and give a next step specific enough to act on.
Role
My part
- Stood up high-priority foundational evals for Ads Manager AI, translating content quality into observable response criteria.
- Created a suite of ad analysis quality evals covering campaign performance, signal explanation, answer usefulness, and next-step guidance.
- Wrote rubrics, failure modes, ideal answers, and reviewer guidance for AI-generated campaign analysis.
- Partnered with marketing science experts, conversational designers, product managers, and engineers to align response quality with product trust.
Problem
The product moment was unclear
An ads analysis answer is only useful when it explains what changed, why it may have happened, and what the advertiser can do next. Early AI responses could sound plausible while missing the signal, over-claiming causality, or giving generic advice. The team needed foundational evals that made quality concrete before the experience could scale.
Constraints
Rules of the work
- The evals had to judge usefulness without pretending the system knew more than the data supported.
- Marketing science expertise needed to become criteria that reviewers, prompts, and engineering workflows could all use.
- Foundational evals had to be broad enough for core Ads Manager AI quality, while ad analysis evals had to be specific enough for campaign-performance answers.
- Rubrics had to work across different campaign objectives, data availability, and advertiser sophistication levels.
Users
Who needed clarity
- Advertisers asking why performance changed and what to inspect next.
- Reviewers and content designers scoring AI answers for clarity, grounding, and actionability.
- Engineers and product partners tuning prompts, data inputs, and response behavior.
Before / after
Examples in context, with the reason for each change
Ads Manager AI quality rubric
Foundational eval criterion for an AI answer explaining campaign performance.
Before
The answer is helpful and clear.
After
The answer names the performance change, explains likely drivers, gives one next diagnostic step, and avoids unsupported certainty.
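The "After" criterion breaks into four observable checks, which is what makes it scorable. Here is a minimal sketch of that idea, assuming a reviewer records a yes/no judgment per check. The criterion keys, the `score_answer` function, and the all-checks-must-pass rule are hypothetical choices for illustration and are not taken from the actual rubric.

```python
# The four behaviors named in the "After" criterion, as checklist keys.
RUBRIC_CRITERIA = (
    "names_performance_change",
    "explains_likely_drivers",
    "gives_one_next_diagnostic_step",
    "avoids_unsupported_certainty",
)


def score_answer(judgments):
    """Turn a reviewer's yes/no judgments into a pass/fail result.

    `judgments` maps each criterion key to True or False. Assumption: an
    answer passes only if every criterion is met.
    """
    missing = [c for c in RUBRIC_CRITERIA if c not in judgments]
    if missing:
        raise ValueError(f"Missing judgments for: {', '.join(missing)}")
    failed = [c for c in RUBRIC_CRITERIA if not judgments[c]]
    return {"passed": not failed, "failed_criteria": failed}


if __name__ == "__main__":
    # Illustrative judgments only; a real reviewer decides these per answer.
    example = {
        "names_performance_change": True,
        "explains_likely_drivers": False,
        "gives_one_next_diagnostic_step": False,
        "avoids_unsupported_certainty": True,
    }
    print(score_answer(example))
```

Returning the list of failed criteria, rather than a single score, keeps the result specific enough to hand to a prompt or engineering partner.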
Ads analysis response
AI answer in Ads Manager after an advertiser asks why results changed.
Before
Performance dropped this week. Consider increasing budget or changing creative.
After
Spend is pacing normally, but click-through rate dropped after the creative update. Review the new asset before changing budget.
Content decisions
The writing system underneath
Define good answers as behaviors
The foundational evals needed to describe what the answer does, not just how it feels. That made quality review less dependent on taste.
Names the changed metric, explains the likely driver, and gives one diagnostic next step.
Design for uncertainty
Ads analysis often works with partial evidence. The content standard had to help AI responses stay useful without overclaiming causality.
Use may or likely when the data suggests a driver but does not prove one.
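A guideline like this can be given a rough automated first pass before human review. The sketch below is a crude keyword heuristic, not the eval itself: the word lists, the `overclaims` function, and the `driver_is_proven` flag are all assumptions for illustration, and phrase matching will miss many real cases of overclaiming.

```python
import re

# Hypothetical word lists; "may" and "likely" come from the guideline above.
HEDGE_WORDS = ("may", "might", "likely", "could", "appears", "suggests")
CERTAINTY_PHRASES = ("caused", "definitely", "guaranteed", "is due to")


def _contains_any(text, phrases):
    """Return True if any phrase appears in text as a whole word or phrase."""
    return any(re.search(r"\b" + re.escape(p) + r"\b", text) for p in phrases)


def overclaims(answer, driver_is_proven):
    """Flag answers that state a cause with certainty the data does not support.

    `driver_is_proven` is supplied by the reviewer or the test case. It says
    whether the data proves the driver rather than only suggesting it.
    """
    if driver_is_proven:
        return False
    text = answer.lower()
    has_certainty = _contains_any(text, CERTAINTY_PHRASES)
    has_hedge = _contains_any(text, HEDGE_WORDS)
    return has_certainty and not has_hedge


if __name__ == "__main__":
    print(overclaims("The creative update caused the drop.", driver_is_proven=False))
    print(overclaims("The creative update likely contributed to the drop.", driver_is_proven=False))
```

A flag from a check like this is best treated as a prompt for a reviewer to look closer, since hedged wording alone does not make an answer well grounded.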
Connect evals to product behavior
The evals became a bridge between content design, marketing science, conversation design, prompt behavior, and engineering implementation.
Fail answers that give advice without naming the campaign signal that supports it.
Process
How I got there
- Mapped high-value ads analysis questions across performance, budget, creative, audience, and delivery scenarios.
- Translated marketing science guidance into response criteria, ideal answer examples, and failure modes.
- Stood up high-priority foundational evals for clarity, grounding, specificity, transparency, and next-step usefulness.
- Created ad analysis quality evals that tested whether answers explained signals, caveated uncertainty, and gave useful next diagnostics.
- Calibrated examples with content, product, engineering, and subject-matter partners so scores meant the same thing across reviewers.
- Fed patterns back into prompts, response guidance, and quality conversations with engineering.
Outcome
Impact signals
- Stood up high-priority foundational evals for Ads Manager AI on the content design side.
- Created a suite of ad analysis quality evals for AI-generated campaign-performance answers.
- Aligned reviewers around observable standards for transparency, specificity, useful caveats, and actionability.
- Gave prompt and engineering partners a content-quality language they could use when tuning product behavior.
Learnings
What I would carry forward
- In AI work, content design moves upstream into the definition of quality.
- A rubric is product language with consequences.
- The best AI answer is not the most confident one. It is the one that helps the user make a better next decision.