AI quality / Content design

2026 / Case study draft

Foundational evals for Ads Manager AI

As part of the foundational evals team on the content design side, I stood up high-priority evals for Ads Manager AI and ad analysis quality.

Visual mockups

Foundational evals made AI quality inspectable.

Recreated artifacts showing how content design turned Ads Manager AI quality into foundational evals, ad analysis rubrics, failure modes, calibration examples, and engineering-ready criteria.

Foundational eval console

Scenario

Why did performance drop this week?

Ad analysis quality

Input signals

7-day comparison

Spend pacing: normal

Click-through rate: -18%

Creative changed: 2 days ago

Candidate answer

Spend is pacing normally, but click-through rate dropped after the creative update. Review the new asset before changing budget.

Signal named / Caveated / Next step

Failure mode library

The team needed names for bad answers.

Generic optimization

Suggests budget or creative changes without a supporting signal.

False certainty

Turns a likely driver into a guaranteed cause.

Data echo

Repeats metrics without explaining what the user can do next.

Calibration packet

Content design became the review protocol.

Prompt

Ideal answer

Scored answer

Reviewer note

Reviewers could score the same answer against the same standard, then feed that signal back into prompts, product behavior, and engineering conversations.

Overview

What changed

This case positions evaluation as frontier content design work. On the foundational evals team, I helped define what a good Ads Manager AI answer should do: name the signal, respect uncertainty, use the advertiser's context, and give a next step specific enough to act on.

Role

My part

Stood up high-priority foundational evals for Ads Manager AI, translating content quality into observable response criteria.
Created many ad analysis quality evals across campaign performance, signal explanation, answer usefulness, and next-step guidance.
Wrote rubrics, failure modes, ideal answers, and reviewer guidance for AI-generated campaign analysis.
Partnered with marketing science experts, conversational designers, product managers, and engineers to align response quality with product trust.

Problem

The product moment was unclear

Ad analysis is only useful when the answer explains what changed, why it may have happened, and what the advertiser can do next. Early AI responses could sound plausible while missing the signal, overclaiming causality, or giving generic advice. The team needed foundational evals that made quality concrete before the experience could scale.

Constraints

Rules of the work

The evals had to judge usefulness without pretending the system knew more than the data supported.
Marketing science expertise needed to become criteria that reviewers, prompts, and engineering workflows could all use.
Foundational evals had to be broad enough for core Ads Manager AI quality, while ad analysis evals had to be specific enough for campaign-performance answers.
Rubrics had to work across different campaign objectives, data availability, and advertiser sophistication levels.

Users

Who needed clarity

Advertisers asking why performance changed and what to inspect next.
Reviewers and content designers scoring AI answers for clarity, grounding, and actionability.
Engineers and product partners tuning prompts, data inputs, and response behavior.

Before / after

Examples in context, with the reason for each change

Ads Manager AI quality rubric

Foundational eval criterion for an AI answer explaining campaign performance.

Before

The answer is helpful and clear.

After

The answer names the performance change, explains likely drivers, gives one next diagnostic step, and avoids unsupported certainty.

Ad analysis response

AI answer in Ads Manager after an advertiser asks why results changed.

Before

Performance dropped this week. Consider increasing budget or changing creative.

After

Spend is pacing normally, but click-through rate dropped after the creative update. Review the new asset before changing budget.
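
A minimal sketch of how the rubric criterion above could be recorded as four observable checks, one per behavior it names. The class name, field names, and all-criteria-must-pass rule are assumptions made for illustration, not the team's actual schema, and the scores given to the two example answers are illustrative reviewer judgments.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class RubricScore:
    """One reviewer's judgment of one answer. Field names are hypothetical."""
    names_performance_change: bool
    explains_likely_driver: bool
    gives_one_diagnostic_step: bool
    avoids_unsupported_certainty: bool

    def failed_criteria(self) -> list[str]:
        """Return the names of the criteria the answer did not meet."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]

    def passes(self) -> bool:
        """Assumed pass rule: every criterion must be met."""
        return not self.failed_criteria()


# "Performance dropped this week. Consider increasing budget or changing creative."
# Illustrative judgment: no metric named, no driver, two generic actions.
before = RubricScore(
    names_performance_change=False,
    explains_likely_driver=False,
    gives_one_diagnostic_step=False,
    avoids_unsupported_certainty=True,
)

# "Spend is pacing normally, but click-through rate dropped after the
# creative update. Review the new asset before changing budget."
after = RubricScore(
    names_performance_change=True,
    explains_likely_driver=True,
    gives_one_diagnostic_step=True,
    avoids_unsupported_certainty=True,
)

print(before.passes(), before.failed_criteria())
print(after.passes(), after.failed_criteria())
```

Recording each behavior as its own field means a failing score says which behavior was missing, which is what lets a reviewer note point at something specific.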

Content decisions

The writing system underneath

Define good answers as behaviors

The foundational evals needed to describe what the answer does, not just how it feels. That made quality review less dependent on taste.

Names the changed metric, explains the likely driver, and gives one diagnostic next step.

Design for uncertainty

Ad analysis often works with partial evidence. The content standard had to help AI responses stay useful without overclaiming causality.

Use "may" or "likely" when the data suggests a driver but does not prove one.

Connect evals to product behavior

The evals became a bridge between content design, marketing science, conversation design, prompt behavior, and engineering implementation.

Fail answers that give advice without naming the campaign signal that supports it.
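
As a rough illustration of how that last rule could become a machine-checkable pre-screen, the sketch below flags answers that contain advice language but mention none of the signals available for the scenario. The function name, the advice word list, and the substring matching are all assumptions for illustration. It is a crude keyword heuristic that could route answers to a reviewer, not a substitute for the reviewer's judgment.

```python
import re

# Hypothetical list of verbs that suggest the answer is giving advice.
ADVICE_PATTERN = re.compile(
    r"\b(consider|increase|decrease|change|pause|review)\b",
    re.IGNORECASE,
)


def gives_unsupported_advice(answer: str, signal_names: list[str]) -> bool:
    """Return True when the answer advises an action but names no known signal.

    signal_names holds the signals available for the scenario, for example
    "spend" or "click-through rate". Matching is a plain substring check.
    """
    lowered = answer.lower()
    has_advice = ADVICE_PATTERN.search(answer) is not None
    names_signal = any(name.lower() in lowered for name in signal_names)
    return has_advice and not names_signal


signals = ["spend", "click-through rate"]

generic = "Performance dropped this week. Consider increasing budget or changing creative."
grounded = (
    "Spend is pacing normally, but click-through rate dropped after the "
    "creative update. Review the new asset before changing budget."
)

print(gives_unsupported_advice(generic, signals))
print(gives_unsupported_advice(grounded, signals))
```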

Process

How I got there

Mapped high-value ad analysis questions across performance, budget, creative, audience, and delivery scenarios.
Translated marketing science guidance into response criteria, ideal answer examples, and failure modes.
Stood up high-priority foundational evals for clarity, grounding, specificity, transparency, and next-step usefulness.
Created ad analysis quality evals that tested whether answers explained signals, caveated uncertainty, and gave useful next diagnostics.
Calibrated examples with content, product, engineering, and subject-matter partners so scores meant the same thing across reviewers.
Fed patterns back into prompts, response guidance, and quality conversations with engineering.
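
The calibration step above aims for scores that mean the same thing across reviewers. The case study does not say how agreement was measured. One common way to check it, shown here as an assumption rather than the team's method, is Cohen's kappa, which compares how often two reviewers agree against how often they would agree by chance. The labels and sample scores are made up for illustration.

```python
from collections import Counter


def cohens_kappa(reviewer_a: list[str], reviewer_b: list[str]) -> float:
    """Cohen's kappa for two reviewers labeling the same answers in order."""
    if len(reviewer_a) != len(reviewer_b) or not reviewer_a:
        raise ValueError("Both reviewers must score the same non-empty set.")

    n = len(reviewer_a)
    # Share of answers where both reviewers gave the same label.
    observed = sum(a == b for a, b in zip(reviewer_a, reviewer_b)) / n

    # Agreement expected by chance, from each reviewer's label frequencies.
    counts_a, counts_b = Counter(reviewer_a), Counter(reviewer_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if expected == 1.0:
        # Both reviewers used a single identical label for every answer.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical scores for four answers from one calibration packet.
scores_a = ["pass", "pass", "fail", "fail"]
scores_b = ["pass", "fail", "fail", "fail"]

print(cohens_kappa(scores_a, scores_b))  # 0.5
```

A low value after a calibration round would suggest the rubric wording or the ideal-answer examples still leave room for interpretation.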

Outcome

Impact signals

Stood up high-priority foundational evals for Ads Manager AI on the content design side.
Created many ad analysis quality evals for AI-generated campaign-performance answers.
Aligned reviewers around observable standards for transparency, specificity, useful caveats, and actionability.
Gave prompt and engineering partners a content-quality language they could use when tuning product behavior.

Learnings

What I would carry forward

In AI work, content design moves upstream into the definition of quality.
A rubric is product language with consequences.
The best AI answer is not the most confident one. It is the one that helps the user make the next better decision.
