Copy QA Pipeline


Multi-Model Quality Assurance for Ad Copy

5-gate pipeline | 3 AI models + arbitration | ~$0.33 per refresh cycle

Overview

The pipeline validates every ad copy variant through up to five sequential gates before it reaches Jon for review. Each gate evaluates different quality dimensions using a different AI model, providing diverse perspectives that catch issues a single model would miss.

Non-blocking

QA failures never block ad copy generation. If the pipeline crashes, copy still gets produced (wrapped in try/except at all integration points).

Jon approves everything

QA results are recommendations. Variants marked NEEDS_REVIEW go to Jon for a human decision.

Fail fast

Gate 0 (rule-based, $0) runs first and blocks AI review if basic compliance fails. No money wasted on copy that violates character limits.

Score enforcement

Every AI model is instructed to show its arithmetic. Post-processing recalculates the overall score from the dimension scores and overrides any result that deviates from the weighted formula.

Cost-controlled

Gate 4 (Opus, the most expensive) only fires when reviewers disagree or scores fall in a borderline zone. Typical cost: ~$0.33 per 15-variant refresh cycle.

Quick Start

Dry run (Gate 0 only, no API calls)

make copy-qa-dry

Or directly:

python scripts/copy_qa_pipeline.py \
  --input ./reports/ad-copy \
  --platform meta \
  --language en \
  --region us \
  --gate0-only

Full pipeline (requires API keys)

make copy-qa

Or directly:

python scripts/copy_qa_pipeline.py \
  --input ./reports/ad-copy \
  --platform meta \
  --language en \
  --region us \
  --output-dir ./reports/ad-copy/qa

Generate + QA in one step

python scripts/ad_copy_generator.py generate \
  --platform meta \
  --product "Thick-Cut Filet" \
  --region eu \
  --language de \
  --variants 5 \
  --output-dir ./reports/ad-copy

QA runs automatically after generation. Add --skip-qa to bypass.

Architecture

Input: variants (JSON)
  |
  v
Gate 0 - Rule-based ($0/variant)
  Platform compliance, character limits, banned words, language detection,
  glossary enforcement, D2C geo-fence.
  FAIL = stop, no AI review | PASS = continue
  |
  v
Gate 1 - Gemini 2.5 Flash (~$0.001/variant)
  Brand voice, CTA clarity, audience match, cultural fit. Few-shot examples in prompt.
  overall = brand*0.30 + cta*0.25 + audience*0.25 + cultural*0.20
  |
  v
Gate 2 - GPT-4.1 (~$0.005/variant)
  Fluency, persuasion, platform fit.
  overall = fluency*0.35 + persuasion*0.35 + platform_fit*0.30
  |
  v
Gate 3 - Claude Sonnet 4.6 (~$0.008/variant)
  Semantic fidelity, register consistency, competitive differentiation.
  Back-translation check for non-English.
  overall = semantic*0.35 + register*0.35 + differentiation*0.30
  |
  v
Arbitration triggered? (borderline scores, reviewer disagreement; if not, skip to Output)
  |
  v
Gate 4 - Claude Opus 4.6, conditional (~$0.04/variant)
  Reviews all 3 gate results. Renders verdict: PASS / NEEDS_REVIEW / FAIL.
  Double-weighted in final score.
  |
  v
Output
  Score aggregation, issue deduplication, status determination.
  BatchQAResult (JSON + markdown reports).

Gate 0 - Platform Compliance

File: scripts/lib/qa_gate0_compliance.py Cost: $0 Model: None (rule-based)

Gate 0 runs first on every variant. If it fails, no AI gates run. This saves money on obviously non-compliant copy.

Checks performed

| Check | What it validates |
|---|---|
| Required fields | All platform-required fields are present and non-empty |
| Character limits | Per-platform, per-field limits (see table below) |
| Banned words | Target-language equivalents of the COPY_AVOID list |
| Language detection | Stopword frequency heuristic confirms copy is in the target language |
| Glossary compliance | Product names not mangled, banned term alternatives absent |
| D2C geo-fence | Fish creative restricted to CA/WA/OR; no-discount geos enforced |

Character limits by platform

| Platform | Field | Limit |
|---|---|---|
| Meta | primary_text | 125 chars |
| Meta | headline | 40 chars |
| Meta | description | 30 chars |
| Google RSA | headline | 30 chars |
| Google RSA | description | 90 chars |
| Klaviyo | subject | 50 chars |
| Klaviyo | preview | 90 chars |
| Klaviyo | body | 2000 chars |
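A minimal sketch of the limit check, with a CHAR_LIMITS dict mirroring the table above. The authoritative constants and check live in copy_quality_scorer.py; this version is illustrative.

```python
# Hedged sketch of the character-limit check; the real CHAR_LIMITS in
# copy_quality_scorer.py is authoritative.
CHAR_LIMITS = {
    "meta": {"primary_text": 125, "headline": 40, "description": 30},
    "google": {"headline": 30, "description": 90},
    "klaviyo": {"subject": 50, "preview": 90, "body": 2000},
}

def check_char_limits(variant: dict, platform: str) -> list[dict]:
    """Return one issue dict per field that exceeds its platform limit."""
    issues = []
    for field, limit in CHAR_LIMITS.get(platform, {}).items():
        value = variant.get(field, "")
        if len(value) > limit:
            issues.append({
                "field": field,
                "check": "char_limit",
                "severity": "HIGH",
                "problem": f"{len(value)} chars exceeds limit of {limit}",
                "suggestion": f"Shorten {field} to <= {limit} characters",
            })
    return issues
```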

Result type

@dataclass
class Gate0Result:
    passed: bool
    issues: list[Gate0Issue]
    checks_run: int
    checks_passed: int

@dataclass
class Gate0Issue:
    field: str       # Which field failed (e.g., "primary_text")
    check: str       # Check name (e.g., "char_limit")
    severity: str    # HIGH, MEDIUM, LOW
    problem: str     # Human-readable description
    suggestion: str  # Fix suggestion

Gate 1 - Gemini 2.5 Flash

File: scripts/lib/qa_gemini_client.py Cost: ~$0.001/variant Model: gemini-2.5-flash API key: GEMINI_API_KEY

Evaluates brand alignment and creative quality.

Dimensions (0-100 each)

| Dimension | Weight | What it measures |
|---|---|---|
| brand_voice | 0.30 | Irreverent, chef-forward, "big meat energy" - NOT preachy/green/salad |
| cta_clarity | 0.25 | Clear, compelling, platform-appropriate call to action |
| audience_match | 0.25 | Targets flexitarians/meat-curious, NOT vegans/vegetarians |
| cultural_fit | 0.20 | For non-English: feels native, culturally adapted, not literally translated |

Score formula: overall = brand_voice*0.30 + cta_clarity*0.25 + audience_match*0.25 + cultural_fit*0.20
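In code, the weighted formula (shared in shape by Gates 1-3, with different weights) amounts to:

```python
# The Gate 1 weighted-score formula as a sketch; GATE1_WEIGHTS mirrors the
# table above, and the helper works for any gate's weight table.
GATE1_WEIGHTS = {
    "brand_voice": 0.30,
    "cta_clarity": 0.25,
    "audience_match": 0.25,
    "cultural_fit": 0.20,
}

def weighted_overall(dimensions: dict[str, int], weights: dict[str, float]) -> int:
    """Combine 0-100 dimension scores into a single 0-100 overall score."""
    return round(sum(dimensions[d] * w for d, w in weights.items()))
```

For example, dimension scores of 90/80/85/70 combine to round(27 + 20 + 21.25 + 14) = 82.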

The prompt includes 2 good + 2 bad few-shot examples per language/platform combination (Fix #3) and the terminology glossary for the target language.

Gate 2 - GPT-4.1

File: scripts/lib/qa_openai_client.py Cost: ~$0.005/variant Model: gpt-4.1 API key: OPENAI_API_KEY

Evaluates linguistic quality and persuasive effectiveness.

Dimensions (0-100 each)

| Dimension | Weight | What it measures |
|---|---|---|
| fluency | 0.35 | Reads naturally in target language, no awkward phrasing |
| persuasion | 0.35 | Compelling enough to drive clicks/conversions |
| platform_fit | 0.30 | Follows platform best practices (Meta conversational vs Google keyword-rich) |

Score formula: overall = fluency*0.35 + persuasion*0.35 + platform_fit*0.30

Gate 3 - Claude Sonnet 4.6

File: scripts/lib/qa_anthropic_client.py Cost: ~$0.008/variant Model: claude-sonnet-4-6 API key: ANTHROPIC_API_KEY

Evaluates semantic depth and cross-field consistency. For non-English copy, includes a back-translation check (Fix #4).

Dimensions (0-100 each)

| Dimension | Weight | What it measures |
|---|---|---|
| semantic_fidelity | 0.35 | For translated copy: preserves emotional weight and persuasive force |
| register_consistency | 0.35 | Tone consistent across all variant fields (headline + primary_text + description) |
| competitive_differentiation | 0.30 | Stands out, avoids generic food marketing cliches |

Score formula: overall = semantic_fidelity*0.35 + register_consistency*0.35 + competitive_differentiation*0.30

Back-translation (Fix #4)

For non-English copy, the Gate 3 prompt instructs Claude to:

  1. Translate the ad copy back to English
  2. Compare with the original brief/seed copy
  3. Flag any semantic drift (meaning changed, emphasis shifted, CTA weakened)
  4. Factor drift into the semantic_fidelity score

This happens within the same API call - no extra cost.

Gate 4 - Claude Opus 4.6 (Conditional)

File: scripts/lib/qa_anthropic_client.py Cost: ~$0.04/variant (only when triggered) Model: claude-opus-4-6 API key: ANTHROPIC_API_KEY

Opus arbitration fires only when reviewers disagree or scores fall in a borderline zone. It reviews all three previous gate results and renders a binding verdict.

Arbitration triggers

Gate 4 fires when ALL of these conditions are met:

  1. Borderline zone: Any Gate 1/2/3 score is between 80-89, OR the spread between highest and lowest reviewer exceeds 15 points
  2. Combined score range: The combined score before arbitration is between 75-94

If the combined score is already >= 95 (clearly excellent) or < 75 (clearly bad), Opus is skipped.
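The trigger logic above can be sketched as one predicate over the three gate scores. This is an illustrative reconstruction of the rules as documented, not the actual pipeline code.

```python
# Sketch of the arbitration decision described above (illustrative).
def should_arbitrate(gate_scores: list[int]) -> bool:
    """gate_scores = [gate1, gate2, gate3] overall scores (0-100)."""
    combined = round(sum(gate_scores) / 3)
    if combined >= 95 or combined < 75:
        return False  # clearly excellent or clearly bad: skip Opus
    borderline = any(80 <= s <= 89 for s in gate_scores)
    disagreement = max(gate_scores) - min(gate_scores) > 15
    return borderline or disagreement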

Verdict effects

| Verdict | Effect |
|---|---|
| PASS | No change to status |
| NEEDS_REVIEW | Overrides PASSED to NEEDS_REVIEW |
| FAIL | Overrides any status to FAILED |

The Opus score is double-weighted in the final aggregation formula.

Scoring and Pass/Fail Logic

Score aggregation

Without arbitration (most variants):

combined = round((Gate1 + Gate2 + Gate3) / 3)

With Opus arbitration:

combined = round((Gate1 + Gate2 + Gate3 + Gate4*2) / 5)

Thresholds

| Combined Score | Status | What happens |
|---|---|---|
| >= 85 | PASSED | Ready for Jon to deploy |
| 75 - 84 | NEEDS_REVIEW | Jon reviews manually |
| < 75 | FAILED | Auto-rejected, regenerate |

Blocking rules (override score thresholds)

| Rule | Threshold | Why |
|---|---|---|
| High-severity issue | ANY present | Critical problems are non-negotiable |
| Medium-severity overload | > 2 issues | Too many quality concerns |
| Brand voice floor | < 70 from any reviewer | Brand is non-negotiable for Juicy Marbles |
| Audience match floor | < 75 from any reviewer | Wrong audience wastes ad spend |
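Putting the thresholds and blocking rules together, status determination looks roughly like this. It is an illustrative sketch that assumes a triggered blocking rule forces FAILED; the real _determine_status() may handle blocking differently.

```python
# Illustrative status determination combining the thresholds and blocking
# rules above (assumption: blocking rules force FAILED).
def determine_status(combined: int, issues: list[dict],
                     brand_scores: list[int], audience_scores: list[int]) -> str:
    high = [i for i in issues if i["severity"] == "HIGH"]
    medium = [i for i in issues if i["severity"] == "MEDIUM"]
    blocked = (
        bool(high)                                # any high-severity issue
        or len(medium) > 2                        # medium-severity overload
        or any(s < 70 for s in brand_scores)      # brand voice floor
        or any(s < 75 for s in audience_scores)   # audience match floor
    )
    if blocked:
        return "FAILED"
    if combined >= 85:
        return "PASSED"
    if combined >= 75:
        return "NEEDS_REVIEW"
    return "FAILED"
```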

Severity classification

| Severity | Examples |
|---|---|
| HIGH | Wrong language, banned word present, completely off-brand, factually incorrect product claims, missing CTA, wrong product/region |
| MEDIUM | Slightly generic tone, suboptimal word choice, borderline char limit, minor cultural awkwardness |
| LOW | Style preferences, regional variations, minor formatting |

Score formula enforcement (Fix #6)

Every AI gate prompt requires the model to show its arithmetic using the weighted formula. After receiving the response, the pipeline recalculates the overall score from the dimension scores. If the model's overall deviates by more than 2 points from the formula, the pipeline overrides it with the calculated value and sets _score_override: true in the result.
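A minimal sketch of that enforcement step, assuming a result dict with "dimensions" and "overall" keys (the real gate clients may shape their results differently):

```python
# Sketch of Fix #6: recalculate the overall from dimension scores and
# override if the model's figure drifts by more than 2 points.
def enforce_score(result: dict, weights: dict[str, float]) -> dict:
    calculated = round(sum(result["dimensions"][d] * w
                           for d, w in weights.items()))
    if abs(result["overall"] - calculated) > 2:
        result["overall"] = calculated
        result["_score_override"] = True
    return result
```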

Six TEEI Gap Fixes

Issues identified in the TEEI translation QA system that were fixed from the start in this pipeline.

Fix #1

Terminology Glossary

Locked product names and approved translations per language. Gate 0 enforces these rules before any AI review.

scripts/lib/qa_glossary.py

Fix #2

Issue Deduplication

After all gates run, issues are deduplicated by (field, category). Fuzzy matching merges similar issues. Unanimous flags (3+ models agree) get severity bumped up.

scripts/copy_qa_pipeline.py
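The (field, category) merge can be sketched as below. This illustrative version keys on exact matches only (the fuzzy matching mentioned above is omitted) and assumes each raw issue carries a "source" field naming its gate.

```python
# Hedged sketch of issue deduplication (Fix #2): merge by (field, category),
# mark unanimous flags, and bump their severity one level.
def deduplicate_issues(issues: list[dict]) -> list[dict]:
    merged: dict[tuple[str, str], dict] = {}
    for issue in issues:
        key = (issue["field"], issue["category"])
        if key not in merged:
            merged[key] = {**issue, "_sources": [issue["source"]], "_flagged_by": 1}
        else:
            merged[key]["_sources"].append(issue["source"])
            merged[key]["_flagged_by"] += 1
    levels = ["LOW", "MEDIUM", "HIGH"]
    for entry in merged.values():
        entry["_unanimous"] = entry["_flagged_by"] >= 3
        if entry["_unanimous"] and entry["severity"] != "HIGH":
            # 3+ models agree: bump severity one level
            entry["severity"] = levels[levels.index(entry["severity"]) + 1]
    return list(merged.values())
```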
Fix #3

Few-Shot Examples

Every AI gate prompt includes 2 good + 2 bad examples for the target language and platform, with scores and explanations.

scripts/lib/qa_prompts.py

Fix #4

Back-Translation Check

Built into Gate 3. For non-English copy, Claude translates back to English and compares with the original brief. Semantic drift affects the score.

Gate 3 prompt (no extra API call)

Fix #5

Cross-Variant Terminology

Within a batch, the pipeline tracks term choices across all variants for non-English copy and flags inconsistencies.

scripts/copy_qa_pipeline.py
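One way to track term choices across a batch, sketched with hypothetical function and parameter names (the real check_terminology_consistency() may differ):

```python
# Hypothetical sketch of cross-variant term tracking: collect which rendering
# of each concept the batch uses, and flag mixed usage.
def check_terminology_consistency(variant_texts: list[str],
                                  candidates: dict[str, list[str]]) -> list[str]:
    """candidates maps a concept to its possible renderings, e.g.
    {"enjoyment": ["genuss", "vergnügen"]} (illustrative values)."""
    flags = []
    for concept, forms in candidates.items():
        used = {form for form in forms
                if any(form in text.lower() for text in variant_texts)}
        if len(used) > 1:
            flags.append(f"Inconsistent rendering of '{concept}': {sorted(used)}")
    return flags
```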
Fix #6

Score Formula Enforcement

Post-processing recalculates scores from dimension values. Overrides if the model's score deviates by > 2 points. _score_override flag tracks this.

All gate clients

Integration Points

ad_copy_generator.py (post-generation QA)

After generate_copy_variants() produces variants, QA runs automatically:

from lib.copy_qa_pipeline import validate_batch

qa_result = validate_batch(
    variants,
    platform=platform,
    language=language,
    region=region,
    skip_ai=args.skip_qa,
)

The QA summary is included in the output JSON. Add --skip-qa to bypass QA entirely. QA is wrapped in try/except - if it fails, variants are still output without QA scores.

creative_refresh.py (replacing _score_variant)

The creative refresh loop uses the QA pipeline to score replacement copy:

from lib.copy_qa_pipeline import validate_batch

qa = validate_batch(
    replacements,
    platform="meta",
    language=language or "en",
    region=region or "us",
    skip_ai=skip_qa,
)

Variants scoring below the threshold (combined_score < 85) are filtered out. The existing --min-score 7 flag (on a 1-10 scale) maps to ~85 on the 0-100 QA scale.

copy_quality_scorer.py (foundation for Gate 0)

Gate 0 imports existing functions from copy_quality_scorer.py: CHAR_LIMITS, check_banned_words(), check_char_limits(). No code was duplicated.

CLI Usage

copy_qa_pipeline.py

usage: copy_qa_pipeline.py [-h] --input INPUT [--platform {meta,google,klaviyo}]
                           [--language LANGUAGE] [--region REGION]
                           [--output-dir OUTPUT_DIR] [--gate0-only]

options:
  --input          Directory containing ad copy JSON files, or a single JSON file
  --platform       Ad platform: meta, google, klaviyo (default: meta)
  --language       Language code: en, de, it, es (default: en)
  --region         Region code: us, eu (default: us)
  --output-dir     Directory to save QA results (default: ./reports/ad-copy/qa)
  --gate0-only     Run Gate 0 only - no API calls, no cost

Variant field names by platform

| Platform | Required fields |
|---|---|
| Meta | primary_text, headline, description |
| Google RSA | headlines (list), descriptions (list) |
| Klaviyo | subject, preview, body |

Output

The pipeline produces two files in the output directory:

  1. qa_results_YYYYMMDD_HHMMSS.json - Full structured results (all gate scores, issues, blocking reasons)
  2. qa_report_YYYYMMDD_HHMMSS.md - Human-readable markdown report

Makefile targets

make copy-qa       # Full pipeline (Meta, English, US)
make copy-qa-dry   # Gate 0 only (no API calls, $0)

Terminology Glossary

Universal terms (never translate)

Juicy Marbles, Thick-Cut Filet, Baby Ribs, Umami Burger, Kinda Cod, Kinda Salmon, Loin

Approved translations

| English | German (de) | Italian (it) | Spanish (es) |
|---|---|---|---|
| plant-based | pflanzlich | vegetale | vegetal |
| flavor | Geschmack | sapore | sabor |
| enjoyment | Genuss | - | - |

Banned alternatives (never use)

| Language | Banned terms |
|---|---|
| German | veganes Fleisch, Fleischersatz, gesund, pflanzenbasiert |
| Italian | carne vegana, sostituto della carne, sano |
| Spanish | carne vegana, sustituto de carne, saludable |

Cost Breakdown

Per variant

| Gate | Model | Cost |
|---|---|---|
| 0 (rule-based) | None | $0 |
| 1 (brand + tone) | Gemini 2.5 Flash | ~$0.001 |
| 2 (fluency + persuasion) | GPT-4.1 | ~$0.005 |
| 3 (semantic depth) | Claude Sonnet 4.6 | ~$0.008 |
| 4 (arbitration, ~20% triggered) | Claude Opus 4.6 | ~$0.04 |

Per creative refresh cycle (15 variants)

| Component | Cost |
|---|---|
| Gates 0-3 (all 15) | ~$0.21 |
| Gate 4 (~3 variants) | ~$0.12 |
| Total | ~$0.33 |

Monthly cost (weekly refresh)

| Period | Cost |
|---|---|
| Weekly | ~$0.33 |
| Monthly (4 cycles) | ~$1.32 |

API Keys

| Key | Required for | Env var |
|---|---|---|
| Gemini | Gate 1 | GEMINI_API_KEY |
| OpenAI | Gate 2 | OPENAI_API_KEY |
| Anthropic | Gates 3 + 4 | ANTHROPIC_API_KEY |

All keys are optional. If a key is missing, the corresponding gate logs an error and returns a zero score (it does not crash the pipeline). For --gate0-only / make copy-qa-dry, no API keys are needed.
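The graceful-degradation behavior can be sketched as a wrapper around each gate call. Names here (run_gate_safely and its parameters) are hypothetical; the actual fallback lives inside the gate clients.

```python
# Sketch of the fallback implied above: a missing key or a gate crash
# yields a zero-score result instead of failing the run.
import logging
import os

def run_gate_safely(gate_fn, variant: dict, env_var: str) -> dict:
    if not os.environ.get(env_var):
        logging.error("Missing %s; gate skipped with zero score", env_var)
        return {"overall": 0, "error": f"missing {env_var}"}
    try:
        return gate_fn(variant)
    except Exception as exc:  # never let a gate crash the pipeline
        logging.error("Gate failed: %s", exc)
        return {"overall": 0, "error": str(exc)}
```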

File Reference

scripts/
  copy_qa_pipeline.py              # Orchestrator (790 LOC)
                                   # - BatchQAResult, VariantQAResult dataclasses
                                   # - deduplicate_issues() (Fix #2)
                                   # - check_terminology_consistency() (Fix #5)
                                   # - _aggregate_scores(), _determine_status()
                                   # - validate_variant(), validate_batch()
                                   # - generate_report() (JSON + markdown)
                                   # - CLI entry point (main)
  lib/
    __init__.py
    qa_gate0_compliance.py         # Gate 0: platform + language + glossary (~250 LOC)
    qa_gemini_client.py            # Gate 1: Gemini reviewer (~200 LOC)
    qa_openai_client.py            # Gate 2: GPT reviewer (~200 LOC)
    qa_anthropic_client.py         # Gate 3+4: Sonnet + Opus (~300 LOC)
    qa_retry_utils.py              # Retry infra (~276 LOC)
    qa_prompts.py                  # Prompt builders + few-shot (~400 LOC)
    qa_language_detect.py          # Stopword-based language detection (~228 LOC)
    qa_glossary.py                 # Locked terminology per language (~218 LOC)

Testing

All tests mock external API calls. No real API keys are needed to run the test suite.

# Run all tests
make test

# Run only QA pipeline tests
python -m pytest tests/test_copy_qa_pipeline.py tests/test_qa_*.py -v

Test coverage by module

| Test file | Tests | Coverage |
|---|---|---|
| test_copy_qa_pipeline.py | ~73 | Orchestrator: scoring, dedup, terminology, batch, status, blocking rules |
| test_qa_gate0_compliance.py | ~50 | Gate 0: char limits, banned words, language detect, glossary, geo-fence |
| test_qa_gemini_client.py | 32 | Gate 1: response parsing, score coercion, batch, metadata, formula enforcement |
| test_qa_openai_client.py | 34 | Gate 2: response parsing, markdown stripping, batch partial failures, edge cases |
| test_qa_anthropic_client.py | 52 | Gates 3+4: parsing, back-translation, arbitration triggers, verdict handling |
| test_qa_retry_utils.py | ~20 | Retry: backoff, jitter, retryable detection, timeout |
| test_qa_language_detect.py | ~30 | Language detection: all 4 languages, short text, strong markers, edge cases |

Data Types Reference

BatchQAResult

@dataclass
class BatchQAResult:
    variants: list[VariantQAResult]   # One result per input variant
    summary: dict                      # Aggregate counts
    terminology_report: dict           # Cross-variant term consistency
    generated_at: str                  # ISO timestamp

    # Properties:
    .passed        -> list[VariantQAResult]  # status PASSED
    .needs_review  -> list[VariantQAResult]  # status NEEDS_REVIEW
    .failed        -> list[VariantQAResult]  # status FAILED

VariantQAResult

@dataclass
class VariantQAResult:
    variant_index: int             # Position in input batch
    variant: dict                  # The original variant dict
    status: str                    # PASSED, NEEDS_REVIEW, FAILED
    combined_score: int            # 0-100 aggregate score
    gate0: Gate0Result | None      # Platform compliance result
    gate1: dict | None             # Gemini result
    gate2: dict | None             # GPT result
    gate3: dict | None             # Sonnet result
    gate4: dict | None             # Opus result (if arbitrated)
    issues: list[dict]             # Deduplicated issues from all gates
    blocking_reasons: list[str]    # Why it was blocked (if FAILED)
    arbitrated: bool               # Whether Gate 4 was triggered

Summary dict

{
    "total": 5,
    "passed": 3,
    "needs_review": 1,
    "failed": 1,
    "arbitrated": 2,
    "avg_score": 82,
    "terminology_inconsistencies": 0
}

Issue dict (after deduplication)

{
    "field": "primary_text",
    "severity": "MEDIUM",
    "category": "fluency",
    "problem": "Awkward phrasing",
    "suggestion": "Rephrase...",
    "_sources": ["gate1", "gate2"],
    "_flagged_by": 2,
    "_unanimous": false
}