Copy QA Pipeline
Multi-Model Quality Assurance for Ad Copy
Overview
The pipeline validates every ad copy variant through up to five sequential gates before it reaches Jon for review. Each gate evaluates different quality dimensions using a different AI model, providing diverse perspectives that catch issues a single model would miss.
Non-blocking
QA failures never block ad copy generation. If the pipeline crashes, copy still gets produced (wrapped in try/except at all integration points).
Jon approves everything
QA results are recommendations. Variants marked NEEDS_REVIEW go to Jon for a human decision.
Fail fast
Gate 0 (rule-based, $0) runs first and blocks AI review if basic compliance fails. No money wasted on copy that violates character limits.
Score enforcement
Every AI model is told to show its arithmetic. Post-processing overrides any reported overall score that doesn't match the weighted formula.
Cost-controlled
Gate 4 (Opus, the most expensive) only fires when reviewers disagree or scores fall in a borderline zone. Typical cost: ~$0.33 per 15-variant refresh cycle.
Quick Start
Dry run (Gate 0 only, no API calls)
make copy-qa-dry
Or directly:
python scripts/copy_qa_pipeline.py \
--input ./reports/ad-copy \
--platform meta \
--language en \
--region us \
--gate0-only
Full pipeline (requires API keys)
make copy-qa
Or directly:
python scripts/copy_qa_pipeline.py \
--input ./reports/ad-copy \
--platform meta \
--language en \
--region us \
--output-dir ./reports/ad-copy/qa
Generate + QA in one step
python scripts/ad_copy_generator.py generate \
--platform meta \
--product "Thick-Cut Filet" \
--region eu \
--language de \
--variants 5 \
--output-dir ./reports/ad-copy
QA runs automatically after generation. Add --skip-qa to bypass.
Architecture
Gate 0 - Platform Compliance
Gate 0 runs first on every variant. If it fails, no AI gates run. This saves money on obviously non-compliant copy.
Checks performed
| Check | What it validates |
|---|---|
| Required fields | All platform-required fields are present and non-empty |
| Character limits | Per-platform, per-field limits (see table below) |
| Banned words | Target-language equivalents of the COPY_AVOID list |
| Language detection | Stopword frequency heuristic confirms copy is in the target language |
| Glossary compliance | Product names not mangled, banned term alternatives absent |
| D2C geo-fence | Fish creative restricted to CA/WA/OR; no-discount geos enforced |
Character limits by platform
| Platform | Field | Limit |
|---|---|---|
| Meta | primary_text | 125 chars |
| Meta | headline | 40 chars |
| Meta | description | 30 chars |
| Google RSA | headline | 30 chars |
| Google RSA | description | 90 chars |
| Klaviyo | subject | 50 chars |
| Klaviyo | preview | 90 chars |
| Klaviyo | body | 2000 chars |
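The limits table above maps directly onto a per-field length check. This is a sketch assuming a nested-dict shape for CHAR_LIMITS; the real table and check_char_limits() live in copy_quality_scorer.py and may be shaped differently.

```python
# Sketch of the Gate 0 character-limit check (assumed data shape).
CHAR_LIMITS = {
    "meta": {"primary_text": 125, "headline": 40, "description": 30},
    "google": {"headline": 30, "description": 90},
    "klaviyo": {"subject": 50, "preview": 90, "body": 2000},
}

def check_char_limits(variant: dict, platform: str) -> list[dict]:
    """Return one issue dict per field exceeding its platform limit."""
    issues = []
    for field, limit in CHAR_LIMITS.get(platform, {}).items():
        value = variant.get(field, "")
        if len(value) > limit:
            issues.append({
                "field": field,
                "check": "char_limit",
                "severity": "HIGH",
                "problem": f"{field} is {len(value)} chars (limit {limit})",
                "suggestion": f"Shorten {field} to <= {limit} characters",
            })
    return issues
```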
Result type
@dataclass
class Gate0Result:
passed: bool
issues: list[Gate0Issue]
checks_run: int
checks_passed: int
@dataclass
class Gate0Issue:
field: str # Which field failed (e.g., "primary_text")
check: str # Check name (e.g., "char_limit")
severity: str # HIGH, MEDIUM, LOW
problem: str # Human-readable description
suggestion: str # Fix suggestion
Gate 1 - Gemini 2.5 Flash
Evaluates brand alignment and creative quality.
Dimensions (0-100 each)
| Dimension | Weight | What it measures |
|---|---|---|
| brand_voice | 0.30 | Irreverent, chef-forward, "big meat energy" - NOT preachy/green/salad |
| cta_clarity | 0.25 | Clear, compelling, platform-appropriate call to action |
| audience_match | 0.25 | Targets flexitarians/meat-curious, NOT vegans/vegetarians |
| cultural_fit | 0.20 | For non-English: feels native, culturally adapted, not literally translated |
Score formula: overall = brand_voice*0.30 + cta_clarity*0.25 + audience_match*0.25 + cultural_fit*0.20
The prompt includes 2 good + 2 bad few-shot examples per language/platform combination (Fix #3) and the terminology glossary for the target language.
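The weighted formula above is just a dot product of dimension scores and weights. A minimal sketch (function name assumed; Gates 2 and 3 follow the same pattern with their own weights):

```python
# Gate 1 weights from the table above.
GATE1_WEIGHTS = {
    "brand_voice": 0.30,
    "cta_clarity": 0.25,
    "audience_match": 0.25,
    "cultural_fit": 0.20,
}

def gate1_overall(scores: dict) -> int:
    """Weighted average of the four dimension scores, rounded to int."""
    return round(sum(scores[dim] * w for dim, w in GATE1_WEIGHTS.items()))
```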
Gate 2 - GPT-4.1
Evaluates linguistic quality and persuasive effectiveness.
Dimensions (0-100 each)
| Dimension | Weight | What it measures |
|---|---|---|
| fluency | 0.35 | Reads naturally in target language, no awkward phrasing |
| persuasion | 0.35 | Compelling enough to drive clicks/conversions |
| platform_fit | 0.30 | Follows platform best practices (Meta conversational vs Google keyword-rich) |
Score formula: overall = fluency*0.35 + persuasion*0.35 + platform_fit*0.30
Gate 3 - Claude Sonnet 4.6
Evaluates semantic depth and cross-field consistency. For non-English copy, includes a back-translation check (Fix #4).
Dimensions (0-100 each)
| Dimension | Weight | What it measures |
|---|---|---|
| semantic_fidelity | 0.35 | For translated copy: preserves emotional weight and persuasive force |
| register_consistency | 0.35 | Tone consistent across all variant fields (headline + primary_text + description) |
| competitive_differentiation | 0.30 | Stands out, avoids generic food marketing cliches |
Score formula: overall = semantic_fidelity*0.35 + register_consistency*0.35 + competitive_differentiation*0.30
Back-translation (Fix #4)
For non-English copy, the Gate 3 prompt instructs Claude to:
- Translate the ad copy back to English
- Compare with the original brief/seed copy
- Flag any semantic drift (meaning changed, emphasis shifted, CTA weakened)
- Factor drift into the semantic_fidelity score
This happens within the same API call - no extra cost.
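One way the steps above could be folded into the Gate 3 prompt for non-English copy is a conditional prompt suffix. This is an illustrative sketch only; the function name and text are assumptions, and the real builder lives in qa_prompts.py.

```python
# Illustrative back-translation instructions appended for non-English copy.
BACK_TRANSLATION_BLOCK = """
Before scoring semantic_fidelity:
1. Translate the ad copy back to English.
2. Compare it with the original brief.
3. List any semantic drift (meaning changed, emphasis shifted, CTA weakened).
4. Lower semantic_fidelity in proportion to the drift you found.
"""

def build_gate3_prompt(base_prompt: str, language: str) -> str:
    """Append the back-translation block only for non-English targets."""
    if language != "en":
        return base_prompt + BACK_TRANSLATION_BLOCK
    return base_prompt
```

Because the instructions ride along in the same request, the check adds tokens but no extra API call.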
Gate 4 - Claude Opus 4.6 (Conditional)
Opus arbitration fires only when reviewers disagree or scores fall in a borderline zone. It reviews all three previous gate results and renders a binding verdict.
Arbitration triggers
Gate 4 fires when ALL of these conditions are met:
- Borderline zone: Any Gate 1/2/3 score is between 80-89, OR the spread between highest and lowest reviewer exceeds 15 points
- Combined score range: The combined score before arbitration is between 75-94
If the combined score is already >= 95 (clearly excellent) or < 75 (clearly bad), Opus is skipped.
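The trigger conditions above reduce to a few comparisons. A sketch, assuming a hypothetical should_arbitrate() helper:

```python
def should_arbitrate(gate_scores: list[int], combined: int) -> bool:
    """Fire Opus only for borderline or contested variants (per the rules above)."""
    borderline = any(80 <= s <= 89 for s in gate_scores)
    contested = max(gate_scores) - min(gate_scores) > 15
    in_range = 75 <= combined <= 94
    return (borderline or contested) and in_range
```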
Verdict effects
| Verdict | Effect |
|---|---|
| PASS | No change to status |
| NEEDS_REVIEW | Overrides PASSED to NEEDS_REVIEW |
| FAIL | Overrides any status to FAILED |
The Opus score is double-weighted in the final aggregation formula.
Scoring and Pass/Fail Logic
Score aggregation
Without arbitration (most variants):
combined = round((Gate1 + Gate2 + Gate3) / 3)
With Opus arbitration:
combined = round((Gate1 + Gate2 + Gate3 + Gate4*2) / 5)
Thresholds
| Combined Score | Status | What happens |
|---|---|---|
| >= 85 | PASSED | Ready for Jon to deploy |
| 75 - 84 | NEEDS_REVIEW | Jon reviews manually |
| < 75 | FAILED | Auto-rejected, regenerate |
Blocking rules (override score thresholds)
| Rule | Threshold | Why |
|---|---|---|
| High-severity issue | ANY present | Critical problems are non-negotiable |
| Medium-severity overload | > 2 issues | Too many quality concerns |
| Brand voice floor | < 70 from any reviewer | Brand is non-negotiable for Juicy Marbles |
| Audience match floor | < 75 from any reviewer | Wrong audience wastes ad spend |
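A sketch of how the four blocking rules could be evaluated, assuming issues carry a severity field and each reviewer's dimension scores arrive as a dict (names and shapes are assumptions):

```python
def blocking_reasons(issues: list[dict], dim_scores: list[dict]) -> list[str]:
    """Return the rules tripped, regardless of the combined score."""
    reasons = []
    if any(i["severity"] == "HIGH" for i in issues):
        reasons.append("high_severity_issue")
    if sum(i["severity"] == "MEDIUM" for i in issues) > 2:
        reasons.append("medium_severity_overload")
    # Floors apply if ANY reviewer scored the dimension below the floor.
    if any(s.get("brand_voice", 100) < 70 for s in dim_scores):
        reasons.append("brand_voice_floor")
    if any(s.get("audience_match", 100) < 75 for s in dim_scores):
        reasons.append("audience_match_floor")
    return reasons
```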
Severity classification
| Severity | Examples |
|---|---|
| HIGH | Wrong language, banned word present, completely off-brand, factually incorrect product claims, missing CTA, wrong product/region |
| MEDIUM | Slightly generic tone, suboptimal word choice, borderline char limit, minor cultural awkwardness |
| LOW | Style preferences, regional variations, minor formatting |
Score formula enforcement (Fix #6)
Every AI gate prompt requires the model to show its arithmetic using the weighted formula. After receiving the response, the pipeline recalculates the overall score from the dimension scores. If the model's overall deviates by more than 2 points from the formula, the pipeline overrides it with the calculated value and sets _score_override: true in the result.
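The override described above can be sketched as a small post-processing step (the function name and result shape are assumptions):

```python
def enforce_formula(result: dict, weights: dict) -> dict:
    """Recalculate overall from dimension scores; override if drift > 2 points."""
    expected = round(sum(result["dimensions"][d] * w for d, w in weights.items()))
    if abs(result["overall"] - expected) > 2:
        result["overall"] = expected
        result["_score_override"] = True
    return result
```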
Six TEEI Gap Fixes
Issues identified in the TEEI translation QA system that were fixed from the start in this pipeline.
Terminology Glossary
Locked product names and approved translations per language. Gate 0 enforces these rules before any AI review.
Issue Deduplication
After all gates run, issues are deduplicated by (field, category). Fuzzy matching merges similar issues. Unanimous flags (3+ models agree) get severity bumped up.
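A minimal sketch of the (field, category) merge and the unanimous-severity bump; the fuzzy matching the real deduplicate_issues() performs is omitted, and the issue shape is assumed:

```python
def deduplicate_issues(issues: list[dict]) -> list[dict]:
    """Merge issues sharing (field, category); bump severity on 3+ agreement."""
    merged: dict = {}
    for issue in issues:
        key = (issue["field"], issue["category"])
        if key not in merged:
            merged[key] = {**issue, "_sources": [issue["_source"]]}
        else:
            merged[key]["_sources"].append(issue["_source"])
    rank = {"LOW": 0, "MEDIUM": 1, "HIGH": 2}
    levels = ["LOW", "MEDIUM", "HIGH"]
    for item in merged.values():
        item["_flagged_by"] = len(item["_sources"])
        item["_unanimous"] = item["_flagged_by"] >= 3
        if item["_unanimous"]:
            # Unanimous flags get severity bumped one level (capped at HIGH).
            item["severity"] = levels[min(rank[item["severity"]] + 1, 2)]
        item.pop("_source", None)
    return list(merged.values())
```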
Few-Shot Examples
Every AI gate prompt includes 2 good + 2 bad examples for the target language and platform, with scores and explanations.
Back-Translation Check
Built into Gate 3. For non-English copy, Claude translates back to English and compares with the original brief. Semantic drift affects the score.
Cross-Variant Terminology
Within a batch, the pipeline tracks term choices across all variants for non-English copy and flags inconsistencies.
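One plausible sketch of the cross-variant tracking, assuming a mapping from concepts to their acceptable surface forms; the real check_terminology_consistency() may work differently:

```python
from collections import defaultdict

def terminology_report(variants: list[dict], term_variants: dict) -> dict:
    """Flag concepts rendered with more than one surface form across a batch.

    term_variants maps a concept to its possible forms,
    e.g. {"flavor": ["Geschmack", "Aroma"]}.
    """
    seen = defaultdict(set)
    for variant in variants:
        text = " ".join(str(v) for v in variant.values()).lower()
        for concept, forms in term_variants.items():
            for form in forms:
                if form.lower() in text:
                    seen[concept].add(form)
    inconsistent = {c: sorted(f) for c, f in seen.items() if len(f) > 1}
    return {"inconsistencies": len(inconsistent), "details": inconsistent}
```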
Score Formula Enforcement
Post-processing recalculates scores from dimension values. Overrides if the model's score deviates by > 2 points. _score_override flag tracks this.
Integration Points
ad_copy_generator.py (post-generation QA)
After generate_copy_variants() produces variants, QA runs automatically:
from lib.copy_qa_pipeline import validate_batch
qa_result = validate_batch(
variants,
platform=platform,
language=language,
region=region,
skip_ai=args.skip_qa,
)
The QA summary is included in the output JSON. Add --skip-qa to bypass QA entirely. QA is wrapped in try/except - if it fails, variants are still output without QA scores.
creative_refresh.py (replacing _score_variant)
The creative refresh loop uses the QA pipeline to score replacement copy:
from lib.copy_qa_pipeline import validate_batch
qa = validate_batch(
replacements,
platform="meta",
language=language or "en",
region=region or "us",
skip_ai=skip_qa,
)
Variants scoring below the threshold (combined_score < 85) are filtered out. The existing --min-score 7 flag (on a 1-10 scale) maps to ~85 on the 0-100 QA scale.
copy_quality_scorer.py (foundation for Gate 0)
Gate 0 imports existing functions from copy_quality_scorer.py: CHAR_LIMITS, check_banned_words(), check_char_limits(). No code was duplicated.
CLI Usage
copy_qa_pipeline.py
usage: copy_qa_pipeline.py [-h] --input INPUT [--platform {meta,google,klaviyo}]
[--language LANGUAGE] [--region REGION]
[--output-dir OUTPUT_DIR] [--gate0-only]
options:
--input Directory containing ad copy JSON files, or a single JSON file
--platform Ad platform: meta, google, klaviyo (default: meta)
--language Language code: en, de, it, es (default: en)
--region Region code: us, eu (default: us)
--output-dir Directory to save QA results (default: ./reports/ad-copy/qa)
--gate0-only Run Gate 0 only - no API calls, no cost
Variant field names by platform
| Platform | Required fields |
|---|---|
| Meta | primary_text, headline, description |
| Google RSA | headlines (list), descriptions (list) |
| Klaviyo | subject, preview, body |
Output
The pipeline produces two files in the output directory:
- qa_results_YYYYMMDD_HHMMSS.json - Full structured results (all gate scores, issues, blocking reasons)
- qa_report_YYYYMMDD_HHMMSS.md - Human-readable markdown report
Makefile targets
make copy-qa # Full pipeline (Meta, English, US)
make copy-qa-dry # Gate 0 only (no API calls, $0)
Terminology Glossary
Universal terms (never translate)
Approved translations
| English | German (de) | Italian (it) | Spanish (es) |
|---|---|---|---|
| plant-based | pflanzlich | vegetale | vegetal |
| flavor | Geschmack | sapore | sabor |
| enjoyment | Genuss | - | - |
Banned alternatives (never use)
| Language | Banned terms |
|---|---|
| German | veganes Fleisch, Fleischersatz, gesund, pflanzenbasiert |
| Italian | carne vegana, sostituto della carne, sano |
| Spanish | carne vegana, sustituto de carne, saludable |
Cost Breakdown
Per variant
| Gate | Model | Cost |
|---|---|---|
| 0 (rule-based) | None | $0 |
| 1 (brand + tone) | Gemini 2.5 Flash | ~$0.001 |
| 2 (fluency + persuasion) | GPT-4.1 | ~$0.005 |
| 3 (semantic depth) | Claude Sonnet 4.6 | ~$0.008 |
| 4 (arbitration, ~20% triggered) | Claude Opus 4.6 | ~$0.04 |
Per creative refresh cycle (15 variants)
| Component | Cost |
|---|---|
| Gates 0-3 (all 15) | ~$0.21 |
| Gate 4 (~3 variants) | ~$0.12 |
| Total | ~$0.33 |
Monthly cost (weekly refresh)
| Period | Cost |
|---|---|
| Weekly | ~$0.33 |
| Monthly (4 cycles) | ~$1.32 |
API Keys
| Key | Required for | Env var |
|---|---|---|
| Gemini | Gate 1 | GEMINI_API_KEY |
| OpenAI | Gate 2 | OPENAI_API_KEY |
| Anthropic | Gates 3 + 4 | ANTHROPIC_API_KEY |
All keys are optional. If a key is missing, the corresponding gate logs an error and returns a zero score (it does not crash the pipeline). For --gate0-only / make copy-qa-dry, no API keys are needed.
File Reference
scripts/
copy_qa_pipeline.py # Orchestrator (790 LOC)
# - BatchQAResult, VariantQAResult dataclasses
# - deduplicate_issues() (Fix #2)
# - check_terminology_consistency() (Fix #5)
# - _aggregate_scores(), _determine_status()
# - validate_variant(), validate_batch()
# - generate_report() (JSON + markdown)
# - CLI entry point (main)
lib/
__init__.py
qa_gate0_compliance.py # Gate 0: platform + language + glossary (~250 LOC)
qa_gemini_client.py # Gate 1: Gemini reviewer (~200 LOC)
qa_openai_client.py # Gate 2: GPT reviewer (~200 LOC)
qa_anthropic_client.py # Gate 3+4: Sonnet + Opus (~300 LOC)
qa_retry_utils.py # Retry infra (~276 LOC)
qa_prompts.py # Prompt builders + few-shot (~400 LOC)
qa_language_detect.py # Stopword-based language detection (~228 LOC)
qa_glossary.py # Locked terminology per language (~218 LOC)
Testing
All tests mock external API calls. No real API keys are needed to run the test suite.
# Run all tests
make test
# Run only QA pipeline tests
python -m pytest tests/test_copy_qa_pipeline.py tests/test_qa_*.py -v
Test coverage by module
| Test file | Tests | Coverage |
|---|---|---|
| test_copy_qa_pipeline.py | ~73 | Orchestrator: scoring, dedup, terminology, batch, status, blocking rules |
| test_qa_gate0_compliance.py | ~50 | Gate 0: char limits, banned words, language detect, glossary, geo-fence |
| test_qa_gemini_client.py | 32 | Gate 1: response parsing, score coercion, batch, metadata, formula enforcement |
| test_qa_openai_client.py | 34 | Gate 2: response parsing, markdown stripping, batch partial failures, edge cases |
| test_qa_anthropic_client.py | 52 | Gates 3+4: parsing, back-translation, arbitration triggers, verdict handling |
| test_qa_retry_utils.py | ~20 | Retry: backoff, jitter, retryable detection, timeout |
| test_qa_language_detect.py | ~30 | Language detection: all 4 languages, short text, strong markers, edge cases |
Data Types Reference
BatchQAResult
@dataclass
class BatchQAResult:
variants: list[VariantQAResult] # One result per input variant
summary: dict # Aggregate counts
terminology_report: dict # Cross-variant term consistency
generated_at: str # ISO timestamp
# Properties:
.passed -> list[VariantQAResult] # status PASSED
.needs_review -> list[VariantQAResult] # status NEEDS_REVIEW
.failed -> list[VariantQAResult] # status FAILED
VariantQAResult
@dataclass
class VariantQAResult:
variant_index: int # Position in input batch
variant: dict # The original variant dict
status: str # PASSED, NEEDS_REVIEW, FAILED
combined_score: int # 0-100 aggregate score
gate0: Gate0Result | None # Platform compliance result
gate1: dict | None # Gemini result
gate2: dict | None # GPT result
gate3: dict | None # Sonnet result
gate4: dict | None # Opus result (if arbitrated)
issues: list[dict] # Deduplicated issues from all gates
blocking_reasons: list[str] # Why it was blocked (if FAILED)
arbitrated: bool # Whether Gate 4 was triggered
Summary dict
{
"total": 5,
"passed": 3,
"needs_review": 1,
"failed": 1,
"arbitrated": 2,
"avg_score": 82,
"terminology_inconsistencies": 0
}
Issue dict (after deduplication)
{
"field": "primary_text",
"severity": "MEDIUM",
"category": "fluency",
"problem": "Awkward phrasing",
"suggestion": "Rephrase...",
"_sources": ["gate1", "gate2"],
"_flagged_by": 2,
"_unanimous": false
}