Copy QA Pipeline
Multi-Model Quality Assurance for Ad Copy
Overview
The pipeline validates every ad copy variant through up to five sequential gates before it reaches Jon for review. Each gate evaluates different quality dimensions using a different AI model, providing diverse perspectives that catch issues a single model would miss.
Non-blocking
QA failures never block ad copy generation. If the pipeline crashes, copy still gets produced (wrapped in try/except at all integration points).
Jon approves everything
QA results are recommendations. Variants marked NEEDS_REVIEW go to Jon for a human decision.
Fail fast
Gate 0 (rule-based, $0) runs first and blocks AI review if basic compliance fails. No money wasted on copy that violates character limits.
Score enforcement
Every AI model is told to show its arithmetic. Post-processing overrides any reported overall score that doesn't match the weighted formula.
Cost-controlled
Gate 4 (Opus, the most expensive) only fires when reviewers disagree or scores fall in a borderline zone. Typical cost: ~$0.33 per 15-variant refresh cycle.
Quick Start
Dry run (Gate 0 only, no API calls)
make copy-qa-dry
Or directly:
python scripts/copy_qa_pipeline.py \
--input ./reports/ad-copy \
--platform meta \
--language en \
--region us \
--gate0-only
Full pipeline (requires API keys)
make copy-qa
Or directly:
python scripts/copy_qa_pipeline.py \
--input ./reports/ad-copy \
--platform meta \
--language en \
--region us \
--output-dir ./reports/ad-copy/qa
Generate + QA in one step
python scripts/ad_copy_generator.py generate \
--platform meta \
--product "Thick-Cut Filet" \
--region eu \
--language de \
--variants 5 \
--output-dir ./reports/ad-copy
QA runs automatically after generation. Add --skip-qa to bypass.
Architecture
Gate 0 - Platform Compliance
Gate 0 runs first on every variant. If it fails, no AI gates run. This saves money on obviously non-compliant copy.
Checks performed
| Check | What it validates |
|---|---|
| Required fields | All platform-required fields are present and non-empty |
| Character limits | Per-platform, per-field limits (see table below) |
| Banned words | Target-language equivalents of the COPY_AVOID list |
| Language detection | Stopword frequency heuristic confirms copy is in the target language |
| Glossary compliance | Product names not mangled, banned term alternatives absent |
| D2C geo-fence | Fish creative restricted to CA/WA/OR; no-discount geos enforced |
Character limits by platform
| Platform | Field | Limit |
|---|---|---|
| Meta | primary_text | 125 chars |
| Meta | headline | 40 chars |
| Meta | description | 30 chars |
| Google RSA | headline | 30 chars |
| Google RSA | description | 90 chars |
| Klaviyo | subject | 50 chars |
| Klaviyo | preview | 90 chars |
| Klaviyo | body | 2000 chars |
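The limits table above maps directly onto a per-field length check. This is a sketch assuming a nested-dict shape for CHAR_LIMITS; the real table and check_char_limits() live in copy_quality_scorer.py and may be shaped differently.

```python
# Sketch of the Gate 0 character-limit check (assumed data shape).
CHAR_LIMITS = {
    "meta": {"primary_text": 125, "headline": 40, "description": 30},
    "google": {"headline": 30, "description": 90},
    "klaviyo": {"subject": 50, "preview": 90, "body": 2000},
}

def check_char_limits(variant: dict, platform: str) -> list[dict]:
    """Return one issue dict per field exceeding its platform limit."""
    issues = []
    for field, limit in CHAR_LIMITS.get(platform, {}).items():
        value = variant.get(field, "")
        if len(value) > limit:
            issues.append({
                "field": field,
                "check": "char_limit",
                "severity": "HIGH",
                "problem": f"{field} is {len(value)} chars (limit {limit})",
                "suggestion": f"Shorten {field} to <= {limit} characters",
            })
    return issues
```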
Result type
@dataclass
class Gate0Result:
passed: bool
issues: list[Gate0Issue]
checks_run: int
checks_passed: int
@dataclass
class Gate0Issue:
field: str # Which field failed (e.g., "primary_text")
check: str # Check name (e.g., "char_limit")
severity: str # HIGH, MEDIUM, LOW
problem: str # Human-readable description
suggestion: str # Fix suggestion
Gate 1 - Gemini 2.5 Flash
Evaluates brand alignment and creative quality.
Dimensions (0-100 each)
| Dimension | Weight | What it measures |
|---|---|---|
| brand_voice | 0.30 | Irreverent, chef-forward, "big meat energy" - NOT preachy/green/salad |
| cta_clarity | 0.25 | Clear, compelling, platform-appropriate call to action |
| audience_match | 0.25 | Targets flexitarians/meat-curious, NOT vegans/vegetarians |
| cultural_fit | 0.20 | For non-English: feels native, culturally adapted, not literally translated |
Score formula: overall = brand_voice*0.30 + cta_clarity*0.25 + audience_match*0.25 + cultural_fit*0.20
The prompt includes 2 good + 2 bad few-shot examples per language/platform combination (Fix #3) and the terminology glossary for the target language.
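The weighted formula above is just a dot product of dimension scores and weights. A minimal sketch (function name assumed; Gates 2 and 3 follow the same pattern with their own weights):

```python
# Gate 1 weights from the table above.
GATE1_WEIGHTS = {
    "brand_voice": 0.30,
    "cta_clarity": 0.25,
    "audience_match": 0.25,
    "cultural_fit": 0.20,
}

def gate1_overall(scores: dict) -> int:
    """Weighted average of the four dimension scores, rounded to int."""
    return round(sum(scores[dim] * w for dim, w in GATE1_WEIGHTS.items()))
```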
Gate 2 - GPT-4.1
Evaluates linguistic quality and persuasive effectiveness.
Dimensions (0-100 each)
| Dimension | Weight | What it measures |
|---|---|---|
| fluency | 0.35 | Reads naturally in target language, no awkward phrasing |
| persuasion | 0.35 | Compelling enough to drive clicks/conversions |
| platform_fit | 0.30 | Follows platform best practices (Meta conversational vs Google keyword-rich) |
Score formula: overall = fluency*0.35 + persuasion*0.35 + platform_fit*0.30
Gate 3 - Claude Sonnet 4.6
Evaluates semantic depth and cross-field consistency. For non-English copy, includes a back-translation check (Fix #4).
Dimensions (0-100 each)
| Dimension | Weight | What it measures |
|---|---|---|
| semantic_fidelity | 0.35 | For translated copy: preserves emotional weight and persuasive force |
| register_consistency | 0.35 | Tone consistent across all variant fields (headline + primary_text + description) |
| competitive_differentiation | 0.30 | Stands out, avoids generic food marketing cliches |
Score formula: overall = semantic_fidelity*0.35 + register_consistency*0.35 + competitive_differentiation*0.30
Back-translation (Fix #4)
For non-English copy, the Gate 3 prompt instructs Claude to:
- Translate the ad copy back to English
- Compare with the original brief/seed copy
- Flag any semantic drift (meaning changed, emphasis shifted, CTA weakened)
- Factor drift into the semantic_fidelity score
This happens within the same API call - no extra cost.
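One way the steps above could be folded into the Gate 3 prompt for non-English copy is a conditional prompt suffix. This is an illustrative sketch only; the function name and text are assumptions, and the real builder lives in qa_prompts.py.

```python
# Illustrative back-translation instructions appended for non-English copy.
BACK_TRANSLATION_BLOCK = """
Before scoring semantic_fidelity:
1. Translate the ad copy back to English.
2. Compare it with the original brief.
3. List any semantic drift (meaning changed, emphasis shifted, CTA weakened).
4. Lower semantic_fidelity in proportion to the drift you found.
"""

def build_gate3_prompt(base_prompt: str, language: str) -> str:
    """Append the back-translation block only for non-English targets."""
    if language != "en":
        return base_prompt + BACK_TRANSLATION_BLOCK
    return base_prompt
```

Because the instructions ride along in the same request, the check adds tokens but no extra API call.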
Gate 4 - Claude Opus 4.6 (Conditional)
Opus arbitration fires only when reviewers disagree or scores fall in a borderline zone. It reviews all three previous gate results and renders a binding verdict.
Arbitration triggers
Gate 4 fires when ALL of these conditions are met:
- Borderline zone: Any Gate 1/2/3 score is between 80-89, OR the spread between highest and lowest reviewer exceeds 15 points
- Combined score range: The combined score before arbitration is between 75-94
If the combined score is already >= 95 (clearly excellent) or < 75 (clearly bad), Opus is skipped.
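The trigger conditions above reduce to a few comparisons. A sketch, assuming a hypothetical should_arbitrate() helper:

```python
def should_arbitrate(gate_scores: list[int], combined: int) -> bool:
    """Fire Opus only for borderline or contested variants (per the rules above)."""
    borderline = any(80 <= s <= 89 for s in gate_scores)
    contested = max(gate_scores) - min(gate_scores) > 15
    in_range = 75 <= combined <= 94
    return (borderline or contested) and in_range
```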
Verdict effects
| Verdict | Effect |
|---|---|
| PASS | No change to status |
| NEEDS_REVIEW | Overrides PASSED to NEEDS_REVIEW |
| FAIL | Overrides any status to FAILED |
The Opus score is double-weighted in the final aggregation formula.
Scoring and Pass/Fail Logic
Score aggregation
Without arbitration (most variants):
combined = round((Gate1 + Gate2 + Gate3) / 3)
With Opus arbitration:
combined = round((Gate1 + Gate2 + Gate3 + Gate4*2) / 5)
Thresholds
| Combined Score | Status | What happens |
|---|---|---|
| >= 85 | PASSED | Ready for Jon to deploy |
| 75 - 84 | NEEDS_REVIEW | Jon reviews manually |
| < 75 | FAILED | Auto-rejected, regenerate |
Blocking rules (override score thresholds)
| Rule | Threshold | Why |
|---|---|---|
| High-severity issue | ANY present | Critical problems are non-negotiable |
| Medium-severity overload | > 2 issues | Too many quality concerns |
| Brand voice floor | < 70 from any reviewer | Brand is non-negotiable for Juicy Marbles |
| Audience match floor | < 75 from any reviewer | Wrong audience wastes ad spend |
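A sketch of how the four blocking rules could be evaluated, assuming issues carry a severity field and each reviewer's dimension scores arrive as a dict (names and shapes are assumptions):

```python
def blocking_reasons(issues: list[dict], dim_scores: list[dict]) -> list[str]:
    """Return the rules tripped, regardless of the combined score."""
    reasons = []
    if any(i["severity"] == "HIGH" for i in issues):
        reasons.append("high_severity_issue")
    if sum(i["severity"] == "MEDIUM" for i in issues) > 2:
        reasons.append("medium_severity_overload")
    # Floors apply if ANY reviewer scored the dimension below the floor.
    if any(s.get("brand_voice", 100) < 70 for s in dim_scores):
        reasons.append("brand_voice_floor")
    if any(s.get("audience_match", 100) < 75 for s in dim_scores):
        reasons.append("audience_match_floor")
    return reasons
```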
Severity classification
| Severity | Examples |
|---|---|
| HIGH | Wrong language, banned word present, completely off-brand, factually incorrect product claims, missing CTA, wrong product/region |
| MEDIUM | Slightly generic tone, suboptimal word choice, borderline char limit, minor cultural awkwardness |
| LOW | Style preferences, regional variations, minor formatting |
Score formula enforcement (Fix #6)
Every AI gate prompt requires the model to show its arithmetic using the weighted formula. After receiving the response, the pipeline recalculates the overall score from the dimension scores. If the model's overall deviates by more than 2 points from the formula, the pipeline overrides it with the calculated value and sets _score_override: true in the result.
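The override described above can be sketched as a small post-processing step (the function name and result shape are assumptions):

```python
def enforce_formula(result: dict, weights: dict) -> dict:
    """Recalculate overall from dimension scores; override if drift > 2 points."""
    expected = round(sum(result["dimensions"][d] * w for d, w in weights.items()))
    if abs(result["overall"] - expected) > 2:
        result["overall"] = expected
        result["_score_override"] = True
    return result
```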
Six TEEI Gap Fixes
Issues identified in the TEEI translation QA system that were fixed from the start in this pipeline.
Terminology Glossary
Locked product names and approved translations per language. Gate 0 enforces these rules before any AI review.
Issue Deduplication
After all gates run, issues are deduplicated by (field, category). Fuzzy matching merges similar issues. Unanimous flags (3+ models agree) get severity bumped up.
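A minimal sketch of the (field, category) merge and the unanimous-severity bump; the fuzzy matching the real deduplicate_issues() performs is omitted, and the issue shape is assumed:

```python
def deduplicate_issues(issues: list[dict]) -> list[dict]:
    """Merge issues sharing (field, category); bump severity on 3+ agreement."""
    merged: dict = {}
    for issue in issues:
        key = (issue["field"], issue["category"])
        if key not in merged:
            merged[key] = {**issue, "_sources": [issue["_source"]]}
        else:
            merged[key]["_sources"].append(issue["_source"])
    rank = {"LOW": 0, "MEDIUM": 1, "HIGH": 2}
    levels = ["LOW", "MEDIUM", "HIGH"]
    for item in merged.values():
        item["_flagged_by"] = len(item["_sources"])
        item["_unanimous"] = item["_flagged_by"] >= 3
        if item["_unanimous"]:
            # Unanimous flags get severity bumped one level (capped at HIGH).
            item["severity"] = levels[min(rank[item["severity"]] + 1, 2)]
        item.pop("_source", None)
    return list(merged.values())
```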
Few-Shot Examples
Every AI gate prompt includes 2 good + 2 bad examples for the target language and platform, with scores and explanations.
Back-Translation Check
Built into Gate 3. For non-English copy, Claude translates back to English and compares with the original brief. Semantic drift affects the score.
Cross-Variant Terminology
Within a batch, the pipeline tracks term choices across all variants for non-English copy and flags inconsistencies.
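One plausible sketch of the cross-variant tracking, assuming a mapping from concepts to their acceptable surface forms; the real check_terminology_consistency() may work differently:

```python
from collections import defaultdict

def terminology_report(variants: list[dict], term_variants: dict) -> dict:
    """Flag concepts rendered with more than one surface form across a batch.

    term_variants maps a concept to its possible forms,
    e.g. {"flavor": ["Geschmack", "Aroma"]}.
    """
    seen = defaultdict(set)
    for variant in variants:
        text = " ".join(str(v) for v in variant.values()).lower()
        for concept, forms in term_variants.items():
            for form in forms:
                if form.lower() in text:
                    seen[concept].add(form)
    inconsistent = {c: sorted(f) for c, f in seen.items() if len(f) > 1}
    return {"inconsistencies": len(inconsistent), "details": inconsistent}
```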
Score Formula Enforcement
Post-processing recalculates scores from dimension values. Overrides if the model's score deviates by > 2 points. _score_override flag tracks this.
Integration Points
ad_copy_generator.py (post-generation QA)
After generate_copy_variants() produces variants, QA runs automatically:
from lib.copy_qa_pipeline import validate_batch
qa_result = validate_batch(
variants,
platform=platform,
language=language,
region=region,
skip_ai=args.skip_qa,
)
The QA summary is included in the output JSON. Add --skip-qa to bypass QA entirely. QA is wrapped in try/except - if it fails, variants are still output without QA scores.
creative_refresh.py (replacing _score_variant)
The creative refresh loop uses the QA pipeline to score replacement copy:
from lib.copy_qa_pipeline import validate_batch
qa = validate_batch(
replacements,
platform="meta",
language=language or "en",
region=region or "us",
skip_ai=skip_qa,
)
Variants scoring below the threshold (combined_score < 85) are filtered out. The existing --min-score 7 flag (on a 1-10 scale) maps to ~85 on the 0-100 QA scale.
copy_quality_scorer.py (foundation for Gate 0)
Gate 0 imports existing functions from copy_quality_scorer.py: CHAR_LIMITS, check_banned_words(), check_char_limits(). No code was duplicated.
CLI Usage
copy_qa_pipeline.py
usage: copy_qa_pipeline.py [-h] --input INPUT [--platform {meta,google,klaviyo}]
[--language LANGUAGE] [--region REGION]
[--output-dir OUTPUT_DIR] [--gate0-only]
options:
--input Directory containing ad copy JSON files, or a single JSON file
--platform Ad platform: meta, google, klaviyo (default: meta)
--language Language code: en, de, it, es (default: en)
--region Region code: us, eu (default: us)
--output-dir Directory to save QA results (default: ./reports/ad-copy/qa)
--gate0-only Run Gate 0 only - no API calls, no cost
Variant field names by platform
| Platform | Required fields |
|---|---|
| Meta | primary_text, headline, description |
| Google RSA | headlines (list), descriptions (list) |
| Klaviyo | subject, preview, body |
Output
The pipeline produces two files in the output directory:
- qa_results_YYYYMMDD_HHMMSS.json - Full structured results (all gate scores, issues, blocking reasons)
- qa_report_YYYYMMDD_HHMMSS.md - Human-readable markdown report
Makefile targets
make copy-qa # Full pipeline (Meta, English, US)
make copy-qa-dry # Gate 0 only (no API calls, $0)
Terminology Glossary
Universal terms (never translate)
Approved translations
| English | German (de) | Italian (it) | Spanish (es) |
|---|---|---|---|
| plant-based | pflanzlich | vegetale | vegetal |
| flavor | Geschmack | sapore | sabor |
| enjoyment | Genuss | - | - |
Banned alternatives (never use)
| Language | Banned terms |
|---|---|
| German | veganes Fleisch, Fleischersatz, gesund, pflanzenbasiert |
| Italian | carne vegana, sostituto della carne, sano |
| Spanish | carne vegana, sustituto de carne, saludable |
Cost Breakdown
Per variant
| Gate | Model | Cost |
|---|---|---|
| 0 (rule-based) | None | $0 |
| 1 (brand + tone) | Gemini 2.5 Flash | ~$0.001 |
| 2 (fluency + persuasion) | GPT-4.1 | ~$0.005 |
| 3 (semantic depth) | Claude Sonnet 4.6 | ~$0.008 |
| 4 (arbitration, ~20% triggered) | Claude Opus 4.6 | ~$0.04 |
Per creative refresh cycle (15 variants)
| Component | Cost |
|---|---|
| Gates 0-3 (all 15) | ~$0.21 |
| Gate 4 (~3 variants) | ~$0.12 |
| Total | ~$0.33 |
Monthly cost (weekly refresh)
| Period | Cost |
|---|---|
| Weekly | ~$0.33 |
| Monthly (4 cycles) | ~$1.32 |
API Keys
| Key | Required for | Env var |
|---|---|---|
| Gemini | Gate 1 | GEMINI_API_KEY |
| OpenAI | Gate 2 | OPENAI_API_KEY |
| Anthropic | Gates 3 + 4 | ANTHROPIC_API_KEY |
All keys are optional. If a key is missing, the corresponding gate logs an error and returns a zero score (it does not crash the pipeline). For --gate0-only / make copy-qa-dry, no API keys are needed.
File Reference
scripts/
copy_qa_pipeline.py # Orchestrator (790 LOC)
# - BatchQAResult, VariantQAResult dataclasses
# - deduplicate_issues() (Fix #2)
# - check_terminology_consistency() (Fix #5)
# - _aggregate_scores(), _determine_status()
# - validate_variant(), validate_batch()
# - generate_report() (JSON + markdown)
# - CLI entry point (main)
lib/
__init__.py
qa_gate0_compliance.py # Gate 0: platform + language + glossary (~250 LOC)
qa_gemini_client.py # Gate 1: Gemini reviewer (~200 LOC)
qa_openai_client.py # Gate 2: GPT reviewer (~200 LOC)
qa_anthropic_client.py # Gate 3+4: Sonnet + Opus (~300 LOC)
qa_retry_utils.py # Retry infra (~276 LOC)
qa_prompts.py # Prompt builders + few-shot (~400 LOC)
qa_language_detect.py # Stopword-based language detection (~228 LOC)
qa_glossary.py # Locked terminology per language (~218 LOC)
Testing
All tests mock external API calls. No real API keys are needed to run the test suite.
# Run all tests
make test
# Run only QA pipeline tests
python -m pytest tests/test_copy_qa_pipeline.py tests/test_qa_*.py -v
Test coverage by module
| Test file | Tests | Coverage |
|---|---|---|
| test_copy_qa_pipeline.py | ~73 | Orchestrator: scoring, dedup, terminology, batch, status, blocking rules |
| test_qa_gate0_compliance.py | ~50 | Gate 0: char limits, banned words, language detect, glossary, geo-fence |
| test_qa_gemini_client.py | 32 | Gate 1: response parsing, score coercion, batch, metadata, formula enforcement |
| test_qa_openai_client.py | 34 | Gate 2: response parsing, markdown stripping, batch partial failures, edge cases |
| test_qa_anthropic_client.py | 52 | Gates 3+4: parsing, back-translation, arbitration triggers, verdict handling |
| test_qa_retry_utils.py | ~20 | Retry: backoff, jitter, retryable detection, timeout |
| test_qa_language_detect.py | ~30 | Language detection: all 4 languages, short text, strong markers, edge cases |
Data Types Reference
BatchQAResult
@dataclass
class BatchQAResult:
variants: list[VariantQAResult] # One result per input variant
summary: dict # Aggregate counts
terminology_report: dict # Cross-variant term consistency
generated_at: str # ISO timestamp
# Properties:
.passed -> list[VariantQAResult] # status PASSED
.needs_review -> list[VariantQAResult] # status NEEDS_REVIEW
.failed -> list[VariantQAResult] # status FAILED
VariantQAResult
@dataclass
class VariantQAResult:
variant_index: int # Position in input batch
variant: dict # The original variant dict
status: str # PASSED, NEEDS_REVIEW, FAILED
combined_score: int # 0-100 aggregate score
gate0: Gate0Result | None # Platform compliance result
gate1: dict | None # Gemini result
gate2: dict | None # GPT result
gate3: dict | None # Sonnet result
gate4: dict | None # Opus result (if arbitrated)
issues: list[dict] # Deduplicated issues from all gates
blocking_reasons: list[str] # Why it was blocked (if FAILED)
arbitrated: bool # Whether Gate 4 was triggered
Summary dict
{
"total": 5,
"passed": 3,
"needs_review": 1,
"failed": 1,
"arbitrated": 2,
"avg_score": 82,
"terminology_inconsistencies": 0
}
Issue dict (after deduplication)
{
"field": "primary_text",
"severity": "MEDIUM",
"category": "fluency",
"problem": "Awkward phrasing",
"suggestion": "Rephrase...",
"_sources": ["gate1", "gate2"],
"_flagged_by": 2,
"_unanimous": false
}