Project 01 · Completed

Synthetic Data Generation Pipeline

I built a pipeline that generates synthetic training data, validates it with an LLM judge, and self-corrects until every record passes. Started with a 20% failure rate, ended at zero.

Python · Pydantic · OpenAI API · GPT-4o-mini · GPT-4o · Streamlit

Key Metrics

  • Records Passed: 100%
  • Human-AI Agreement: 81.7%
  • V2 Improvement: 78% reduction
  • Failure Reduction: 36 → 0

The Problem

I needed training data for a home DIY domain. No suitable public dataset existed and manual labeling would have taken weeks. So I built a generation pipeline. The easy part was getting GPT-4o-mini to produce records. The hard part was catching subtle failures. Fields that pass type checks but contain nonsensical values. Descriptions that contradict their categories. Difficulty ratings that don’t match actual complexity. I needed a quality gate that could catch what schema validation misses.

Architecture

Stage 1: Templates (v1 & v2) → Generator (GPT-4o-mini + Instructor) → Validator (Pydantic)
Stage 2: Evaluator (GPT-4o judge) → Analysis (Pandas + Seaborn) → failure patterns → Corrector (feeds back to stages 1 and 2)

↺ Feedback Loop

1. Corrector → Templates: improved templates
2. Corrector → Evaluator: fixed records

GPT-4o-mini generates the records. GPT-4o judges them. I used the Instructor library to wire Pydantic validation directly into the API calls so malformed responses get retried automatically instead of silently passing through. That one decision eliminated most parsing issues.

Key Results

All 30 records passed schema validation on the first attempt. But the real work was calibrating the judge. My first judge prompt approved everything, which meant it was useless. After four rounds of prompt refinement, I got it to a 20% failure rate that matched my own manual assessment. Dual labeling achieved 81.7% agreement between the LLM judge and my human labels.
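The agreement figure reduces to a simple check over the dual-labeled set. A sketch, with illustrative labels (the actual labeled data isn't reproduced in this write-up):

```python
def agreement(human: list[str], judge: list[str]) -> float:
    """Fraction of records where the judge's call matches the human label."""
    assert len(human) == len(judge)
    return sum(h == j for h, j in zip(human, judge)) / len(human)


# Illustrative labels only, standing in for the real dual-labeled set.
human = ["pass", "fail", "pass", "pass", "fail"]
judge = ["pass", "fail", "fail", "pass", "fail"]
print(f"{agreement(human, judge):.1%}")  # 4 of 5 labels match
```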

V1 templates produced 36 failures across 180 evaluations. Failure analysis showed two dominant patterns: incomplete answers (50%) and poor-quality tips (43%). Both concentrated in the plumbing and HVAC categories. Electrical had zero failures because its template was already specific enough. Improved V2 templates cut failures to 8; targeted fixes brought the count to zero.
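The failure-pattern rollup is a small pandas exercise. Column names and rows below are made up, standing in for the real 180-evaluation results table:

```python
import pandas as pd

# Illustrative evaluation results; one row per judged record.
evals = pd.DataFrame({
    "category": ["plumbing", "plumbing", "hvac", "electrical", "hvac", "plumbing"],
    "failure_type": ["incomplete_answer", "poor_tips", "incomplete_answer",
                     None, "incomplete_answer", "poor_tips"],
})

failures = evals.dropna(subset=["failure_type"])

# Which failure types dominate?
by_type = failures["failure_type"].value_counts(normalize=True)

# Which categories do failures concentrate in?
by_category = failures["category"].value_counts()
```

Grouping by failure type first, then by category, is what surfaced the "fix the template, not the record" pattern: a handful of systematic causes rather than scattered one-off errors.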

Key Decisions

Instructor over raw API calls. Pydantic validation built into the API call. Malformed responses get retried automatically. Saved me from writing a parsing layer.

Flat schema over nested. GPT-4o-mini struggled with deep nesting. It would hallucinate extra levels or collapse hierarchies. Flattening to single-level with string-encoded lists fixed generation quality and simplified the correction loop.
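A sketch of the flat shape, with hypothetical field names: list-like content lives in one delimiter-separated string rather than a nested sub-model, and gets split back out only when downstream code needs it.

```python
from pydantic import BaseModel, field_validator


class FlatRecord(BaseModel):
    """Single-level schema; hypothetical fields for illustration."""

    title: str
    category: str
    steps: str  # e.g. "cut patch; apply compound; sand smooth"

    @field_validator("steps")
    @classmethod
    def steps_not_empty(cls, v: str) -> str:
        # Guard against an empty or all-whitespace encoded list.
        if not [s for s in v.split(";") if s.strip()]:
            raise ValueError("steps must encode at least one item")
        return v


rec = FlatRecord(title="Patch drywall", category="walls",
                 steps="cut patch; apply compound; sand smooth")
steps = [s.strip() for s in rec.steps.split(";")]  # decode when needed
```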

Judge calibration through iteration. A judge that approves everything is worthless. Four prompt versions, each adding specificity: field-level rules, cross-field consistency, domain plausibility. The 20% failure rate matched my manual review of a held-out sample.

Fix templates, not records. When the judge found systematic issues like difficulty ratings skewing “Easy,” I fixed the generation template rather than patching individual outputs. Four ADRs document each failure pattern and the template change that resolved it.

Known Gaps

No human-in-the-loop review. 81.7% agreement means about 18% of judge calls are wrong. Good enough for finding failure patterns in bulk. Not good enough for deciding whether a specific record passes. A review step between judge evaluation and correction would close this.

No drift monitoring. Templates degrade as the domain shifts. A random-sample re-evaluation on each pipeline run would catch this. The closed-loop pattern holds, but calibration needs to be continuous, not one-shot.
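The proposed re-evaluation step could be as small as the sketch below: re-judge a random sample each run and flag drift when the failure rate leaves the calibrated band. Function names and thresholds here are assumptions, not existing pipeline code.

```python
import random


def drift_check(records, judge, sample_size=10, baseline=0.20,
                tolerance=0.10, seed=None):
    """Re-judge a random sample; flag drift outside the calibrated band.

    `judge` returns True for a passing record. `baseline` is the failure
    rate the judge was calibrated to (20% in this project's v4 prompt).
    """
    rng = random.Random(seed)
    sample = rng.sample(records, min(sample_size, len(records)))
    fail_rate = sum(not judge(r) for r in sample) / len(sample)
    return abs(fail_rate - baseline) <= tolerance, fail_rate
```

Run on every pipeline execution, this keeps calibration continuous: a drifting failure rate triggers a re-calibration pass before new records are trusted.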

  • synthetic-data
  • pydantic
  • llm-evaluation