AI-Powered Resume Coach
I generated 250 synthetic resumes across 5 fit levels, labeled them for 5 failure modes with zero LLM calls, and found that writing template choice accounts for a 66-percentage-point difference in failure rates (chi-squared=32.74, p<0.001).
The Problem
Resume advice is generic and untestable. “Improve your summary” is not actionable. There is no way to know whether one approach actually works better than another without controlled data.
I wanted to answer a specific question: what causes resumes to fail screening, and can you measure it? So I built a pipeline that generates 250 synthetic resumes across 5 fit levels, labels each one for 5 failure modes using deterministic rules (zero LLM calls, 250ms for all 250 pairs), and runs A/B tests across writing templates to find what actually matters.
Architecture
- Generator: 50 jobs × 5 fit levels = 250 resumes
- Labeler: 5 failure mode flags, Jaccard similarity
- Judge: GPT-4o evaluation, quality score 0-1
- Corrector: Instructor retry loop, 8/8 corrected
- Analysis: 9 charts, pipeline_summary.json
- Reasoning: 4 reasoning questions per pair
- Retrieval: ChromaDB, all-MiniLM-L6-v2 embeddings
- API: FastAPI, 9 endpoints
- Demo: 5-page demo
Each module reads from JSONL files and writes output back to the same directory. Resumable at any stage. The labeler is pure Python with zero LLM calls: deterministic, 250ms for 250 pairs, fully unit-testable. The GPT-4o judge is optional (controlled by --skip-judge), keeping the fast path cheap.
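The stage contract is simple enough to sketch. This is a minimal illustration of the read-transform-write pattern, not the project's actual module layout (file and function names here are made up):

```python
import json
import tempfile
from pathlib import Path

def read_jsonl(path: Path) -> list[dict]:
    """One record per line; a later stage resumes by re-reading earlier output."""
    with path.open() as f:
        return [json.loads(line) for line in f if line.strip()]

def write_jsonl(path: Path, records: list[dict]) -> None:
    with path.open("w") as f:
        for rec in records:
            f.write(json.dumps(rec) + "\n")

def run_stage(in_path: Path, out_path: Path, transform) -> None:
    """Skip stages whose output already exists, making the pipeline resumable."""
    if out_path.exists():
        return
    write_jsonl(out_path, [transform(r) for r in read_jsonl(in_path)])

# Toy run: a stage that doubles a field; re-running it is a no-op
workdir = Path(tempfile.mkdtemp())
write_jsonl(workdir / "resumes.jsonl", [{"fit": 1}, {"fit": 2}])
run_stage(workdir / "resumes.jsonl", workdir / "labels.jsonl",
          lambda r: {"fit": r["fit"] * 2})
labels = read_jsonl(workdir / "labels.jsonl")
```

Because every stage is idempotent against its output file, a crashed run restarts from the last completed stage instead of regenerating everything.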
A 4-stage SkillNormalizer canonicalizes skills before comparison. Without it, “Python 3.11” versus “Python” scores Jaccard=0.0 and the entire analysis breaks. A Day 1.5 audit found this exact bug: the pipeline reported 250/250 successes, but actual JSONL output showed skills generated as full sentences instead of tokens.
Key Results
The casual template achieved a 34% failure rate while career_changer hit 100%. That is a 66-percentage-point spread (chi-squared=32.74, p=1.35e-06). Fix the template before optimizing the content it generates.
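The template comparison reduces to a standard contingency test. The counts below are illustrative (two templates at 50 resumes each), not the project's full table, so the statistic will not match the reported 32.74:

```python
from scipy.stats import chi2_contingency

# Hypothetical 2x2 contingency table: rows = templates,
# columns = [failed, passed] out of 50 resumes per template.
observed = [
    [17, 33],  # casual: 34% failure rate
    [50, 0],   # career_changer: 100% failure rate
]
chi2, p, dof, expected = chi2_contingency(observed)
print(f"chi-squared={chi2:.2f}, p={p:.2e}, dof={dof}")
```

With a spread this large, the null hypothesis (template has no effect on failure rate) is rejected at any reasonable significance level.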
Skill overlap (Jaccard similarity) forms a clean gradient across fit levels: excellent=0.669, good=0.607, partial=0.620, poor=0.212, mismatch=0.005. 58.4% of generated resumes contained awkward language, and 50.8% were missing core skills. The GPT-4o judge averaged a 0.541 quality score. The correction loop fixed 8 of 8 flagged resumes.
Key Decisions
Instructor with nested schemas. 250 resumes with 35 nested Pydantic models, 100% validation rate. Instructor’s max_retries=5 with ValidationError injection means the model sees exactly which constraint failed. Same pattern from P1’s flat schema scaled unchanged to P4’s nested schema.
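The mechanism that makes the retry loop work is Pydantic's ValidationError message, which names the exact nested field and constraint that failed. A toy illustration (this schema slice and its field names are invented for the example; the real schema has 35 nested models, and the actual Instructor call is omitted since it requires an API client):

```python
from pydantic import BaseModel, Field, ValidationError

class Skill(BaseModel):
    name: str = Field(min_length=1, max_length=40)  # a token, not a sentence

class Experience(BaseModel):
    title: str
    skills: list[Skill] = Field(min_length=1)

class Resume(BaseModel):
    summary: str = Field(min_length=20)
    experience: list[Experience]

bad = {"summary": "too short", "experience": [{"title": "SWE", "skills": []}]}
try:
    Resume.model_validate(bad)
except ValidationError as e:
    # Instructor's max_retries loop feeds this message back to the model,
    # so it sees exactly which nested constraint failed and can correct it.
    feedback = str(e)
```

The error text pinpoints both failures by path (`summary`, `experience.0.skills`), which is why the pattern scales from a flat schema to a deeply nested one without changes.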
Custom SkillNormalizer. 4-stage pipeline: lowercase, version stripping, suffix removal, alias resolution. No third-party NLP library matched the alias rules I needed (“ML” to “machine learning”). The entire Jaccard gradient depends on this canonicalization.
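A minimal sketch of the 4-stage normalization, with a deliberately tiny alias table and suffix list (the real rule sets are larger; these entries are illustrative):

```python
import re

ALIASES = {"ml": "machine learning", "js": "javascript", "k8s": "kubernetes"}
SUFFIXES = (".js", " framework", " library")

def normalize(skill: str) -> str:
    s = skill.lower().strip()                 # 1. lowercase
    s = re.sub(r"\s*\d+(\.\d+)*$", "", s)     # 2. strip trailing version numbers
    for suf in SUFFIXES:                      # 3. suffix removal
        if s.endswith(suf):
            s = s[: -len(suf)]
    return ALIASES.get(s, s)                  # 4. alias resolution

def jaccard(a: set[str], b: set[str]) -> float:
    return len(a & b) / len(a | b) if a | b else 0.0

resume = {normalize(s) for s in ["Python 3.11", "ML", "FastAPI"]}
job = {normalize(s) for s in ["python", "machine learning", "docker"]}
# Without normalization, "Python 3.11" vs "python" contributes zero overlap
score = jaccard(resume, job)
```

Every downstream number, including the fit-level gradient in Key Results, sits on top of this canonicalization, which is why it was worth hand-rolling instead of approximating with an off-the-shelf NLP library.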
Two-phase validation. Structural validation (Pydantic at generation time) separated from semantic validation (labeler post-generation). The labeler has zero LLM calls. Fully unit-testable with constructed fixtures.
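The semantic phase looks roughly like this. The phrase list and overlap threshold below are illustrative stand-ins, not the project's actual rules, but they show why the labeler is deterministic and trivially unit-testable:

```python
AWKWARD_PHRASES = ("synergy", "leverage", "spearheaded")

def label(resume_skills: set[str], job_skills: set[str], summary: str) -> dict:
    """Pure-Python semantic checks: no LLM calls, fully deterministic."""
    union = resume_skills | job_skills
    overlap = len(resume_skills & job_skills) / len(union) if union else 0.0
    return {
        "missing_core_skills": overlap < 0.3,
        "awkward_language": any(p in summary.lower() for p in AWKWARD_PHRASES),
    }

flags = label({"python"}, {"python", "go", "rust"},
              "I leverage synergy across teams.")
```

Because the rules take constructed fixtures as plain sets and strings, every failure mode gets exercised in unit tests at zero API cost.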
FastAPI over Flask. 14 Pydantic schemas reused directly as endpoint types. Auto-generates Swagger at /docs with no additional code.
ChromaDB over FAISS. Same embedding model as P2, but ChromaDB was chosen for persistence across API restarts and metadata filtering. Different operational requirements than P2’s batch evaluation.
Known Gaps
250 resumes is a controlled dataset, not a large one. Enough to validate labeling logic and run a meaningful chi-squared test. Not enough to claim the template findings generalize. I would need 1,000+ with real job descriptions to test that.
Awkward language detection is keyword-based. The labeler flags phrases like “synergy” and “leverage” via a static list. It misses subtler AI tells like hedging patterns and over-qualification. A classifier trained on actual AI-generated versus human-written resumes would catch more.
Judge quality score of 0.541 is middling. I have not investigated whether this reflects genuine resume quality issues or judge calibration problems. Comparing against human ratings would clarify.
No latency benchmarking on the API. I measured pipeline quality but not endpoint response times under concurrent load.
- rag
- prompt-engineering
- fastapi
- synthetic-data
- pydantic