RAG Evaluation Pipeline
I tested 16 RAG configurations and found that semantic chunking + OpenAI embeddings + Cohere reranking achieves 0.747 Recall@5 on structured Markdown docs. This is how I got there.
The Problem
Most RAG pipelines ship with zero evaluation beyond “it seems to work.” You pick a chunk size from a blog post, use whatever embedding model the tutorial used, and hope for the best. There is no way to know if semantic chunking actually beats fixed-size, whether OpenAI embeddings are worth the API cost over local models, or if reranking helps or just adds latency.
I wanted actual numbers. So I built a framework that tests every combination of chunking strategy, embedding model, and reranking against the same set of questions, tracks every result in Braintrust, and produces side-by-side comparisons.
Architecture
16 configs come from crossing 5 chunking strategies with 3 embedding models, plus a BM25 baseline. Each config gets its own FAISS index. All 16 are evaluated against the same 56 synthetic QA pairs so comparisons are apples-to-apples. Every result is logged to Braintrust for reproducibility.
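The config grid is small enough to enumerate directly. A minimal sketch of the crossing (the strategy and model names below are hypothetical placeholders; the post doesn't list the exact five chunkers):

```python
from itertools import product

# Hypothetical labels standing in for the 5 chunking strategies
# and 3 embedding models described above.
CHUNKERS = ["fixed_256", "fixed_512", "fixed_1024", "recursive", "semantic"]
EMBEDDERS = ["minilm", "mpnet", "openai"]

# 5 chunkers x 3 embedders = 15 vector configs, plus the BM25 baseline = 16.
configs = [
    {"id": f"{c}+{e}", "chunker": c, "embedder": e}
    for c, e in product(CHUNKERS, EMBEDDERS)
]
configs.append({"id": "bm25", "chunker": "fixed_512", "embedder": None})

assert len(configs) == 16
```

Each entry then gets its own FAISS index and is run against the same 56 QA pairs, so a single loop over `configs` produces the full comparison table.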
Key Results
Best config: semantic chunking with OpenAI embeddings (“E-openai”). 0.625 Recall@5 without reranking, 0.747 with Cohere reranking (+19.5%). BM25 keyword baseline managed 0.381.
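Recall@5 here is the standard set-based metric, assuming the usual definition: the fraction of a question's ground-truth chunks that land in the top 5 retrieved, averaged over all questions. A minimal implementation:

```python
def recall_at_k(retrieved_ids, relevant_ids, k=5):
    """Fraction of relevant chunk IDs that appear in the top-k retrieved."""
    if not relevant_ids:
        return 0.0
    top_k = set(retrieved_ids[:k])
    return len(top_k & set(relevant_ids)) / len(relevant_ids)

def mean_recall_at_k(runs, k=5):
    """Average over (retrieved_ids, relevant_ids) pairs -- one per QA pair."""
    return sum(recall_at_k(r, rel, k) for r, rel in runs) / len(runs)
```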
Overlap is overrated. 50% overlap underperformed 25% overlap by 13%. Redundant chunks dilute the top-k instead of helping it. 10-25% is enough to prevent boundary fragmentation.
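A fixed-size chunker with fractional overlap makes the tradeoff concrete: at 50% overlap the stride halves, so the index fills with near-duplicate chunks that crowd the top-k. A sketch (not the exact chunker from the pipeline):

```python
def chunk_fixed(tokens, size=512, overlap_ratio=0.25):
    """Fixed-size chunking; stride shrinks as overlap grows."""
    stride = max(1, int(size * (1 - overlap_ratio)))
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # last chunk already reaches the end of the document
    return chunks

# On a 2,000-token doc: 25% overlap -> 5 chunks, 50% overlap -> 7 chunks,
# i.e. ~40% more vectors competing for the same top-5 slots.
```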
Embedding training data matters more than dimensionality. MPNet (768d) performed no better than MiniLM (384d). OpenAI (1536d) beat both by 26%. If the gap were about dimensions, MPNet should sit between the two. It does not. This is OpenAI's supervised training data doing the work.
Reranking is the cheapest win. ~20% average lift for $0.05 per 168 reranks. The weakest configs got the biggest lift, which suggests reranking partially compensates for bad chunking or weak embeddings.
Always manually sample your LLM judge. The judge flagged 73% of responses as hallucinations, but 22 of the 41 flagged responses were the model refusing to answer ("I do not have enough context"). That is a prompt calibration issue, not a hallucination problem.
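A cheap triage step before trusting the judge's headline number: split flagged responses into refusals and genuine hallucinations with a marker-string check. The marker phrases below are illustrative, not the exact strings from my transcripts:

```python
# Illustrative refusal markers -- tune these against your own transcripts.
REFUSAL_MARKERS = [
    "i do not have enough context",
    "i don't have enough context",
    "cannot answer based on the provided",
]

def is_refusal(answer: str) -> bool:
    a = answer.lower()
    return any(m in a for m in REFUSAL_MARKERS)

def triage(flagged_answers):
    """Split judge-flagged answers into refusals vs. likely hallucinations."""
    refusals = [a for a in flagged_answers if is_refusal(a)]
    hallucinations = [a for a in flagged_answers if not is_refusal(a)]
    return refusals, hallucinations
```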
Retrieval quality does not guarantee generation quality. Best retrieval config scored 0.747 Recall@5 but only 0.511 faithfulness. The LLM generated answers that were not grounded in the retrieved context almost half the time. Good retrieval is necessary but not sufficient.
Key Decisions
FAISS over ChromaDB. Under 1,000 vectors, brute-force exact search is the correct choice. No server dependency, no approximate search, deterministic results. Eliminates a variable in evaluation.
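For context, `faiss.IndexFlatIP` is just exhaustive inner-product search. The equivalent in plain NumPy shows why it is exact and deterministic at this scale:

```python
import numpy as np

def exact_search(index_vecs, query_vec, k=5):
    """Brute-force inner-product search -- what faiss.IndexFlatIP computes.
    One dot product per stored vector; no approximation, no tuning knobs."""
    scores = index_vecs @ query_vec
    top = np.argsort(-scores)[:k]  # highest scores first
    return top, scores[top]
```

At under 1,000 vectors this is microseconds of work, which is why approximate indexes (HNSW, IVF) would add nondeterminism for no benefit here.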
Semantic chunking validated by grid search. I did not assume semantic chunking would win. Tested it head-to-head against fixed-size chunking across every embedding model. The 3-22% improvement was consistent. The grid search approach is the point: measure, do not guess.
OpenAI embeddings win on cost-quality. $0.02 per million tokens. Beats local sentence-transformer models by 26% on retrieval quality. Unless you are processing millions of documents, the API cost is negligible compared to the quality gap.
Cohere reranking as second stage. 19.5% Recall@5 improvement on the best config (0.625 to 0.747). Applied only to the top-k results, not the full corpus. Cheap and effective.
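The two-stage shape is: over-retrieve with the vector index (say top 20), then re-score that short list and keep the top 5. Sketched with a pluggable scorer standing in for the Cohere call so it runs offline (the real SDK call is roughly `co.rerank(model=..., query=..., documents=..., top_n=...)`):

```python
def rerank(query, candidates, score_fn, top_n=5):
    """Second-stage rerank: re-score first-stage candidates and re-sort.
    In the pipeline, score_fn is Cohere's rerank endpoint; here it is
    any (query, doc) -> float function, as a runnable stand-in."""
    scored = sorted(candidates, key=lambda c: score_fn(query, c), reverse=True)
    return scored[:top_n]

# Toy stand-in scorer: word overlap between query and candidate.
def overlap_score(query, doc):
    return len(set(query.split()) & set(doc.split()))
```

Because only the top-k candidates are re-scored (not the full corpus), the per-query cost stays flat no matter how large the index grows.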
Known Gaps
Only tested on structured Markdown. Semantic chunking won because the input documents have clear header hierarchies. On unstructured text (PDFs, transcripts, scraped HTML), fixed-size chunking might perform equally well or better.
56 QA pairs is a small eval set. Enough to show directional differences between configs, but confidence intervals are wide. Production eval would need 200+ questions with human-verified ground truth.
51% faithfulness is low. Retrieval is working. Generation is not. I focused this project on the retrieval side deliberately. Prompt tuning for generation quality is the next layer.
Single-method retrieval only. I compared vector search and BM25 separately but did not implement hybrid retrieval (Reciprocal Rank Fusion). The data suggests RRF would be the right production choice.
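For reference, RRF itself is only a few lines: each document's fused score is the sum of 1/(k + rank) across the ranked lists it appears in, with k = 60 by convention:

```python
def rrf(rankings, k=60):
    """Reciprocal Rank Fusion over multiple ranked lists of doc IDs
    (e.g. one from vector search, one from BM25)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

A document ranked highly by both retrievers beats one that tops only a single list, which is exactly the behavior hybrid retrieval is after.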
No latency benchmarking. I measured quality but not speed. For production, the embedding and reranking latency tradeoffs matter.
- rag
- embeddings
- llm-evaluation
- vector-databases