Project 03 Completed

Contrastive Embedding Fine-Tuning

I fine-tuned all-MiniLM-L6-v2 on 1,475 dating profile pairs and flipped Spearman from -0.22 to +0.85. LoRA got 96.9% of that using 0.32% of the parameters.

Python · Sentence-Transformers · PEFT/LoRA · PyTorch · UMAP · HDBSCAN · scikit-learn

Key Metrics

  • LoRA vs standard performance: 96.9%
  • Trainable parameters: 0.32%
  • Model size reduction: 309x
  • Spearman improvement: −0.22 → +0.85

The Problem

The baseline all-MiniLM-L6-v2 model ranked incompatible dating profiles higher than compatible ones. Spearman was -0.22 on 295 eval pairs. AUC-ROC was 0.37, which is below coin-flip. The model had learned the opposite of compatibility. A single accuracy metric would not have caught this.

I needed to fix the embedding space and compare two approaches: full fine-tuning (all 22.7M parameters) versus LoRA (73K parameters, 0.32% of the model). Same training data, same loss function, same hyperparameters. Which one wins, and by how much?

Architecture

  • Data loader: 1,475 training pairs, 295 eval pairs
  • Training mode switch (both use CosineSimilarityLoss):
      Standard: full fine-tuning, all 22.7M params
      LoRA: rank-8 adapters, 73K params, 10x learning rate
  • Generate embeddings (del model + gc.collect between stages)
  • 8-metric evaluation: Spearman, margin, AUC-ROC, Cohen's d, F1, clustering
  • Comparison report: 8 charts + HTML + false-positive analysis

Both approaches start from the same base model and train on identical sentence pairs using CosineSimilarityLoss. Standard fine-tuning updates all 22.7M parameters (86.7 MB model). LoRA injects rank-8 adapters on query/value layers, training only 73K parameters (0.28 MB adapter file). Evaluation uses 8 metrics so no single metric can mask a failure. Everything renders into a self-contained HTML comparison report.
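The 73K trainable-parameter figure can be sanity-checked from the adapter shapes alone. Assuming MiniLM-L6's published dimensions (6 layers, 384-dim hidden size), rank-8 adapters on the query and value projections give:

```python
# LoRA adds two low-rank matrices per targeted module: A (d x r) and B (r x d).
hidden = 384   # all-MiniLM-L6-v2 hidden size
layers = 6     # transformer layers
rank = 8       # LoRA rank
modules = 2    # query and value projections per layer

params_per_module = 2 * hidden * rank              # A and B together
trainable = params_per_module * modules * layers
print(trainable)                                   # 73728, i.e. ~73K

total = 22_700_000                                 # full model parameter count
print(f"{trainable / total:.2%}")                  # 0.32%
```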

Key Results

Baseline was inverted. Spearman -0.219. AUC-ROC 0.373. Similarity margin -0.083. The model rated dissimilar sentences as more similar than similar ones.
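These inversion diagnostics are cheap to verify by hand. A minimal sketch of two of them, similarity margin and Spearman rank correlation, over plain Python lists of cosine similarities (the data here is illustrative, not the project's):

```python
from statistics import mean

def similarity_margin(sims, labels):
    """Mean cosine similarity of positive (label 1) pairs minus negatives.
    A negative margin means the model ranks incompatible pairs higher."""
    pos = [s for s, y in zip(sims, labels) if y == 1]
    neg = [s for s, y in zip(sims, labels) if y == 0]
    return mean(pos) - mean(neg)

def spearman(xs, ys):
    """Spearman rank correlation (no tie handling, illustration only)."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rk, i in enumerate(order):
            r[i] = rk
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

# An inverted model: high similarity exactly where compatibility is low.
sims = [0.9, 0.8, 0.2, 0.1]
print(similarity_margin(sims, [0, 0, 1, 1]))   # negative, like the -0.083 baseline
print(spearman(sims, [0.1, 0.2, 0.8, 0.9]))    # -1.0: perfectly inverted ranking
```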

Standard fine-tuning fixed it: Spearman 0.853, AUC-ROC 0.994, margin flipped to +0.940. False positives dropped from 137 to 3.

LoRA reached 96.9% of standard performance with 0.32% of the parameters (measured from training-time metrics; see Known Gaps for the evaluation caveat). The adapter is 0.28 MB versus the full model's 86.65 MB. Training took 35 seconds on an M2 MacBook versus 61 seconds for standard.

Key Decisions

LoRA is viable for small models. 96.9% of standard performance with a 309x smaller adapter. The 3.1% quality gap is measurable but acceptable for most applications. For models larger than 100M parameters, LoRA becomes the only practical option.

Skipped QLoRA. all-MiniLM-L6-v2 is 86 MB. It fits on any modern machine. 4-bit quantization would introduce precision loss for zero practical benefit.

CosineSimilarityLoss over Triplet/Contrastive. Directly optimizes the metric I evaluate on. Simpler data format (pairs, not triplets). No negative mining needed.
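CosineSimilarityLoss itself is simple: the mean squared error between the cosine similarity of the two sentence embeddings and the gold label. A pure-Python sketch of the objective (not the library's implementation):

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def cosine_similarity_loss(pairs, labels):
    """MSE between cos(u, v) and the target label for each sentence pair."""
    errs = [(cosine(u, v) - y) ** 2 for (u, v), y in zip(pairs, labels)]
    return sum(errs) / len(errs)

# Identical embeddings with label 1.0 incur zero loss;
# orthogonal embeddings with label 1.0 incur a loss of 1.0.
print(cosine_similarity_loss([([1.0, 0.0], [1.0, 0.0])], [1.0]))  # 0.0
print(cosine_similarity_loss([([1.0, 0.0], [0.0, 1.0])], [1.0]))  # 1.0
```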

What Broke and What I Learned

LoRA needs 10x higher learning rate. My first LoRA attempt used the same learning rate as standard fine-tuning (2e-5) and barely moved from baseline. The fix was 2e-4. LoRA’s adapter matrices have far fewer parameters absorbing the gradient signal, so they need a more aggressive learning rate. This is not documented anywhere obvious in the PEFT or sentence-transformers docs. I found it through systematic hyperparameter search after the initial failure.

If I had only run LoRA with the default learning rate, I would have concluded it does not work for sentence transformer fine-tuning. Completely wrong conclusion from a single hyperparameter.
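For reference, the adapter setup amounts to a small PEFT config. This is a sketch, not the project's exact code; the module names assume a BERT-style backbone where the attention projections are named `query` and `value`, and `lora_alpha`/`lora_dropout` are assumed values not stated in the writeup:

```python
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=8,                                # rank-8 adapters, as compared above
    lora_alpha=16,                      # scaling factor (assumed)
    target_modules=["query", "value"],  # BERT-style attention projections
    lora_dropout=0.1,                   # assumed value
)

# A SentenceTransformer's underlying HF model can be wrapped, e.g.:
# peft_model = get_peft_model(st_model[0].auto_model, lora_config)
# ...train with lr=2e-4 (10x the standard 2e-5), then merge before encoding:
# st_model[0].auto_model = peft_model.merge_and_unload()
```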

Adapter merge is a silent failure mode. LoRA achieved Spearman 0.827 during training, but post-training evaluation produced baseline-identical metrics. The adapter weights did not merge before encoding. The model runs without errors but produces unmodified embeddings. This is a real production risk with LoRA deployments that is invisible without comprehensive evaluation.
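One cheap guard is to diff tuned embeddings against baseline embeddings before trusting any downstream metric: identical vectors mean the adapter never merged. A minimal sketch over plain vector lists:

```python
def adapter_took_effect(base_vecs, tuned_vecs, tol=1e-6):
    """Return False if every tuned embedding is (near-)identical to baseline,
    the signature of an unmerged LoRA adapter producing unmodified outputs."""
    for b, t in zip(base_vecs, tuned_vecs):
        if any(abs(x - y) > tol for x, y in zip(b, t)):
            return True
    return False

base  = [[0.1, 0.2], [0.3, 0.4]]
tuned = [[0.1, 0.2], [0.3, 0.4]]            # unchanged: merge never happened
assert not adapter_took_effect(base, tuned)  # fail fast instead of reporting baseline metrics
```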

Constraints

8GB RAM on M2. Cannot load the training model and evaluation model simultaneously. The pipeline loads models sequentially with explicit del model + gc.collect() between stages. Without this, the process gets OOM-killed mid-evaluation.
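The sequential-loading pattern can be wrapped so no stage ever leaks a model reference (function names here are illustrative, not the project's actual API):

```python
import gc

def run_stage(load_model, work):
    """Load a model, run one pipeline stage, free the model before returning.
    Keeping no outside references lets gc reclaim the weights immediately."""
    model = load_model()
    result = work(model)
    del model        # drop the last reference...
    gc.collect()     # ...and force collection before the next stage loads
    return result

# Usage: stages run back to back without two models ever coexisting in RAM.
# train_out = run_stage(load_train_model, train)
# eval_out  = run_stage(load_eval_model, evaluate)
```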

1,475 training pairs. Small dataset by fine-tuning standards. Four epochs was the sweet spot. More epochs started overfitting.

No CUDA. CPU-only training on MacBook Air M2. Training times (~1 minute) are CPU-bound and not representative of GPU performance.

Known Gaps

One domain tested. Dating compatibility profiles with clear category labels. On unstructured text (job descriptions, product reviews), the same contrastive approach may need more data or different loss functions.

295 eval pairs. Enough to show the inversion and confirm the fix, but confidence intervals are wide. Production evaluation needs 1,000+ pairs with human-verified labels.

No hyperparameter sweep. 4 epochs, batch 16, warmup 100 based on sentence-transformers defaults and one manual round of tuning. A proper sweep likely finds a better configuration.

LoRA eval is incomplete. Training metrics confirm 0.827 Spearman, but post-training evaluation is blocked by the adapter merge bug. The LoRA comparison is based on training curves, not the full 8-metric suite.

  • fine-tuning
  • embeddings
  • lora
  • llm-evaluation