Ruby Jha · project-deep-dives
How I Calibrated an LLM Judge That Approved Everything
My first LLM judge had a 0% failure rate: it approved everything, which made it useless. This is the story of calibrating it to actually catch failures and building a correction loop that took synthetic data failures from 36 to zero.