
Ruby Jha · project-deep-dives · 3 min read

Building 9 AI Projects (While Working Full-Time)

Why I am building 9 AI systems from scratch while working full-time as an Engineering Manager. The portfolio, the progression, and what I have learned so far.

The reason

I have spent 20 years building through every major shift in enterprise software. Mainframe to client-server, on-prem to cloud, monolith to microservices. Each shift changed what engineering leaders needed to know. The current shift to AI is no different, except it is happening faster.

Every product is adding AI capabilities, every team needs people who understand these systems, and the engineering leaders who cannot build with AI will be managing work they do not understand. I did not want to be that leader. So I started building. Nine AI systems, each with evaluation metrics, architecture decision records, and documented tradeoffs.

The Portfolio Architecture

The 9 projects follow a deliberate progression through the applied AI stack:

  1. Data Generation (P1): Schema-driven synthetic data with Pydantic validation and LLM-as-Judge quality scoring
  2. Evaluation (P2): Multi-strategy RAG evaluation comparing 16 vector configurations across chunking, embeddings, and reranking
  3. Fine-Tuning (P3): Contrastive embedding fine-tuning, comparing standard fine-tuning vs LoRA head-to-head with comprehensive metrics
  4. Applied RAG (P4-P5): AI Resume Coach and a production RAG pipeline with hybrid search and FastAPI endpoints
  5. Multi-Agent Systems (P6-P9): Digital clone, feedback intelligence, Jira sprint planning, and DevOps RCA, all using CrewAI orchestration
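To make the hybrid search in P5 concrete: it combines keyword and vector retrieval, and one common way to merge two ranked result lists is reciprocal rank fusion. This is a minimal stdlib sketch of that general idea, with hypothetical document IDs, not code from the actual pipeline:

```python
def rrf_fuse(rankings, k=60):
    """Merge ranked lists of doc IDs via reciprocal rank fusion.

    Each document scores 1 / (k + rank) per list it appears in;
    k=60 is the commonly used damping constant from the RRF paper.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)

keyword_hits = ["doc3", "doc1", "doc7"]  # e.g. BM25 ordering
vector_hits = ["doc1", "doc3", "doc9"]   # e.g. cosine-similarity ordering
print(rrf_fuse([keyword_hits, vector_hits]))
```

Documents that rank highly in both lists float to the top, which is the whole point of going hybrid: keyword search catches exact terms that embeddings blur, and vectors catch paraphrases that keywords miss.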

Each project builds on the previous. P1 generates data that could train models evaluated by P2’s framework. P3’s fine-tuned embeddings feed into P5’s production pipeline. P6-P9 all use multi-agent patterns that emerge naturally once you understand the single-agent limitations from P4-P5.

What separates these from tutorials

Every tutorial shows you how to build a RAG pipeline in 20 lines. That is not what these projects are. Each one has evaluation frameworks with baselines, architecture decision records including paths I did not take, error handling for when things break, real test coverage (P2 has 557 tests, P3 has 112), and deployment considerations documented in ADRs.

The easiest way to spot a tutorial project: ask “what happens when it fails?” These projects have an answer.
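For illustration, here is the shape one such answer usually takes: retry transient failures with exponential backoff, then degrade to a safe fallback instead of crashing. The `call_llm` callable is hypothetical; this is a generic sketch of the pattern, not code from any of the nine projects:

```python
import time

def with_retries(fn, attempts=3, base_delay=0.5):
    """Call fn(), retrying with exponential backoff on failure."""
    last_err = None
    for i in range(attempts):
        try:
            return fn()
        except Exception as err:  # real code would catch narrower error types
            last_err = err
            time.sleep(base_delay * (2 ** i))
    raise last_err

def answer(question, call_llm):
    """Answer a question, degrading gracefully if the LLM call keeps failing."""
    try:
        return with_retries(lambda: call_llm(question))
    except Exception:
        # Fallback path: the system stays up even when the dependency is down
        return "Sorry, I couldn't answer that right now."
```

The point is not the backoff math; it is that the failure path is a deliberate design decision, documented and tested, rather than an unhandled stack trace.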

This site is also a project

rubyjha.dev is built with Astro 5.0, deployed on Cloudflare Pages, with content in MDX and custom components for metrics and architecture diagrams. Eventually it will include a RAG chatbot that answers questions about my work using these blog posts and project writeups as its knowledge base.

The series

This is post #1 in a 10-part series. Each project gets its own deep-dive with architecture diagrams, real metrics, and the decisions that shaped the system, including the ones that did not work on the first try.

Next up: how I built an LLM-as-Judge that went from approving everything to catching real failures, and why calibrating the judge turned out to be harder than building the generator.

All code is open source at github.com/rubsj/ai-portfolio.


Ruby Jha

Engineering Manager who builds. AI systems, enterprise products, and the teams that ship them.


Related Posts

fine-tuning Mar 29, 2026

LoRA Hit 96% of Full Fine-Tuning. The Default Learning Rate Almost Killed It.

I fine-tuned all-MiniLM-L6-v2 on dating profiles, flipped Spearman from -0.22 to +0.85, and found LoRA hit 96.2% of that with 0.32% of parameters.

8 min read

rag Mar 21, 2026

I Tested 16 RAG Configs So You Don't Have To: Embedding Choice Matters More Than Chunk Size

Grid search across 16 RAG configurations reveals embedding model selection drives 26% more retrieval quality than chunk tuning.

9 min read

synthetic-data Feb 28, 2026

How I Calibrated an LLM Judge That Approved Everything

My first LLM judge had a 0% failure rate. That meant it was useless. This is the story of calibrating it to actually catch failures, and building a correction loop that took synthetic data failures from 36 to zero.

10 min read

leadership May 1, 2026

When Standups Feel Like Interrogations

How to diagnose whether tight oversight is a trust problem or a legitimate need, and how to hand back autonomy without losing accountability.

6 min read