DevOps Root-Cause Analysis Assistant

When a production incident hits, the on-call engineer is context-switching between CloudWatch logs, Datadog metrics, GitHub Actions runs, and Slack threads. They are building a mental model of what went wrong by correlating signals across systems. That process is slow, stressful, and depends entirely on who happens to be on call.

This project builds a multi-agent system that does the correlation automatically. Five CrewAI agents handle distinct jobs: pipeline monitoring (GitHub Actions, Jenkins), log analysis (CloudWatch, Elasticsearch, Datadog), error classification into a 13-category taxonomy, LLM-powered root cause diagnosis, and remediation suggestions with CLI commands for Kubernetes, Docker, and AWS.

The part that makes this more than a log aggregator is the remediation layer. Actions have risk levels (low, medium, high) and an approval workflow. Cache clears auto-execute in dev. Service restarts need approval in production. Rollbacks always need a human. The system learns from resolved incidents and builds a knowledge base of error signatures and runbook entries over time.

The goal is to cut mean time to resolution by giving the on-call engineer a structured hypothesis within 30 seconds instead of 15 minutes of manual investigation.