Show HN: Agentic Reliability Framework – Multi-agent AI self-heals failures

3 points by petter2025us 10 hours ago

Hey HN! I'm Juan, former reliability engineer at NetApp where I handled 60+ critical incidents per month for Fortune 500 clients.

I built ARF after seeing the same pattern repeatedly: production AI systems fail silently, humans wake up at 3 AM, take 30-60 minutes to recover, and companies lose \$50K-\$250K per incident.

ARF uses 3 specialized AI agents:

Detective: Anomaly detection via FAISS vector memory Diagnostician: Root cause analysis with causal reasoning Predictive: Forecasts failures before they happen

Result: 2-minute MTTR (vs 45-minute manual), 15-30% revenue recovery.

Tech stack: Python 3.12, FAISS, SentenceTransformers, Gradio Tests: 157/158 passing (99.4% coverage) Docs: 42,000 words across 8 comprehensive files

Live demo: https://huggingface.co/spaces/petter2025/agentic-reliability...

The interesting technical challenge was making agents coordinate without tight coupling. Each agent is independently testable but orchestrated for holistic analysis.

Happy to answer questions about multi-agent systems, production reliability patterns, or FAISS for incident recall!

GitHub: https://github.com/petterjuan/agentic-reliability-framework

(Also available for consulting if you need this deployed in your infrastructure: https://lgcylabs.vercel.app/)