Show HN: Kepler, an SRE agent that does root cause analysis for incidents
I’m building an AI-powered SRE agent called Kepler to help engineers diagnose and explain production incidents faster. It integrates with systems like Prometheus, Loki, GitHub, and CI/CD pipelines to understand what changed, what broke, and why — and summarizes it in plain English.
The idea came from years of firefighting outages as a backend engineer and realizing how much time gets lost just figuring out where to look. We’ve all been there — four engineers, three dashboards, ten tabs open, and two hours gone before someone says: “oh, it’s that PR from this morning.”
Kepler automates root cause analysis by correlating alerts, logs, metrics, traces, and code diffs. Some of the features we’re experimenting with:
GitHub PR blame scoring against error logs
Top-log summarization using LLMs
Incident replay and RCA report generation
Slack-first UX for on-call usage
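To make the PR blame scoring idea concrete, here is a rough sketch of one way it could work. It is not Kepler's actual implementation. It assumes a PR is more suspicious when the file paths it touched share names with the error logs and when it merged shortly before the incident. The `PullRequest` fields, the 24-hour window, and the overlap-times-recency formula are all illustrative assumptions.

```python
import re
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class PullRequest:
    # Hypothetical shape; a real integration would fill this from the GitHub API.
    number: int
    merged_at: datetime
    changed_files: list[str]


def tokens(text: str) -> set[str]:
    """Lowercased identifier-like tokens of three or more characters."""
    return {t.lower() for t in re.findall(r"[A-Za-z_][A-Za-z0-9_]{2,}", text)}


def blame_score(pr: PullRequest, error_logs: list[str], incident_start: datetime,
                window: timedelta = timedelta(hours=24)) -> float:
    """Score in [0, 1]: token overlap with the logs, scaled by merge recency."""
    age = incident_start - pr.merged_at
    if age < timedelta(0) or age > window:
        return 0.0  # merged after the incident began, or too long before it
    recency = 1.0 - age / window  # 1.0 means merged right at incident start
    pr_tokens: set[str] = set()
    for path in pr.changed_files:
        pr_tokens |= tokens(path)
    if not pr_tokens:
        return 0.0
    log_tokens = tokens("\n".join(error_logs))
    overlap = len(pr_tokens & log_tokens) / len(pr_tokens)
    return overlap * recency


def rank_prs(prs: list[PullRequest], error_logs: list[str],
             incident_start: datetime) -> list[tuple[int, float]]:
    """Return (PR number, score) pairs, most suspicious first."""
    scored = [(pr.number, blame_score(pr, error_logs, incident_start)) for pr in prs]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Made-up example data, for illustration only.
    incident = datetime(2024, 1, 1, 12, 0)
    prs = [
        PullRequest(101, incident - timedelta(hours=2),
                    ["services/checkout/payment_client.py"]),
        PullRequest(102, incident - timedelta(hours=1), ["docs/readme.md"]),
    ]
    logs = ["ERROR payment_client: timeout calling gateway in checkout"]
    for number, score in rank_prs(prs, logs, incident):
        print(f"PR #{number}: {score:.2f}")
```

A token-overlap heuristic like this is cheap and easy to explain to whoever is on call, but it misses indirect causes such as config or dependency changes. Treat it as a baseline to compare smarter scoring against.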
Live demo: https://www.loom.com/share/78cdb41f6d504ef8b3457d4279ae170c
Website: https://meetkepler.com
We’re planning to open source the core. The waitlist is open for anyone who wants to try it out.
We’re also exploring fine-tuning open-source reasoning models for infrastructure and SRE reasoning — with incident triage as our first focus.
It’s very early and rough around the edges, but I’d love your feedback:
What do you wish were automated during incidents?
What slows you down during triage?
Would you use a system like this while on call?
If you want to jam or chat: https://calendly.com/meetkepler/30min
Thanks!