Show HN: Kepler, an SRE agent that does root cause analysis for incidents

2 shunmugavel2609 0 6/3/2025, 5:00:36 PM loom.com ↗
Engineering leaders care a lot about MTTR: the faster you mitigate and remediate an outage, the less user pain and revenue loss you take.

I’m building an AI-powered SRE agent called Kepler to help engineers diagnose and explain production incidents faster. It integrates with systems like Prometheus, Loki, GitHub, and CI/CD pipelines to understand what changed, what broke, and why — and summarizes it in plain English.

The idea came from years of firefighting outages as a backend engineer and realizing how much time gets lost just figuring out where to look. We’ve all been there — four engineers, three dashboards, ten tabs open, and two hours gone before someone says: “oh, it’s that PR from this morning.”

Kepler automates root cause analysis by correlating alerts, logs, metrics, traces, and code diffs. Some of the features we’re experimenting with:

GitHub PR blame scoring against error logs

Top-log summarization using LLMs

Incident replay and RCA report generation

Slack-first UX for on-call use
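To make the PR blame scoring idea concrete, here is a minimal sketch of one way it could work. It is an illustration only, not Kepler's actual implementation: the `PullRequest` type, the `blame_scores` function, the file-path regex, and the 24-hour window are all assumptions. The heuristic scores each recently merged PR by how often the files it changed show up in error logs, weighted by how close the merge was to the incident start.

```python
import re
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class PullRequest:
    # Hypothetical shape; in practice this would be filled from the GitHub API.
    number: int
    merged_at: datetime
    changed_files: set[str]


# Assumption: file paths appear in log lines as tokens ending in a source extension.
FILE_RE = re.compile(r"[\w./-]+\.(?:py|go|js|ts|java|rb)\b")


def files_in_logs(log_lines):
    """Count how many log lines mention each file path."""
    counts = {}
    for line in log_lines:
        for path in set(FILE_RE.findall(line)):
            counts[path] = counts.get(path, 0) + 1
    return counts


def blame_scores(prs, log_lines, incident_start, window=timedelta(hours=24)):
    """Rank PRs merged within `window` before the incident, highest score first."""
    mentions = files_in_logs(log_lines)
    total = sum(mentions.values())
    scores = {}
    for pr in prs:
        age = incident_start - pr.merged_at
        if age < timedelta(0) or age > window:
            continue  # merged after the incident began, or too long ago
        overlap = sum(n for path, n in mentions.items() if path in pr.changed_files)
        file_score = overlap / total if total else 0.0
        recency = 1.0 - age / window  # 1.0 means merged right at incident start
        scores[pr.number] = round(file_score * recency, 3)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    start = datetime(2025, 6, 3, 12, 0)
    prs = [
        PullRequest(101, datetime(2025, 6, 3, 9, 0), {"app/billing/charge.py"}),
        PullRequest(102, datetime(2025, 6, 3, 11, 0), {"docs/readme.md"}),
    ]
    logs = [
        'File "app/billing/charge.py", line 42, in charge',
        "KeyError in app/billing/charge.py",
        "timeout in app/db/pool.py",
    ]
    print(blame_scores(prs, logs, start))
```

A real system would probably need more than file overlap (for example, matching changed functions against stack frames, or handling config and dependency changes that never appear in a stack trace), so treat the score as a way to order suspects, not a verdict.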

Live demo: https://www.loom.com/share/78cdb41f6d504ef8b3457d4279ae170c

Website: https://meetkepler.com

We’re planning to open source the core. Waitlist is open for anyone who wants to try it out.

We’re also exploring fine-tuning open-source reasoning models for infrastructure and SRE reasoning — with incident triage as our first focus.

It’s very early and rough around the edges, but I’d love your feedback:

What do you wish were automated during incidents?

What slows you down during triage?

Would you use a system like this while on call?

If you want to jam or chat: https://calendly.com/meetkepler/30min

Thanks!
