AI Meets WinDBG

207 points by thunderbong | 41 comments | 5/5/2025, 5:11:51 AM | svnscha.de

Comments (41)

lowleveldesign · 8h ago
I do a lot of Windows troubleshooting and am still thinking about how to incorporate AI into my work. The posted project looks interesting, and it's impressive how fast it was created. Since it uses MCP, it should be possible to bind it to local models. I wonder how performant and effective that would be. When working in the debugger, you should be careful about what you send to external servers (for example, Copilot). Process memory may contain unencrypted passwords, usernames, domain configuration, IP addresses, etc. Also, I don't think that vibe-debugging will work without knowing what the eax register is or how to navigate the stack/heap. It will solve some obvious problems, such as most exceptions, but for anything more demanding (bugs in application logic, race conditions, etc.), you will still need to get your hands dirty.

I am actually more interested in improving the debugger interface. For example, an AI assistant could help me create breakpoint commands that nicely print function parameters when I only partly know the function signature and do not have symbols. I used Claude/Gemini for such tasks and they were pretty good at it.
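
Something along these lines is what I mean (the module/function name is made up and the registers just assume the x64 calling convention): a breakpoint command that prints the first two parameters on every hit and then continues, without needing full symbols:

    bp myapp!ProcessRequest ".printf \"ProcessRequest ctx=%p len=%x\\n\", @rcx, @rdx; gc"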

As a side note, I recall Kevin Gosse also implemented a WinDbg extension [1][2] which used the OpenAI API to interpret debugger command output.

[1] https://x.com/KooKiz/status/1641565024765214720

[2] https://github.com/kevingosse/windbg-extensions

anougaret · 7h ago
this is pretty cool but ultimately it won't be enough to debug real bugs that are nested deep within business logic or that happen because of long chains of events across multiple services/layers of the stack

imo what AI needs to debug is either:

- train it with RL to use breakpoints + a debugger, or to do print debugging, but that'll suck because the chains of actions are super freaking long, and we also know how it goes with AI memory currently, it's not great

- a sort of always-on omniscient debugger that can inform the AI of everything the program/services did (Sentry-like observability, but on steroids). The AI would then just search within that and find the root cause

neither approach is going to be easy to make happen, but imo if we all spend 10+ hours every week debugging, it's worth the shot

that's why I'm currently working on approach 2. I made a time-travel debugger/observability engine for JS/Python, and I'm now working on plugging it into AI context as efficiently as possible, so that hopefully one day it can debug even super long sequences of actions in dev & prod

it's super WIP and not self-hostable yet but if you want to check it out: https://ariana.dev/

indymike · 58m ago
> this is pretty cool but ultimately it won't be enough to debug real bugs that are nested deep within business logic

I'm looking at this as a better way to get the humans pointed in the right direction. Ariana.dev looks interesting!

Narishma · 30m ago
It's more likely to waste your time by pointing you in the wrong direction.
anougaret · 21m ago
hahaha yeah, even real developers can't anticipate the direction of a bug too well on the first try
anougaret · 33m ago
yes, it can be a nice lightweight way to debug with a bit of AI; other tools in that space will probably be higher involvement
ehnto · 6h ago
I think you hit the nail on the head, especially for deeply embedded enterprise software. The long action chains and the time taken to set up debugging scenarios are what make debugging time-consuming. Solving the inference side of things would be great, but I feel it takes too much knowledge that is in neither the codebase nor the LLM to actually make an LLM useful once you are set up with a debugging state.

Like you said, running over a stream of events, states, and data for that debugging scenario is probably way more helpful. It would also be great to prime the context with business rules and history for the company. Otherwise LLMs will make the same mistake devs make: not knowing the "why" behind something and thinking the "what" is most important.

anougaret · 21m ago
thanks, couldn't agree more :)
rafaelmn · 6h ago
Frankly, this kind of stuff getting upvoted makes HN less and less valuable as a news source - this is yet another "hey, I trivially exposed something to the LLM and got some funny results on a toy example".

These kinds of demos were cool 2 years ago - then we got function calling in the API, it became super easy to build this stuff, and the reality hit that LLMs were kind of shit and unreliable at using even the most basic tools. Like oh wow, you can get a toy example working and suddenly it's a "natural language interface to WinDBG".

I am excited about progress on this front in any domain - but FFS, show actual progress or something interesting. Show me an article like this [1] where the LLM did anything useful. Or just show what you did that's not "oh I built a wrapper on a CLI" - did you fine-tune the model to get better performance? Did you compare which models perform better by setting up some benchmark and find one to be impressive?

I am not shitting on OP here, because it's fine to share what you're doing and get excited about it - maybe this is step one, but why the f** is this a front-page article?

[1] https://cookieplmonster.github.io/2025/04/23/gta-san-andreas...

anougaret · 6h ago
yeah, it is still truly hard and rewarding to do deep, innovative software, but everyone is regressing to the mean, rushing to low-hanging fruit, and just plugging old A into new B in the hope it makes them VC money or something

a real, quality AI breakthrough in software creation & maintenance will require a deep rework of many layers of the software stack, low and high level.

kevingadd · 4h ago
fwiw, WinDBG actually has support for time-travel debugging. I've used it before quite successfully, it's neat.
anougaret · 4h ago
usual limits of debuggers = barely usable to debug real scenarios
pjmlp · 41m ago
Since the Borland days on MS-DOS, they have served me pretty well in many real scenarios.

What I usually keep bumping into are people who never bothered to learn how to use their debuggers beyond the "introduction to debuggers" class, if any.

danielovichdk · 8h ago
Claiming to use WinDBG for debugging a crash dump, and the only commands I can find in the MCP code are these? I am not trying to be a dick here, but how does this really work under the covers? Is the MCP learning windbg? Is there a model that knows windbg? I am asking because I have no idea.

        results["info"] = session.send_command(".lastevent")
        results["exception"] = session.send_command("!analyze -v")
        results["modules"] = session.send_command("lm")
        results["threads"] = session.send_command("~")
You cannot debug every crash dump with only these 4 commands.
psanchez · 7h ago
It looks like it is using the Microsoft Console Debugger (CDB) as the interface to WinDbg.

Just had a quick look at the code: https://github.com/svnscha/mcp-windbg/blob/main/src/mcp_serv...

I might be wrong, but at first glance I don't think it is only using those 4 commands. It might be using them internally to get context to pass to the AI agent, but it looks like it exposes:

    - open_windbg_dump
    - run_windbg_cmd
    - close_windbg_dump
    - list_windbg_dumps
The most interesting one is "run_windbg_cmd", because it might allow the MCP server to send whatever the AI agent wants. E.g.:

    elif name == "run_windbg_cmd":
        args = RunWindbgCmdParams(**arguments)
        session = get_or_create_session(
            args.dump_path, cdb_path, symbols_path, timeout, verbose
        )
        output = session.send_command(args.command)
        return [TextContent(
            type="text",
            text=f"Command: {args.command}\n\nOutput:\n```\n" + "\n".join(output) + "\n```"
        )]
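
So from the agent's side, a call to that tool would presumably look something like this (a sketch based on the parameter names above; the dump path and command are just examples):

    arguments = {
        "dump_path": "C:\\dumps\\myapp.dmp",  # example dump path
        "command": "kb",                      # any CDB/WinDbg command, e.g. a stack trace with arguments
    }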

(edit: formatting)
gustavoaca1997 · 7h ago
I think the magic happens in the "run_windbg_cmd" function. AFAIK, the agent will use it to pass any WinDBG command the model thinks will be useful. The implementation is basically the interface between the model and the actual CDB calls, through CDBSession.
eknkc · 5h ago
Yeah, that seems correct. It's like creating an SQLite MCP server with a single "run_sql" tool. Which is just fine, I guess, as long as the LLM knows how to write SQL (or WinDBG commands). And they definitely do know that. I'd even say this is better, because it shifts the capability to the LLM instead of the MCP server.
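
Rough sketch of what I mean, mirroring the handler style quoted above (sqlite3 from the stdlib; the tool and parameter names are made up for the analogy):

    import sqlite3

    def handle_run_sql(arguments):
        # single generic tool: the LLM writes the SQL, we just execute it and return text
        conn = sqlite3.connect(arguments["db_path"])
        try:
            rows = conn.execute(arguments["sql"]).fetchall()
        finally:
            conn.close()
        return [{"type": "text", "text": f"Query: {arguments['sql']}\n\nRows:\n{rows}"}]
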
dark-star · 3h ago
The magic happens in the "!analyze -v" part. This does quite a long analysis of a crash dump (https://learn.microsoft.com/en-us/windows-hardware/drivers/d...)

After that, all that is required is interpreting the results and connecting them with the source code.

Still impressive at first glance, but I wonder how well it works on a more complex example (a crash in the Windows kernel due to a broken driver, for example).

JanneVee · 6h ago
> Crash dump analysis has traditionally been one of the most technically demanding and least enjoyable parts of software development.

I for one enjoy crash dump analysis because it is a technically demanding, rare skill. I know I'm an exception, but I enjoy actually learning the stuff so I can deterministically produce the desired result! I even apply it to other parts of the job, like learning the currently used programming language and actually reading the documentation of libraries/frameworks, instead of copy-pasting solutions from the "shortcut du jour", like Stack Overflow yesterday and LLMs today!

criddell · 3h ago
Are you using WinDbg? What resources did you use to get really good at it?

Analyzing crash dumps is a small part of my job. I know enough to examine exception context records and associated stack traces, and 80% of the time that's enough. Bruce Dawson's blog has a lot of great stuff, but it's pretty advanced.

I’m looking for material to help me jump that gap.

JanneVee · 2h ago
I didn't say that I was any good, just that I enjoyed it.

I have a dog-eared copy of Advanced Windows Debugging that I've used, but I also have books on reverse engineering and disassembly, and a little bit of curiosity and practice. I also have the .NET version, which I haven't used as much. I also enjoyed the Vostokov books, even though there is a lack of editing in them.

Edit to add: It is not so much about usage of the tool as it is about understanding what is going on in the dump file; you are ahead in knowledge if you can do stack traces and look at exception records.

the_duke · 2h ago
I feel like current top models (Gemini 2.5 Pro, etc.) would already be good developers if they had the feedback cycle and capabilities that real developers have:

* reading the whole source code

* looking up dependency documentation and code, search related blog posts

* getting compilation/linter warnings and errors

* Running tests

* Running the application and validating output (e.g., for a webserver: start the server, send requests, check the response)

The tooling is slowly catching up, and you can enable a bunch of this already with MCP servers, but we are nowhere near the optimum yet.

Expect significant improvements in the near future, even if the models don't get better.

thegeomaster · 2h ago
This is exactly what frameworks like Claude Code, OpenAI Codex, Cursor agent mode, OpenHands, SWE-Agent, Devin, and others do.

It definitely does allow models to do more.

However, the high-level planning, reflection, and executive function still aren't there. LLMs can nowadays navigate very complex tasks using "intuition": just ask them to do the task, give them tools, and they do a good job. But if the task is too long or requires too much information, the context length degrades performance significantly, so you have to switch to a multi-step pipeline with multiple levels of execution.

This is, perhaps unexpectedly, where things start breaking down. Having the LLM write down a plan lossily compresses the "intuition", and LLMs (yes, even Gemini 2.5 Pro) cannot understand what's important to include in such a grand plan, how to predict possible externalities, etc. This is a managerial skill and seems distinct from closed-form coding, which you can always RL towards.

Errors, omissions, and assumptions baked into the plan get multiplied many times over by the subsequent steps that follow the plan. Sometimes, the plan heavily depends on the outcome of some of the execution steps ("investigate if we can..."). Allowing the "execution" LLM to go back and alter the plan results in total chaos, but following the plan rigidly leads to unexpectedly stupid issues, where the execution LLM is trying to follow flawed steps, sometimes even recognizing that they are flawed and trying to self-correct inappropriately.

In short, we're still waiting for an LLM which can keep track of high-level task context and effectively steer and schedule lower-level agents to complete a general task on a larger time horizon.

For a more intuitive example, see how current agentic browser use tools break down when they need to complete a complex, multi-step task. Or just ask Claude Code to do a feature in your existing codebase (that is not simple CRUD) the way you'd tell a junior dev.

pjmlp · 37m ago
I expect that if I phrase it the way I would for an offshore junior dev, so that I actually get a swing instead of a tire, it will get quite close to the desired outcome.

However, this usually takes much more effort than just doing the damn thing myself.

demarq · 2h ago
It’s now a matter of when, and I’m working on that problem.
JanSchu · 3h ago
This is one of the most exciting and practical applications of AI tooling I've seen in a long time. Crash dump analysis has always felt like the kind of task that time forgot—vital, intricate, and utterly user-hostile. Your approach bridges a massive usability gap with the exact right philosophy: augment, don't replace.

A few things that stand out:

The use of MCP to connect CDB with Copilot is genius. Too often, AI tooling is skin-deep—just a chat overlay that guesses at output. You've gone much deeper by wiring actual tool invocations to AI cognition. This feels like the future of all expert tooling.

You nailed the problem framing. It’s not about eliminating expertise—it’s about letting the expert focus on analysis instead of syntax and byte-counting. Having AI interpret crash dumps is like going from raw SQL to a BI dashboard—with the option to drop down if needed.

Releasing it open-source is a huge move. You just laid the groundwork for a whole new ecosystem. I wouldn’t be surprised if this becomes a standard debug layer for large codebases, much like Sentry or Crashlytics became for telemetry.

If Microsoft is smart, they should be building this into VS proper—or at least hiring you to do it.

Curious: have you thought about extending this beyond crash dumps? I could imagine similar integrations for static analysis, exploit triage, or even live kernel debugging with conversational AI support.

Amazing work. Bookmarked, starred, and vibed.

Helmut10001 · 2h ago
I have noticed a lot of improvements in this area too. I recently had a problem with my site-to-site IPsec connection. I had an LLM explain the logs from both sides, and together we came to a conclusion. Having it distill the problematic part from the huge logs saved a significant amount of effort and time.
lgiordano_notte · 2h ago
Curious how you're handling multi-step flows or follow-ups; seems like that's where MCP could really shine, especially compared to brittle CLI scripts. We've seen similar wins with browser agents once structured actions and context are in place.
codepathfinder · 2h ago
Built this around mid-2023 and found interesting results!
cadamsdotcom · 9h ago
Author built an MCP server for windbg: https://github.com/svnscha/mcp-windbg

It knows plenty of arcane commands in addition to the common ones, which is really cool and lets it do amazing things for you, the user.

To the author: most of your audience knows what MCP is, so may I suggest adding a tl;dr to help people quickly understand what you've done?

Tepix · 7h ago
Sounds really neat!

How does it compare to using the Ghidra MCP server?

cjbprime · 6h ago
Ghidra's a decompiler and WinDBG is a debugger, so they'd be complementary.
indigodaddy · 8h ago
My word, that's one of the most beautiful sites I've ever encountered on mobile.

Zebfross · 9h ago
Considering AI is trained on the average human experience, I have a hard time believing it would be able to make any significant difference in this area. The best experience I've had debugging at this level was using Microsoft's time-travel debugger, which allows stepping forward and back.
cjbprime · 6h ago
You should try AI sometime. It's quite good, and can do things (like "analyze these 10000 functions and summarize what you found out about how this binary works, including adding comments everywhere") that individual humans do not scale to.
voidspark · 5h ago
It can analyze in 2 seconds a crash dump that could take hours for an experienced developer, or be impossible for the "average human".