You should check out the ChatDBG project, which AFAICT goes much further than this work, though in a different direction; among other things, it lets the LLM drive the debugging process. It has been out since early 2023, though it has of course evolved since then. We initially did a WinDBG integration but have since focused on lldb/gdb and pdb (the Python debugger), especially for Python notebooks. In particular, for native code, it integrates a language server to let the LLM easily find declarations of and references to variables, for example. We spent considerable time developing an API that enables the LLM to make the best use of the debugger's capabilities. (It is also not limited to post-mortem debugging.) Code is here [1] with some videos; it's been downloaded north of 80K times to date. Our technical paper [2] will be presented at FSE (a top software engineering conference) in June. Our evaluation shows that ChatDBG is able to resolve many issues on its own, and that with some slight nudging from humans it is even more effective.
[1] https://github.com/plasma-umass/ChatDBG
[2] https://arxiv.org/abs/2403.16354
Is the benefit of using a language server, as opposed to just giving access to the codebase, simply a reduction in the number of tokens used? Or are there other benefits?
lowleveldesign · 8h ago
I do a lot of Windows troubleshooting and am still thinking about incorporating AI into my work. The posted project looks interesting, and it's impressive how fast it was created. Since it's using MCP, it should be possible to bind it to local models. I wonder how performant and effective that would be. When working in the debugger, you should be careful about what you send to external servers (for example, Copilot). Process memory may contain unencrypted passwords, usernames, domain configuration, IP addresses, etc. Also, I don't think that vibe-debugging will work without knowing what the eax register is or how to navigate the stack/heap. It will solve some obvious problems, such as most exceptions, but for anything more demanding (bugs in application logic, race conditions, etc.), you will still need to get your hands dirty.
I am actually more interested in improving the debugger interface. For example, an AI assistant could help me create breakpoint commands that nicely print function parameters when you only partly know the function signature and do not have symbols. I used Claude/Gemini for such tasks and they were pretty good at it.
As a side note, I recall Kevin Gosse also implemented a WinDbg extension [1][2] which used the OpenAI API to interpret debugger command output.
[1] https://x.com/KooKiz/status/1641565024765214720
[2] https://github.com/kevingosse/windbg-extensions
this is pretty cool but ultimately it won't be enough to debug real bugs that are nested deep within business logic or happening because of long chains of events across multiple services/layers of the stack
imo what AI needs to debug is either:
- train with RL to use breakpoints + debugger or to do print debugging, but that'll suck because chains of action are super freaking long and also we know how it goes with AI memory currently, it's not great
- a sort of omniscient debugger always on that can inform the AI of all that the program/services did (sentry-like observability but on steroids). And then the AI would just search within that and find the root cause
none of the two approaches are going to be easy to make happen but imo if we all spend 10+ hours every week debugging that's worth the shot
that's why I'm currently working on approach 2. I made a time-travel debugger/observability engine for JS/Python, and I'm now working on plugging it into AI context as efficiently as possible so that it can debug even super long sequences of actions in dev and, hopefully one day, prod
it's super WIP and not self-hostable yet but if you want to check it out: https://ariana.dev/
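For readers who want a feel for what an "always-on" recorder means in practice, here is a minimal, generic sketch in Python using sys.settrace. This is not ariana.dev's engine, and the event format is invented for the example; a real tool would need sampling, indexing and far lower overhead:

    # Toy "record everything" tracer: captures call/line/return events an AI could later search.
    import json
    import sys

    events = []  # a real recorder would stream this to disk or a server

    def tracer(frame, event, arg):
        if event in ("call", "line", "return"):
            events.append({
                "event": event,
                "function": frame.f_code.co_name,
                "file": frame.f_code.co_filename,
                "line": frame.f_lineno,
                # locals can be huge; a real tool would sample or summarize them
                "locals": {k: repr(v) for k, v in frame.f_locals.items()},
            })
        return tracer  # returning the tracer keeps line-level tracing on inside nested calls

    def buggy(n):
        total = 0
        for i in range(n):
            total += i
        return total

    sys.settrace(tracer)
    buggy(3)
    sys.settrace(None)

    # the "AI searches the recording" step would query this log instead of re-running the program
    print(json.dumps(events[:3], indent=2))

The hard part an actual product has to solve is everything around this loop: overhead, storage, cross-service correlation, and deciding what slice of the recording to hand to the model.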
indymike · 1h ago
> this is pretty cool but ultimately it won't be enough to debug real bugs that are nested deep within business logic
I'm looking at this as a better way to get the humans pointed in the right direction. Ariana.dev looks interesting!
Narishma · 53m ago
It's more likely to waste your time by pointing you in the wrong direction.
anougaret · 44m ago
hahaha yeah, even real developers can't anticipate the direction of a bug very well on the first try
anougaret · 57m ago
yes, it can be a nice lightweight way to debug with a bit of AI
other tools in that space will probably be higher involvement
ehnto · 7h ago
I think you hit the nail on the head, especially for deeply embedded enterprise software. The long action chains and the time taken to set up debugging scenarios are what make debugging time consuming. Solving the inference side of things would be great, but I feel it takes too much knowledge that lives in neither the codebase nor the LLM to actually make an LLM useful once you are set up with a debugging state.
Like you said, running over a stream of events, states and data for that debugging scenario is probably way more helpful. It would also be great to prime the context with business rules and history for the company. Otherwise LLMs will make the same mistake devs make: not knowing the "why" behind something and thinking the "what" is most important.
anougaret · 44m ago
thanks couldn't agree more :)
rafaelmn · 7h ago
Frankly this kind of stuff getting upvoted kind of makes HN less and less valuable as a news source - this is yet another "hey I trivially exposed something to the LLM and I got some funny results on a toy example".
These kinds of demos were cool 2 years ago - then we got function calling in the API, it became super easy to build this stuff, and the reality hit that LLMs were kind of shit and unreliable at using even the most basic tools. Like oh woow, you can get a toy example working and suddenly it's a "natural language interface to WinDBG".
I am excited about progress on this front in any domain - but FFS, show actual progress or something interesting. Show me an article like this [1] where the LLM did anything useful. Or just show what you did that's not "oh I built a wrapper on a CLI" - did you fine-tune the model to get better performance? Did you set up a benchmark to compare models and find one to be impressive?
I am not shitting on OP here, because it's fine to share what you're doing and get excited about it - maybe this is step one, but why the f** is this a front page article?
[1] https://cookieplmonster.github.io/2025/04/23/gta-san-andreas...
yeah it is still truly hard and rewarding to do deep, innovative software
but everyone is regressing to the mean, rushing to the low-hanging fruit, and just plugging old A into new B in the hopes it makes them VC money or something
a real, quality AI breakthrough in software creation & maintenance will require a deep rework of many layers of the software stack, low and high level.
kevingadd · 5h ago
fwiw, WinDBG actually has support for time-travel debugging. I've used it before quite successfully, it's neat.
anougaret · 4h ago
usual limits of debuggers = barely usable to debug real scenarios
pjmlp · 1h ago
Since the Borland days on MS-DOS, they have served me pretty well in many real scenarios.
Usually what I keep bumping into are people who never bothered to learn how to use their debuggers beyond the "introduction to debuggers" class, if any.
danielovichdk · 8h ago
Just had a quick look at the code: https://github.com/svnscha/mcp-windbg/blob/main/src/mcp_serv...
Claiming to use WinDBG for debugging a crash dump, and the only commands I can find in the MCP code are these? I am not trying to be a dick here, but how does this really work under the covers? Is the MCP learning WinDBG? Is there a model that knows WinDBG? I am asking because I have no idea.
I might be wrong, but at first glance I don't think it is only using those 4 commands. It might be using them internally to get context to pass to the AI agent, but it looks like it exposes several tools. The most interesting one is "run_windbg_cmd", because it might allow the MCP server to send whatever the AI agent wants.
I think the magic happens in the function "run_windbg_cmd". AFAIK, the agent will use that function to pass any WinDBG command that the model thinks will be useful. The implementation basically includes the interface between the model and actually calling CDB through CDBSession.
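To make that concrete, here is a rough sketch of what a single "pass-through" tool like that can look like. This is not the actual mcp-windbg implementation (which keeps a persistent session via CDBSession); it assumes the official MCP Python SDK's FastMCP helper, shells out to cdb.exe once per command, and the tool parameters are made up for illustration:

    # Hedged sketch, not the real project: one MCP tool that forwards arbitrary CDB commands.
    import subprocess
    from mcp.server.fastmcp import FastMCP  # assumes the official MCP Python SDK

    mcp = FastMCP("windbg-sketch")

    @mcp.tool()
    def run_windbg_cmd(dump_path: str, command: str) -> str:
        """Run an arbitrary WinDBG/CDB command against a crash dump and return the raw output."""
        # -z loads the dump, -c runs the command string at startup; the trailing 'q' quits cdb
        result = subprocess.run(
            ["cdb.exe", "-z", dump_path, "-c", f"{command}; q"],
            capture_output=True, text=True, timeout=120,
        )
        return result.stdout

    if __name__ == "__main__":
        mcp.run()  # stdio transport; the LLM client decides which commands (k, !analyze -v, ...) to send

As the next comment notes, the WinDBG knowledge lives in the model; the server mostly just hands it a terminal.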
eknkc · 6h ago
Yeah, that seems correct. It's like creating an SQLite MCP server with a single tool, "run_sql". Which is just fine, I guess, as long as the LLM knows how to write SQL (or WinDBG commands). And they definitely do know that. I'd even say this is better, because it shifts the capability to the LLM instead of the MCP server.
After that, all that is required is interpreting the results and connecting it with the source code.
Still impressive at first glance, but I wonder how well it works with a more complex example (like a crash in the Windows kernel due to a broken driver).
JanneVee · 6h ago
> Crash dump analysis has traditionally been one of the most technically demanding and least enjoyable parts of software development.
I for one enjoy crash dump analysis because it is a technically demanding, rare skill. I know I'm an exception, but I enjoy actually learning the stuff so I can deterministically produce the desired result! I even apply it to other parts of the job, like learning the currently used programming language and actually reading the documentation of libraries/frameworks, instead of copy-pasting solutions from the "shortcut du jour" - Stack Overflow yesterday and LLMs today!
criddell · 4h ago
Are you using WinDbg? What resources did you use to get really good at it?
Analyzing crash dumps is a small part of my job. I know enough to examine exception context records and associated stack traces and 80% of the time, that’s enough. Bruce Dawson’s blog has a lot of great stuff but it’s pretty advanced.
I’m looking for material to help me jump that gap.
JanneVee · 2h ago
I didn't say that I was any good, just that I enjoyed it.
I have a dog-eared copy of Advanced Windows Debugging that I've used, but I also have books on reverse engineering and disassembly, plus a bit of curiosity and practice. I have the .NET version as well, which I haven't used as much. I also enjoyed the Vostokov books, even though there is a lack of editing in them.
Edit to add: It is not so much about usage of the tool as about understanding what is going on in the dump file; you are ahead in knowledge if you can read stack traces and look at exception records.
the_duke · 3h ago
I feel like current top models (Gemini 2.5 Pro, etc.) would already be good developers if they had the feedback cycle and capabilities that real developers have:
* reading the whole source code
* looking up dependency documentation and code, searching related blog posts
* getting compilation/linter warnings and errors
* running tests
* running the application and validating output (e.g., for a webserver: start the server, send requests, check the response)
The tooling is slowly catching up, and you can enable a bunch of this already with MCP servers, but we are nowhere near the optimum yet.
Expect significant improvements in the near future, even if the models don't get better.
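As a rough illustration of that feedback cycle (not any particular product's agent loop - ask_model and apply_patch are hypothetical stand-ins for whatever model API and edit mechanism you use):

    # Sketch of a compile/test feedback loop: feed real failure output back to the model.
    import subprocess

    def run(cmd: list[str]) -> tuple[int, str]:
        """Run a command and return (exit code, combined stdout/stderr)."""
        p = subprocess.run(cmd, capture_output=True, text=True)
        return p.returncode, p.stdout + p.stderr

    def ask_model(prompt: str) -> str:
        """Hypothetical stand-in for an LLM call that proposes a patch."""
        raise NotImplementedError

    def apply_patch(patch: str) -> None:
        """Hypothetical stand-in for applying the proposed edit to the working tree."""
        raise NotImplementedError

    for attempt in range(5):
        code, output = run(["pytest", "-q"])  # tests stand in for the compile/lint/run-the-app gates
        if code == 0:
            print(f"green after {attempt} attempts")
            break
        # the key point: the model sees the actual errors, not a human's summary of them
        apply_patch(ask_model(f"The tests failed with:\n{output}\nPropose a minimal fix."))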
thegeomaster · 2h ago
This is exactly what frameworks like Claude Code, OpenAI Codex, Cursor agent mode, OpenHands, SWE-Agent, Devin, and others do.
It definitely does allow models to do more.
However, the high-level planning, reflection and executive function still aren't there. LLMs can nowadays navigate very complex tasks using "intuition": just ask them to do the task, give them tools, and they do a good job. But if the task is too long or requires too much information, the ever-growing context degrades performance significantly, so you have to switch to a multi-step pipeline with multiple levels of execution.
This is, perhaps unexpectedly, where things start breaking down. Having the LLM write down a plan lossily compresses the "intuition", and LLMs (yes, even Gemini 2.5 Pro) cannot understand what's important to include in such a grand plan, how to predict possible externalities, etc. This is a managerial skill and seems distinct from closed-form coding, which you can always RL towards.
Errors, omissions, and assumptions baked into the plan get multiplied many times over by the subsequent steps that follow the plan. Sometimes, the plan heavily depends on the outcome of some of the execution steps ("investigate if we can..."). Allowing the "execution" LLM to go back and alter the plan results in total chaos, but following the plan rigidly leads to unexpectedly stupid issues, where the execution LLM is trying to follow flawed steps, sometimes even recognizing that they are flawed and trying to self-correct inappropriately.
In short, we're still waiting for an LLM which can keep track of high-level task context and effectively steer and schedule lower-level agents to complete a general task on a larger time horizon.
For a more intuitive example, see how current agentic browser use tools break down when they need to complete a complex, multi-step task. Or just ask Claude Code to do a feature in your existing codebase (that is not simple CRUD) the way you'd tell a junior dev.
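For anyone who hasn't hit this wall yet, the planner/executor split being described looks roughly like this; llm() is a hypothetical stand-in for a model call, and the sketch exists only to show where the lossy hand-off sits, not as a working agent:

    # Two-level pipeline: a planner compresses the task into steps, an executor follows them.
    def llm(prompt: str) -> str:
        """Hypothetical stand-in for a model call."""
        raise NotImplementedError

    def plan(task: str) -> list[str]:
        # the written plan is a lossy compression of whatever "intuition" the model had
        return llm(f"Break this task into numbered steps:\n{task}").splitlines()

    def run_task(task: str) -> str:
        context = ""
        for step in plan(task):  # followed rigidly: flaws baked into the plan get amplified here
            context += llm(
                f"Context so far:\n{context}\n\nDo this step and report the result:\n{step}"
            ) + "\n"
            # letting this inner call rewrite the remaining plan is where the "total chaos" starts
        return context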
pjmlp · 1h ago
I expect that if I prompt it the way I would instruct an offshore junior dev - spelling things out to the point where I actually get a swing instead of a tire - then it will get quite close to the desired outcome.
However, this usually takes much more effort than just doing the damn thing myself.
demarq · 2h ago
It’s now a matter of when, and I’m working on that problem.
lgiordano_notte · 3h ago
Curious how you're handling multi-step flows or follow-ups; that seems like where MCP could really shine, especially compared to brittle CLI scripts. We've seen similar wins with browser agents once structured actions and context are in place.
codepathfinder · 2h ago
Built this around mid-2023 and found interesting results!
Knows plenty of arcane commands in addition to the common ones, which is really cool & lets it do amazing things for you, the user.
To the author: most of your audience knows what MCP is; may I suggest adding a tl;dr to help people quickly understand what you've done?
Tepix · 7h ago
Sounds really neat!
How does it compare to using the Ghidra MCP server?
trealira · 8m ago
This isn't a decompiler, but there are LLM tools for decompilation, like LLM4Decompile.
cjbprime · 6h ago
Ghidra's a decompiler and WinDBG is a debugger, so they'd be complementary.
indigodaddy · 9h ago
My word, that's one of the most beautiful sites I've ever encountered on mobile.
Zebfross · 9h ago
Considering AI is trained on the average human experience, I have a hard time believing it would be able to make any significant difference in this area. The best experience I’ve had debugging at this level was using Microsoft’s time travel debugger which allows stepping forward and back.
cjbprime · 6h ago
You should try AI sometime. It's quite good, and can do things (like "analyze these 10000 functions and summarize what you found out about how this binary works, including adding comments everywhere") that individual humans do not scale to.
voidspark · 5h ago
It can analyze a crash dump in 2 seconds, something that could take hours for an experienced developer and would be impossible for the "average human".
JanSchu · 3h ago
This is one of the most exciting and practical applications of AI tooling I've seen in a long time. Crash dump analysis has always felt like the kind of task that time forgot—vital, intricate, and utterly user-hostile. Your approach bridges a massive usability gap with the exact right philosophy: augment, don't replace.
A few things that stand out:
The use of MCP to connect CDB with Copilot is genius. Too often, AI tooling is skin-deep—just a chat overlay that guesses at output. You've gone much deeper by wiring actual tool invocations to AI cognition. This feels like the future of all expert tooling.
You nailed the problem framing. It’s not about eliminating expertise—it’s about letting the expert focus on analysis instead of syntax and byte-counting. Having AI interpret crash dumps is like going from raw SQL to a BI dashboard—with the option to drop down if needed.
Releasing it open-source is a huge move. You just laid the groundwork for a whole new ecosystem. I wouldn’t be surprised if this becomes a standard debug layer for large codebases, much like Sentry or Crashlytics became for telemetry.
If Microsoft is smart, they should be building this into VS proper—or at least hiring you to do it.
Curious: have you thought about extending this beyond crash dumps? I could imagine similar integrations for static analysis, exploit triage, or even live kernel debugging with conversational AI support.
Amazing work. Bookmarked, starred, and vibed.
Helmut10001 · 3h ago
I have noticed a lot of improvements in this area too. I recently had a problem with my site-to-site IPsec connection. I had an LLM explain the logs from both sides, and together we came to a conclusion. Having it distill the problematic part from the huge logs saved a significant amount of effort and time.