Ask HN: Why no inference directly from flash/SSD?
1 point by myrmidon | 2 comments | 9/8/2025, 8:15:39 AM
My understanding is that current LLMs require a lot of memory for their pre-computed weights (which are constant at inference time).
Why is it not currently feasible to keep those weights in flash memory (a fast PCIe SSD RAID or similar) and use RAM only for intermediate values/results?
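To make the idea concrete, here is a minimal sketch of what I have in mind (Python/NumPy). Everything in it is illustrative: the weights.bin file, the layer shapes, and the toy per-layer computation are placeholders. The point is just that each layer's weight matrix is memory-mapped straight from the SSD as it is needed, while only the activations live in RAM.

```python
import numpy as np

HIDDEN = 4096          # illustrative hidden size
N_LAYERS = 32          # illustrative layer count
DTYPE = np.float16
LAYER_BYTES = HIDDEN * HIDDEN * np.dtype(DTYPE).itemsize

def layer_weights(layer_idx: int) -> np.ndarray:
    """Map one layer's weight matrix directly from flash, without copying it all into RAM."""
    return np.memmap(
        "weights.bin",            # hypothetical file holding all layer weights back to back
        dtype=DTYPE,
        mode="r",
        offset=layer_idx * LAYER_BYTES,
        shape=(HIDDEN, HIDDEN),
    )

def forward(x: np.ndarray) -> np.ndarray:
    """Toy forward pass: activations (x) stay in RAM, weights stream in from the SSD."""
    for i in range(N_LAYERS):
        w = layer_weights(i)      # the OS page cache pulls the needed pages from flash
        x = np.tanh(x @ w)        # stand-in for the real per-layer computation
    return x

# usage (assuming weights.bin exists on a fast SSD):
# y = forward(np.zeros(HIDDEN, dtype=DTYPE))
```

Obviously a real implementation would need prefetching and batching, but this is the basic shape of the proposal.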
Even modest success on this front seems very attractive to me, because flash storage appears much cheaper and easier to scale than GPU memory right now.
Are there any efforts in this direction? Is this a flawed approach for some reason, or am I fundamentally misunderstanding things?
A GPU setup with a terabyte of video memory costs a fortune by comparison; there has to be some reason why people are not trying really hard to make this work, no?