Ask HN: Why no inference directly from flash/SSD?

1 point by myrmidon | 2 comments | 9/8/2025, 8:15:39 AM
My understanding is that current LLMs require a lot of space for pre-computed weights (which are constant at inference time).

Why is it currently not feasible to just keep those in flash memory (a fast PCIe SSD RAID or some such), and only use RAM for intermediate values/results?

Even modest success on this front seems very attractive to me, because flash storage appears much cheaper and easier to scale than GPU memory right now.

Are there any efforts in this direction? Is this a flawed approach for some reason, or am I fundamentally misunderstanding things?
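One way to picture the idea (a minimal sketch with made-up file names and sizes, not a claim about how any real runtime does it): memory-map the weight files so the OS pages them in from flash on demand, and keep only the activations resident in RAM.

    # Sketch: stream weights from disk via mmap instead of holding them in RAM.
    # File names, layer count, and dimensions below are placeholders.
    import numpy as np

    def write_dummy_layers(n_layers=3, dim=1024):
        """Create small placeholder weight files standing in for a real checkpoint."""
        paths = []
        for i in range(n_layers):
            path = f"layer_{i}.npy"
            np.save(path, np.random.randn(dim, dim).astype(np.float16))
            paths.append(path)
        return paths

    def forward(x, layer_paths):
        """Toy MLP forward pass that streams one weight matrix at a time from disk."""
        for path in layer_paths:
            w = np.load(path, mmap_mode="r")   # weights stay on flash, paged in on access
            x = np.maximum(x @ w, 0.0)         # only the activation vector lives in RAM
        return x

    paths = write_dummy_layers()
    out = forward(np.random.randn(1024).astype(np.float16), paths)
    print(out.shape)

The catch is that "paged in on access" means every weight still has to cross the SSD's read bandwidth each time it is used, which is where the comments below pick up.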

Comments (2)

sunscream89 · 16h ago
> A typical DRAM has a transfer rate of approximately 2-20GB/s, whereas typical SSDs have a transfer rate of 50MB-200MB/s. So it's one to two orders of magnitude slower.
myrmidon · 11h ago
I don't think the bandwidth gap is that big-- a single WD SN8100 drive (before any potential gain from RAID) already offers sequential read speeds of >10 GB/s, at under $200 for 1 TB of storage.

A GPU setup with a terabyte of video memory costs a fortune by comparison-- there has to be some kind of reason that people are not trying really hard to make this work, no?
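For scale, a rough back-of-envelope (assumed figures, not from the thread): at batch size 1, each generated token has to read every weight once, so weight-read bandwidth puts a ceiling on the token rate.

    # Rough bandwidth ceiling for batch-1 decoding; all numbers are assumptions.
    model_bytes = 70e9 * 2      # ~70B parameters at FP16
    ssd_bps     = 10e9          # one fast PCIe 5.0 SSD, ~10 GB/s sequential read
    hbm_bps     = 3000e9        # single-GPU HBM, roughly 3 TB/s

    print(f"SSD-bound: {ssd_bps / model_bytes:.2f} tokens/s")   # ~0.07 tokens/s
    print(f"HBM-bound: {hbm_bps / model_bytes:.1f} tokens/s")   # ~21.4 tokens/s

Under those assumptions, even a RAID of fast SSDs closes only part of the gap, which is roughly why weight-streaming setups tend to rely on large batches or caching the hottest layers in RAM to amortize each read.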
