Matrix-vector multiplication implemented in off-the-shelf DRAM for Low-Bit LLMs

98 points by cpldcpu | 18 comments | 5/4/2025, 11:35:30 PM | arxiv.org ↗

Comments (18)

cpldcpu · 2h ago
Some more background information:

One of the original proposals for in-DRAM compute: https://users.ece.cmu.edu/~omutlu/pub/in-DRAM-bulk-AND-OR-ie...

First demonstration with off-the-shelf parts: https://parallel.princeton.edu/papers/micro19-gao.pdf

DRAM Bender, the tool they are using to implement this: https://github.com/CMU-SAFARI/DRAM-Bender

Memory-Centric Computing: Recent Advances in Processing-in-DRAM: https://arxiv.org/abs/2412.19275

userbinator · 2h ago
Did anyone else notice the absolutely insane author lists of references 1 and 3?

I was expecting to find this 2016 article in there: https://news.ycombinator.com/item?id=12469270

This 2019 one does show up: https://news.ycombinator.com/item?id=22712811

Of course, this "out of spec" behaviour of DRAM, more specifically the ability to do copying, is also implicated in this infamous bug: https://news.ycombinator.com/item?id=5314959

It seems more than one person independently observed such a thing, and thought "this might be a useful behaviour".

walterbell · 2h ago
> By intentionally issuing DRAM commands that violate manufacturer-specified timing parameters... [gaining] massive parallelism up to 65,536 bitwise operations in parallel.

Take that, binary blobs for DRAM training!

morphle · 1h ago
A bit unscientific that they don't cite the original Intelligent RAM (IRAM) sources from 1997:

https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=iram...

cpldcpu · 1h ago
I also strongly suspect that there are earlier sources.

However, IRAM looks like compute-near-memory, where an ALU is added to the memory chip. Compute-in-memory is about using the memory array itself to perform the operations.

To be fair, CIM looked much less appealing before the advent of deep learning with its enormous vector lengths, so people instead tried to build something that allows more fine-grained control of the operations.
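Roughly, the difference in one toy Python sketch (the function names are mine, not from either line of work):

    # Compute-near-memory (IRAM-style): an ALU sits next to the DRAM
    # array, but data still streams through it one word at a time.
    def cnm_and(row_a: list[bool], row_b: list[bool]) -> list[bool]:
        return [a & b for a, b in zip(row_a, row_b)]  # O(columns) ALU steps

    # Compute-in-memory: the array itself computes. Activating rows a, b
    # and a constant all-zeros row together drives each bitline to the
    # majority of the three cells, and MAJ(a, b, 0) == a AND b, so every
    # column is computed by a single command sequence.
    def cim_and(row_a: list[bool], row_b: list[bool]) -> list[bool]:
        zeros = [False] * len(row_a)
        return [sum(bits) >= 2 for bits in zip(row_a, row_b, zeros)]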

morphle · 1h ago
> I also strongly suspect that there are earlier sources.

You are right; I remember papers from around 1972 that did compute in memory. I just couldn't locate links to them in a few minutes.

robwwilliams · 4h ago
This is just mind-bendingly weird and wonderfully creative. It can pay to work in the weeds! Bravo.

userbinator · 1h ago
This behaviour has been around since the earliest DRAMs with multiplexed row/column addresses. The Mostek MK4096 of 1973 could probably do this. Only took about half a century for someone to figure it out.

Bolwin · 5h ago
They're doing matrix operations in the DRAM itself? That sounds insane and also fascinating.

nkurz · 5h ago
Yup, and incredibly they are able to do this on standard RAM by "intentionally violating the timing parameters":

Processing-Using-DRAM (PUD) leverages the inherent analog operational characteristics of DRAM to enable highly parallel bit-serial computations directly within memory arrays. Prior research has demonstrated that commercial off-the-shelf DRAM can achieve PUD functionality without hardware modifications by intentionally violating the timing parameters.

These studies have established two fundamental PUD operations: RowCopy and majority-of-X (MAJX) (Fig. 1). The RowCopy operation facilitates data movement between different rows within a subarray by issuing a PRE command followed immediately by an ACT command before bitline precharging completes, enabling data transfer through the bitlines. This operation affects all cells along a row simultaneously, making it approximately 100 times faster than processor-mediated data movement.

The MAJX operation performs a majority vote among X cells sharing the same bitline that are activated simultaneously, implemented in commercial DRAM by issuing ACT, PRE, and ACT commands in rapid succession without delays. This allows concurrent activation of 2∼32 rows. MAJX enables bit-serial computations that leverage the parallelism of subarrays with 65,536 columns, serving as the fundamental computational unit for PUD.
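A minimal functional model of those two primitives (my own Python/NumPy sketch; SubarraySim and its methods are hypothetical, and the analog timing/charge-sharing behavior is abstracted away):

    import numpy as np

    ROWS, COLS = 64, 65_536  # one subarray: 65,536 bitlines, as in the paper

    class SubarraySim:
        def __init__(self):
            self.cells = np.zeros((ROWS, COLS), dtype=bool)

        def row_copy(self, src, dst):
            # PRE followed by ACT before precharging completes: the source
            # row's charge on the bitlines is written into the destination row.
            self.cells[dst] = self.cells[src]

        def majx(self, rows):
            # ACT-PRE-ACT in rapid succession activates several rows at once;
            # each bitline settles to the majority of the connected cells,
            # and that value is written back into every activated row.
            result = 2 * np.sum(self.cells[rows], axis=0) > len(rows)
            self.cells[rows] = result
            return result

    # Bitwise AND across all 65,536 columns in a single MAJ3, since
    # MAJ(a, b, 0) == a AND b (likewise MAJ(a, b, 1) == a OR b):
    sa = SubarraySim()
    sa.cells[0] = np.random.rand(COLS) > 0.5  # operand row A
    sa.cells[1] = np.random.rand(COLS) > 0.5  # operand row B
    sa.row_copy(0, 4); sa.row_copy(1, 5)      # MAJ overwrites its inputs, so use copies
    sa.cells[2] = False                       # constant all-zeros row
    and_row = sa.majx([4, 5, 2])              # one operation, 65,536 ANDs
    assert np.array_equal(and_row, sa.cells[0] & sa.cells[1])

Bit-serial arithmetic, and ultimately the paper's matrix-vector products, is then built up by repeating such copy/MAJ steps across the bit positions of the operands.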

nayuki · 4h ago
This kind of low-level protocol manipulation of DRAM has some similarities to rowhammer attacks.

elcritch · 3h ago
I hope Micron or another commercial player builds a product on this!

tamlin · 56m ago
Samsung and SK-Hynix have had specs and papers for a few years already for HBM and GDDR. e.g.

* https://www.servethehome.com/sk-hynix-ai-memory-at-hot-chips...
* https://www.servethehome.com/samsung-processing-in-memory-te...

Not sure anyone has started using it in production.

summarity · 5h ago
Getting LLM inference running on anything is going to be the next “it runs Doom”.

willvarfar · 2h ago
Can we expect to see matrix multiplication and perhaps other ops move from classic CPUs out into the DRAM, perhaps with deliberate hardware support?

And does such a processing shift give an advantage to Samsung etc.? Where does this leave NVIDIA etc.?

imtringued · 58m ago
Your questions are kind of amusing since Apple will use LPDDR6-PIM on the next generation of iPhones.

https://www.patentlyapple.com/2024/12/apple-plans-to-transit...

xiphias2 · 1h ago
This would be a cool way to make a cheap inference device for the largest LLMs.

swimwiththebeat · 2h ago
So is this a new technique for doing computations within existing DRAM to overcome the memory-wall problem of modern computing?