Tensor Manipulation Unit (TMU): Reconfigurable, Near-Memory, High-Throughput AI

27 points | by transpute | 3 comments | 6/23/2025, 1:43:11 AM | arxiv.org ↗

Comments (3)

WithinReason · 48m ago
Isn't this a software problem being solved in hardware? Ideally you would try to avoid going to memory in the first place by fusing the operations, which should be much faster than speeding up memory ops. E.g. you should never do an explicit im2col before a convolution; it should be fused. However, it's hard to argue with a 0.019 mm² area increase.
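
For readers unfamiliar with the fusion point above, here is a rough NumPy sketch (illustrative only, not from the paper) of the difference: an explicit im2col materializes a large patch matrix in memory before the matmul, while a direct ("fused") convolution reads the input in place and never writes that intermediate.

```python
import numpy as np

def conv2d_im2col(x, w):
    """Explicit im2col: builds a (Ho*Wo, k*k) patch matrix before the matmul.
    The patch matrix is pure extra memory traffic. Single-channel for brevity."""
    H, W = x.shape
    k = w.shape[0]
    Ho, Wo = H - k + 1, W - k + 1
    cols = np.empty((Ho * Wo, k * k), dtype=x.dtype)
    for i in range(Ho):
        for j in range(Wo):
            cols[i * Wo + j] = x[i:i + k, j:j + k].ravel()
    return (cols @ w.ravel()).reshape(Ho, Wo)

def conv2d_direct(x, w):
    """'Fused' direct convolution: no intermediate patch matrix is written."""
    H, W = x.shape
    k = w.shape[0]
    Ho, Wo = H - k + 1, W - k + 1
    out = np.zeros((Ho, Wo), dtype=x.dtype)
    for di in range(k):
        for dj in range(k):
            out += w[di, dj] * x[di:di + Ho, dj:dj + Wo]
    return out

x = np.random.rand(64, 64).astype(np.float32)
w = np.random.rand(3, 3).astype(np.float32)
assert np.allclose(conv2d_im2col(x, w), conv2d_direct(x, w), atol=1e-4)
```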
mikewarot · 31m ago
The only memory involved should be at the input and output of a pipeline stage that does an entire layer of an LLM. I'm of the opinion that we'll end up with effectively massive FPGAs with some stages of pipelining that have NO memory access internally, so that you get one token per clock cycle.

100 million tokens per second is currently worth about $130,000,000/day. (Or so ChatGPT 4.1 told me a few days ago)

I'd like to drop that by a factor of at least 1000:1
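
For what it's worth, the figure roughly checks out if one assumes a price of about $15 per million tokens; that price is an assumption for the arithmetic below, not something stated in the thread.

```python
# Back-of-the-envelope check of the "$130M/day" figure.
# Assumed (hypothetical) price: ~$15 per million tokens served.
tokens_per_second = 100e6
price_per_token = 15 / 1e6        # $15 per million tokens (assumed)
seconds_per_day = 86_400

revenue_per_day = tokens_per_second * price_per_token * seconds_per_day
print(f"${revenue_per_day:,.0f}/day")   # ~$129,600,000/day
```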

KnuthIsGod · 4h ago
Cutting-edge and innovative AI hardware research from China.

Looks like American sanctions are driving a new wave of innovation in China.

" This work addresses that gap by introducing the Ten- sor Manipulation Unit (TMU): a reconfigurable, near-memory hardware block designed to execute data-movement-intensive (DMI) operators efficiently. TMU manipulates long datastreams in a memory-to-memory fashion using a RISC-inspired execution model and a unified addressing abstraction, enabling broad support for both coarse- and fine-grained tensor transformations.

The proposed architecture integrates TMU alongside a TPU within a high-throughput AI SoC, leveraging double buffering and output forwarding to improve pipeline utilization. Fabricated in SMIC 40 nm technology, the TMU occupies only 0.019 mm² while supporting over 10 representative TM operators. Benchmarking shows that TMU alone achieves up to 1413.43× and 8.54× operator-level latency reduction over ARM A72 and NVIDIA Jetson TX2, respectively.

When integrated with the in-house TPU, the complete system achieves a 34.6% reduction in end-to-end inference latency, demonstrating the effectiveness and scalability of reconfigurable tensor manipulation in modern AI SoCs."
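
The abstract does not spell out what the "unified addressing abstraction" looks like, but as a purely illustrative software sketch (hypothetical, not the TMU's actual microarchitecture), many tensor-manipulation operators reduce to a memory-to-memory copy driven by per-dimension base offsets and strides for the source and destination streams:

```python
import numpy as np

def copy_with_affine_addressing(src, dst, shape, src_strides, dst_strides,
                                src_base=0, dst_base=0):
    """Illustrative memory-to-memory move driven by per-dimension strides.

    Operators such as transpose, reshape-with-copy, slicing, concat placement,
    and im2col-style gathers can all be expressed by choosing suitable base
    offsets and strides for the source and destination buffers.
    """
    for idx in np.ndindex(*shape):
        s = src_base + sum(i * st for i, st in zip(idx, src_strides))
        d = dst_base + sum(i * st for i, st in zip(idx, dst_strides))
        dst[d] = src[s]

# Example: 2D transpose of a 3x4 row-major tensor expressed purely as strides.
rows, cols = 3, 4
src = np.arange(rows * cols, dtype=np.float32)   # flat source buffer
dst = np.zeros(rows * cols, dtype=np.float32)    # flat destination buffer
copy_with_affine_addressing(src, dst, (rows, cols),
                            src_strides=(cols, 1),   # read row-major
                            dst_strides=(1, rows))   # write column-major
assert np.array_equal(dst.reshape(cols, rows), src.reshape(rows, cols).T)
```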