X X^t can be faster

112 points by robinhouston | 5/16/2025, 3:45:30 PM | 33 comments | arxiv.org

Comments (33)

TimorousBestie · 2h ago
I wish they could have modeled memory cache movements as well, somehow. It would have made the analysis more difficult but often these large matrix algorithms live or die by how cache-friendly the memory access patterns are.

Splitting into 4x4 blocks is typically very nice, though. Maybe it doesn’t matter so much to practical runtime.

ttoinou · 2h ago
Are there researchers interested in accelerating these algorithms while also keeping the maximum accuracy of the results? This and other optimizations trade multiplications for more additions, but from the little I know, floating-point additions and subtractions are risky precision-wise, while multiplications are harmless. Also, using FMA (fused multiply-add) operations would be beneficial.
constantcrying · 2h ago
The question of the stability of the algorithm is interesting; the paper does not seem to discuss it, though. You are correct that we should expect the algorithms to deliver different results.

>This and other optimizations trade multiplications for more additions, but from the little I know, floating-point additions and subtractions are risky precision-wise, while multiplications are harmless.

Certainly not true. In general, different orders of magnitude are a problem. With multiplication and division these problems are more severe, as those operations lead to greater changes in magnitude, although even that is a generalization.

wizzwizz4 · 18m ago
Different orders of magnitude aren't a problem when multiplying floats, because the order-of-magnitude portion gets translated into addition of exponents.

They're a problem with fixed point: might you have been thinking of that?
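
For example, with ordinary double-precision Python floats (a minimal sketch):

    # Multiplying values of wildly different magnitude is fine: the exponents
    # just add, and the relative error stays within a few ulps.
    a, b = 1e-20, 3e20
    print(a * b)        # about 3.0

    # Adding them is where precision goes missing: the small operand is
    # absorbed completely, so subtracting b back gives exactly 0.0.
    print((a + b) - b)  # 0.0, the 1e-20 contribution has vanished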

Dylan16807 · 13m ago
We're not looking at instructions in a vacuum. No matter what, you have both multiplications and additions happening.

So their point is that if you apply more multiplications before the inevitable addition, you are likely increasing the danger level of that addition.

wizzwizz4 · 1m ago
Oh, if one of the multiplications yields a positive number, and another yields a negative number? That makes sense; I forgot linear spaces were over fields.
almostgotcaught · 31m ago
> while also keeping the maximum accuracy of the results

All of these papers/algos are for the ML hype-train. ML algos are approximate anyway so no one cares about absolute accuracy, only the precision of the overall pipeline (class labels shouldn't change, at least not too much). Consider that very many papers/techniques quantize down to 8 or even 4 bits (yes sometimes even during training) for the purposes of perf.

This whole research area should just be renamed to something like approximate computing so that people don't confuse it (and the goals) with classical numerical analysis.

saagarjha · 13m ago
Surely nobody is tiling 4x4 blocks for ML training
Scene_Cast2 · 3h ago
I can't name any applications off the top of my head, other than iterative matrix multiplication for approximate eigenvector finding in square matrices. But I don't know what's actually used for finding eigenvectors (or other decompositions, for that matter).
vqv · 3h ago
It’s pretty fundamental in a variety of multivariate statistical methods. If the rows of X are multivariate observations, then XX’ is the Gram matrix (of dot products). This can be used in clustering and regression. If the columns of X are (centered) multivariate observations, then XX’ is a scalar multiple of the sample covariance matrix. This is used in PCA.

But in large scale applications you may not want to store XX’ but instead are interested in computing products of the form XX’ v on the fly.
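
A minimal numpy sketch of both uses (variable names and sizes are mine):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.standard_normal((1000, 50))   # rows are observations

    # Rows as observations: X X' is the 1000x1000 Gram matrix of dot products.
    gram = X @ X.T                        # gram[i, j] = <x_i, x_j>

    # Columns as centered observations: Xc Xc' is a scalar multiple of the
    # sample covariance matrix (here it matches np.cov(X, rowvar=False)).
    Xc = (X - X.mean(axis=0)).T           # 50 x 1000
    cov = Xc @ Xc.T / (X.shape[0] - 1)

    # Large scale: apply X X' v on the fly, without storing the 1000x1000 product.
    v = rng.standard_normal(1000)
    y = X @ (X.T @ v)                     # two mat-vecs instead of one big matmul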

Bootvis · 8m ago
I expect this algorithm or similar to work for X^tX as well. Fun fact, that operation is common enough that a trading firm was named after it: XTX Trading.
TimorousBestie · 3h ago
It’s a fairly common operation. The sample covariance of a vector-valued random variable is XX^t/N.
bee_rider · 2h ago
Could be useful when doing an SVD as well (although, really, you don’t want X’X but X’Xv for some vector…).
blobbers · 59m ago
Covariance.
constantcrying · 2h ago
I think one particularly interesting result is that superior algorithms for numerical linear algebra exist at all and can be found by artificial intelligence.

XX^T is the matrix of all pairwise dot products of the vectors in X, and as others have pointed out there are legitimate applications. Another one is, e.g., transforming an overdetermined linear system into a least-squares problem.

FabHK · 2h ago
The solution to the least squares problem Ax ≈ b is given by x = (A'A)^-1 A'b, but modern solvers never form the matrix product A'A explicitly. Even for the SVD it isn't formed.
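
A minimal numpy sketch of the contrast (the normal-equations route is shown only for comparison):

    import numpy as np

    rng = np.random.default_rng(1)
    A = rng.standard_normal((500, 20))
    b = rng.standard_normal(500)

    # Normal equations: forms A'A explicitly and squares the condition number.
    x_normal = np.linalg.solve(A.T @ A, A.T @ b)

    # What library solvers do instead: QR/SVD-based least squares, A'A never formed.
    x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)

    print(np.allclose(x_normal, x_lstsq))  # True for this well-conditioned A;
                                           # the lstsq route is the stabler one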
BryanLegend · 3h ago
So if an AI company spends $5B on a cluster, is this optimization worth $250m?
disofu234 · 3h ago
Don't think so. This is an optimization on multiplying a matrix times its own transpose, not for generic matrix multiplication.
qoez · 2h ago
Definitely didn't cost that much to solve this particular problem (that cost is amortized over the lifetime of the cluster's existence). Compared to the CO2 it takes to feed academics eating red meat, this is like wind power compared to coal: it's way more environmentally friendly.
toolslive · 1h ago
As a general tip: X^-1 doesn't mean you have to invert the matrix. It's often notational shorthand for "you can solve the system". The same remark applies to X^t: it doesn't mean you have to build a new matrix, it just means you have to use the one you have in a different way. I've seen this butchered by scientists, who then complain that their performance sucks.
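
A minimal numpy illustration of the tip:

    import numpy as np

    rng = np.random.default_rng(2)
    A = rng.standard_normal((200, 200)) + 200 * np.eye(200)  # well conditioned
    b = rng.standard_normal(200)

    # "x = A^-1 b" written literally: forms the inverse, slower and less accurate.
    x_inv = np.linalg.inv(A) @ b

    # What the notation usually means in practice: solve the system (LU under the hood).
    x_solve = np.linalg.solve(A, b)
    assert np.allclose(x_inv, x_solve)

    # Likewise A^t needs no new matrix: A.T is just a strided view, and BLAS gemm
    # takes transpose flags instead of a physically transposed copy.
    y = A.T @ b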
selimthegrim · 1h ago
I guess the news is the “for small sizes” part
meisel · 2h ago
Is this like the Karatsuba algorithm, where it's theoretically faster but not actually faster when run on real hardware?

Btw, it's worth noting that if you know the result will be symmetric (as is the case for X * X^T), you can make things faster. For example, in cuBLAS, cublas*syrk (the variant optimized for a symmetric result) IME isn't faster than gemm, so what you can do instead is do smaller multiplications that fill in one of the two triangles piece by piece, and then copy that triangle to the other one.
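
Roughly this idea, as a numpy sketch (block size and names are arbitrary; on the GPU these would be batched gemm calls):

    import numpy as np

    def gram_by_triangles(X, bs=256):
        """Compute G = X @ X.T by filling only the lower-triangular blocks,
        then mirroring them, so about half the off-diagonal block products are skipped."""
        n = X.shape[0]
        G = np.empty((n, n), dtype=X.dtype)
        for i in range(0, n, bs):
            for j in range(0, i + bs, bs):
                blk = X[i:i+bs] @ X[j:j+bs].T     # one small gemm per block
                G[i:i+bs, j:j+bs] = blk
                if i != j:
                    G[j:j+bs, i:i+bs] = blk.T     # copy into the upper triangle
        return G

    X = np.random.default_rng(3).standard_normal((1000, 64))
    assert np.allclose(gram_by_triangles(X), X @ X.T)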

pbsd · 1h ago
Karatsuba is definitely faster than schoolbook multiplication at practical sizes. You presumably mean Strassen.
drlobster · 2h ago
If only they had written a paper you could scan through to see the speed improvements and the hardware used.

Oh well, never mind.

meisel · 2h ago
I figured it might, but I think this is a top-of-mind question for people and would be nice to make clear in the comments of the post too. So often there's some theoretical improvement to multiplication that isn't actually practical. Regardless, they don't seem to have posted results for CUDA, which is arguably more important than the CPU multiplication they tested.
tsurba · 2h ago
Probably why they tried to address it in the abstract already
tptacek · 2h ago
Let's have more comments from domain experts comparing algorithms in the abstract to cuBLAS, and less like these. Thanks!

If they're wrong to speculate, well, there's a whole paper you can just go skim to find the bit that rebuts them.

bee_rider · 2h ago
They mention 5% improvements and small matrices in the abstract, so my gut says (I haven't read the actual paper yet) that it is probably a practical-type algorithm.
qoez · 2h ago
Since this is 2x2, there are SIMD instructions that can do this (or do it with two SIMD dot products), both on CPU and inside each GPU core. So with current hardware you won't beat writing this out manually.
odo1242 · 1h ago
It’s faster for matrices of at least ~256x256, though it depends on hardware.
thrance · 2h ago
They tested it against BLAS primitives and found theirs to be faster, in hardware.
optimalsolver · 2h ago
>where it's theoretically faster but not actually faster when run on real hardware

Aka a galactic algorithm:

https://en.wikipedia.org/wiki/Galactic_algorithm

thrance · 2h ago
I feel like a galactic algorithm is slow even theoretically, unless you work with out-of-this-world data.

Whereas many algorithms are theoretically fine for real data but don't make good use of cache, or instruction sets...