Genes Shape the Brain – New Study Uncovers Genetic Influences on Brain Structure (helmholtz-munich.de)

I wish they could have modeled memory cache movements as well, somehow. It would have made the analysis more difficult but often these large matrix algorithms live or die by how cache-friendly the memory access patterns are.

Splitting into 4x4 blocks is typically very nice, though. Maybe it doesn’t matter so much to practical runtime.

grumpymuppet · 31d ago

I wish cache analysis was exposed more broadly everywhere. Everyone I deal with seems to understand and know how to think about it, but best practices seem to be wrapped up in a lot of heuristics.

I am aware of being able use hardware counters for profiling systems, but the more fine-grained exposure of what's happening at a lower level seems to be just hidden... Or I haven't found the right reference resource yet.

zacmps · 31d ago

Do you know any resources for learning these heuristics?

ttoinou · 31d ago

Are there researchers interested in accelerating these algorithms while also keeping the maximum accuracy of the results ? This and others optimizations are trading less multiplication for more additions, but from the little I know, on floating point additions and subtractions are risky precision wise, while multiplications are harmless. Also using FMA fused multiply add operations would be beneficial.

HappyPanacea · 31d ago

Yes see this: https://arxiv.org/abs/1507.00687 . There is also this answer ( https://mathoverflow.net/a/421380 ) in MathOverflow which state that you can't do better than cubic time if you want strong form of numerical stability.

ttoinou · 31d ago

Wow thank you I do not understand everything written and didn't take time to do so (and can't even find reference 23 paper on Brent stability), but I was thinking we don't necessarily need a new algorithm, we could tweak existing ones.

For example swapping columns and rows before applying a product, so that accuracy is best kept (then reswapping back at the end of the operation). It seems to me the Strassen-like algorithm choose arbitrarily some elements of the matrices over others to make their temporary sum-products variables, so we could exploit this asymmetry (not sure if it's the correct word here) to choose which elements are to be chosen

HappyPanacea · 30d ago

This sounds like interesting topic to investigate which I don't remember seeing but I don't think this will gain much. I think asymmetry is completely fine in this context.

almostgotcaught · 31d ago

> while also keeping the maximum accuracy of the results

All of these papers/algos are for the ML hype-train. ML algos are approximate anyway so no one cares about absolute accuracy, only the precision of the overall pipeline (class labels shouldn't change, at least not too much). Consider that very many papers/techniques quantize down to 8 or even 4 bits (yes sometimes even during training) for the purposes of perf.

This whole research area should just be renamed to something like approximate computing so that people don't confuse it (and the goals) with classical numerical analysis.

miki123211 · 31d ago

In nn training, you often get better perf at lower cost by increasing the number of parameters than by increasing parameter accuracy. Not sure if we actually know why, but that is usually the case.

E.G. a 7B model at FP32 will perform worse than a 14B model at BF16, all else being equal.

FP32 -> BF16 is really good, because modern GPUs are actually far faster at BF16 multiplications than at multiplications in FP32, not to mention the decreased memory and memory throughput requirements. 4-bit quants are much more of a mixed bag. You get the memory savings, which often means a difference between a cheap consumer GPU and a much more expensive datacenter GPU, or a single GPU versus a node with multiple ones and all the interconnect that entails. Your raw speed won't improve as dramatically, as you still need to convert from BF16 and back to do the actual computation. BF16 would probably make you memory-bound anyway, and 4-bit effectively gives you more throughput, so you get some savings, but the difference won't be as dramatic.

saagarjha · 31d ago

Surely nobody is tiling 4x4 blocks for ML training

almostgotcaught · 31d ago

Why surely? Eg on GPU style chips today you'd tile to warpsize (or half or something like that) to target warp cooperative primitives (so called tensor cores). On AMD and NV that's like 16x16 or 32x32 depending on the data type. That's not that far from 4x4 and it's not like all chips in the world have 32-lane warps. Anyway if a trick is good enough and people start trying to shoehorn it into everything (not saying this one is) then the next gen (or the one after that or after) will just have units to support the trick (it's an expensive way to gain ground on mlperf rankings).

saagarjha · 31d ago

Blackwell is doing matrix multiplies across thread blocks these days, and doesn’t understand anything smaller than 64x16x16. I assume large matrix multiples are using the bigger sizes like 128x256x16 or whatever.

6gvONxR4sf7o · 31d ago

In inference that's often the case, but numerical stability in training is often a massive lift for new algorithms.

ttoinou · 31d ago

I get your point, neural networks could be some kind of black boxes and how the sequence of operations evolves might not be so dependent on the accuracy of each operations -- at least, it would be a good intuition to believe "accuracy at each step matters" but it's not rigorously proven ? We could empirically check this by training a neural net and adding some random noise at each step

constantcrying · 31d ago

The question of stability of the algorithm is interesting, the paper does not seem to discuss it though. You are correct that we should expect the algorithms to deliver different results.

>This and others optimizations are trading less multiplication for more additions, but from the little I know, on floating point additions and subtractions are risky precision wise, while multiplications are harmless.

Certainly not true. In general different orders of magnitude are a problem. With multiplication and division these are more serve problems, as they lead to greater changes in magnitude, although even that is a generalization.

wizzwizz4 · 31d ago

Different orders of magnitude aren't a problem when multiplying floats, because the order-of-magnitude portion gets translated into addition of exponents.

They're a problem with fixed point: might you have been thinking of that?

Dylan16807 · 31d ago

We're not looking at instructions in a vacuum. No matter what you have both multiplications and additions happening.

So they're making a point that if you apply more multiplications before the inevitable addition, you are likely increasing the danger levels of that addition.

wizzwizz4 · 31d ago

Oh, if one of the multiplications yields a positive number, and another yields a negative number of similar magnitude? That makes sense; I forgot linear spaces were over fields.

lanstin · 31d ago

Additions are ok, it's subtraction when they might be close in value that is deadly.

convolvatron · 31d ago

additions arent strictly ok. if the two addends arent close together in magnitude, then one of them is going to lose some bits

lanstin · 31d ago

The result won't be far off from the correct answer, however.

amelius · 30d ago

Consider (A+B)-A, where A>>B.

lanstin · 29d ago

That has a substraction where the 2 values are nearly equaly, which is the dangerous op as I said.

BryanLegend · 31d ago

So if an AI company spends $5B on a cluster, is this optimization worth $250m?

disofu234 · 31d ago

Don't think so. This is an optimization on multiplying a matrix times its own transpose, not for generic matrix multiplication.

qoez · 31d ago

Definitely didn't cost that much to solve this particluar problem (that cost is amortized over the lifetime of the clusters existence). Compared to the CO2 to feed academics eating red meat, this is like wind power compared to coal it's way more environmentally friendly.

toolslive · 31d ago

as a general tip: X^-1 doesn't mean you have to inverse the matrix. It's often a notational shorthand for "you can solve the system". Same remark for X^t. It doesn't mean you have to build a new matrix. It just means you have to use the one you have in a different way. I've have seen this being butchered by scientists and then they complain their performance sucks.

jerf · 31d ago

Even computer programmers can get it wrong, and even reify the wrongness into interview questions. The correct answer to the common question of "how do you reverse an array" is that you don't reverse it. You read it backwards. Details vary from language to language; languages with iterators can find this particularly easy. But this is often not the "correct" answer according to the interviewer.

In the end, all data structures are functions. The academic applications of that may make your eyes glaze over, but the practical application is that if you want a reversed array/tree/etc., you just need the reading/iteration "function" for the array to return the data as if it was reversed. A matrix can be "transposed" just by reading its transpose. (Caching or other consequences may result in you choosing to manifest it in memory anyhow, but if it was stored cleverly in the first place, or you're only going to read it once anyhow such that using it once is as expensive as reading it once to reconstruct it anyhow, then there's no reason to.) An array can be randomly permuted simply by wrapping the read function with a permutation on the index, it need not be manifested in memory. etc.

kazinator · 31d ago

The "invert binary tree" interview question can be solved by swapping the accessors of the node type:

  tmpfn = tree.left
  tree.left = tree.right
  tree.right = tmpfn

Now when tree.left(node) is applied, it returns the right child, and vice versa. All traversals use the accessors, QED.

Sesse__ · 30d ago

> The correct answer to the common question of "how do you reverse an array" is that you don't reverse it. You read it backwards.

This only works if you can easily find the end during iteration. Which you cannot in e.g. the case where you reverse a part of a larger array and want to keep the result contiguous.

jerf · 30d ago

If you don't know what the end of your array is, you've got bigger problems.

I also don't understand why people seem to think that all you can do is maybe write one line of code or something. You can write as much code as it takes. Presumably, you know, not thousands and thousand just to read the array backwards or in some funky order, but you're not limited to calling the standard-library-provided "reverse" function and giving up if it doesn't exist.

These protests that "I can't do that because I'm just so helpless" make no sense to me. Read the array in whatever order you want, whenever you want. It's not that hard. It's just about the easiest thing there is. (Too much vibe coding?)

Sesse__ · 30d ago

> If you don't know what the end of your array is, you've got bigger problems.

That's called “a pointer”. Not all ways of representing arrays and spans come with explicit lengths.

> These protests that "I can't do that because I'm just so helpless" make no sense to me. Read the array in whatever order you want, whenever you want. It's not that hard. It's just about the easiest thing there is. (Too much vibe coding?)

I don't know why you think I am too helpless to write it; I've never said such a thing. I have never used AI to write code in my life, and I'm perfectly able to write code to both reverse an array and read it backwards (in multiple languages, thank you). But sometimes, you _actually want to reverse it_ because it makes other parts of the code more convenient and/or faster.

saagarjha · 31d ago

As I mentioned the last time you brought this up, one does not always have the luxury of “reading in reverse order” or “reading the transpose”. If your library/OS/hardware don’t give you an option or only work effectively in one representation then you have to actually do the work.

jerf · 30d ago

I defy you to name an environment in which it is impossible to read an array from backwards to forwards.

That's an array, as in, things already in memory. Not a stream. An array. A collection of fixed-size-in-RAM elements, contiguously in order.

That is not "an environment which doesn't provide a one-liner to do it". C, for instance, may not provide a backwards iterator, but then, it doesn't exactly provide a forwards iterator either. It is trivial to iterate backwards on an array in C. To the extent that it's a pain to provide a function that does it either way, that's C's problem, not the fact it can't be done. Everything's a pain in C.

Cache coherency is not an issue here either because I have specified it as being read once and already called out that if you're going to read it multiple times it may be worth it. Reading it once to do whatever it is you are doing is faster than reading it once, writing it in a new order, then reading it in that new order, at least on anything remotely resembling real, existing hardware.

saagarjha · 29d ago

Uh, no, you're missing the point. It's always possible for you to reverse the array and read it backwards. After all, that's how you are capable of reversing it! The problem is that the thing that you are handing it to might not be capable of that. You can't pass an array to write(2) and tell it to actually print it in reverse order. You can't mmap a page where first byte of data is actually at the high address. Your DMA engine won't flip your buffer for you. If you want to use these APIs, you have to reverse the array before you hand it to them. There's just no way around it.

Also, I do want to note that just because your language is higher level and takes iterators or sequences or whatever abstraction you have on top of "a bunch of elements" doesn't actually mean that reversing an array is never worth it. It is often the case that your API/compiler/language will support passing in an arbitrary "view", but if you actually try to do this you will find that it can only fully reason about the "forward, contiguous" buffer case, which it will optimize well. If you pass it anything funny it will fall back to the element-by-element code which can be an order of magnitude slower, or more (usually this means no vectorization, for example). In this case it's often better to pull out an optimized reverse and then pass a normal array to the API. Even though it would read the data twice this can still be a lot faster.

cmovq · 31d ago

I see this a lot in graphics, where people will for example say to create a view matrix you first create your camera transform and then invert it. When the better solution is to just directly construct the inverted matrix.

There is almost never a good reason to invert a full 4x4/3x3 transform, yet I see it all the time.

Scene_Cast2 · 31d ago

I can't name any applications off the top of my head, other than iterative matrix multiplication for approximate eigenvector finding in square matrixes. But I don't know what's actually used for finding eigenvectors (or other decompositions for that matter).

vqv · 31d ago

It’s pretty fundamental in a variety of multivariate statistical methods. If the rows of X are multi variate observations, then XX’ is the Gram matrix (of dot products). This can be used in clustering and regression. If the columns of X are (centered) multivariate observations, then XX’ is a scalar multiple of the sample covariance matrix. This is used in PCA.

But in large scale applications you may not want to store XX’ but instead are interested in computing products of the form XX’ v on the fly.

adgjlsfhk1 · 31d ago

generally good implementations use QR instead

TimorousBestie · 31d ago

It’s a fairly common operation. The sample covariance of a vector-valued random variable is XX^t/N.

constantcrying · 31d ago

I think one particularly interesting result is that superior algorithms for numerical linear algebra exist at all and can be found by artificial intelligence.

XX^T is the matrix of all piecewise dot products of vectors in X and as others have pointed out there are legitimate applications. Another one being e.g. transforming an undetermined linear system into a least squares problem.

FabHK · 31d ago

The solution to the least squares problem Ax ≈ b is given by x = (A'A)^-1 A'b, but modern solvers never form the matrix product A'A explicitly. Even for the SVD it isn't formed.

jerf · 31d ago

Google put something out like this recently too. The original is https://deepmind.google/discover/blog/alphaevolve-a-gemini-p..., but it's sort of buried in there. Here's the YouTube video that introduced it to me, forwarded to the spot where the host talks about the improved 4x4 matrix multiplication algorithm: https://www.youtube.com/watch?v=sGCmu7YKgPA&t=396s It's only a slight tweak, but it's a slight tweak to something pretty highly studied and still impressive.

Bootvis · 31d ago

I expect this algorithm or similar to work for X^tX as well. Fun fact, that operation is common enough that a trading firm was named after it: XTX Markets.

bee_rider · 31d ago

Could be useful when doing an SVD as well (although, really, you don’t want X’X but X’Xv for some vector…).

blobbers · 31d ago

Covariance.

jethkl · 30d ago

how helpful was ai for this? The paper is light on details, but it says the agent was used to generate a kind of seed set (rank-1 bilinear products) that were then fed into the subsequent steps. Evidently this idea succeeded. Curious if anyone here has insight into if this is a common technique, how this agent's output would compare to random or a simple heuristic that attempts the same. Also interested to see how the training objective gets defined since the final task is a couple of steps downstream from what the agent generates.

kazinator · 31d ago

Under multiplication, rows go into columns and all that. So if we transpose things, rows are being dot-producted with (what were originally) rows:

  [ a b ] [ a c ] = [ (a b) . (a b)  (a b) . (c d) ] = [   |(a b)]^2     (a b) . (c d) ]
  [ c d ] [ b d ]   [ (c d) . (a b)  (c d) . (c d) ]   [ (c d) . (a b)      |(c d)|^2  ]

Down the diagonal we have squared norms of (a b) and (c d) which is interesting and could somehow be used.

We also have repeated values between the upper and lower triangle because (a b) . (c d) is the same as (c d) . (a b): the dot product commutes.

Whereas just squaring the matrix:

  [ a b ] [ a b ] = [ (a b) . (a c)  (a b) . (b d) ]
  [ c d ] [ c d ]   [ (c d) . (a c)  (c d) . (b d) ]

all four dot products are unique.

So right of the bat, we can find interesting things about X X^t without going deep, and an obvious shortcut.

selimthegrim · 31d ago

I guess the news is the “for small sizes” part

meisel · 31d ago

Is this like the Karatsuba algorithm, where it's theoretically faster but not actually faster when run on real hardware?

Btw, it's worth noting that if you know that the result will be symmetric (such as is the case for X * X^T), you can make things faster. For example in cuBLAS, cublas*syrk (the variant optimized for when the result is symmetric) IME isn't faster than gemm, so what you can do instead is just do smaller multiplications that fill in one of the two triangles piece by piece, and then copy that triangle to the other one.

pbsd · 31d ago

Karatsuba is definitely faster than schoolbook multiplication at practical sizes. You presumably mean Strassen.

thrance · 31d ago

They tested it against BLAS primitives and found theirs to be faster, in hardware.

bee_rider · 31d ago

The mention 5% improvements and small matrices in the abstract, so my gut says (I haven’t read the actual paper yet) that is probably is a practical-type algorithm.

qoez · 31d ago

Since this is 2x2 there's simd instructions that can do this (or with two simd dot products) both on CPU and inside each GPU core. So with current hardware you won't beat writing this out manually.

odo1242 · 31d ago

It’s faster for matrices of approximately at least ~256x256, though it depends on hardware

optimalsolver · 31d ago

>where it's theoretically faster but not actually faster when run on real hardware

Aka a galactic algorithm:

https://en.wikipedia.org/wiki/Galactic_algorithm

thrance · 31d ago

I feel like a galactic algorithm is slow even theoretically, unless you work with out-of-this-world data.

Whereas many algorithms are theoretically fine for real data but don't make good use of cache, or instruction sets...

End-to-End Encryption: Architecturally Necessary (labs.ripe.net)

AI Is Changing Work–and Indie Devs Feel It First (plumshell.com)

Gemini Flash 2.5, Imagen 4 and Veo 2 Chaining for Multi-modal Characters (musings-mr.net)

Years of inactivity in "Pay or OK" cases: noyb sues German DPAs (noyb.eu)

Air India: Is There a Safest Seat in a Plane Crash? (nytimes.com)

Nodepass: Secure, efficient TCP/UDP tunneling solution (github.com)

East German Uprising of 1953 (en.wikipedia.org)

Show HN: I created a Chrome Extension that improves X/Twitter (getreplyguy.com)

Show HN: Zentra – The AI Travel Planner (zentra.hridya.tech)

Making Playpen Sans (type-together.com)

Show HN: A boilerplate for Kotlin Multiplatform to launch Android and iOS apps (kmpship.app)

Pentagon pizza monitor predicted 'busy night' ahead of Israel's attack on Iran (theguardian.com)

Phkmalloc (phk.freebsd.dk)

The Fediverse: A New Era of Social Media (youtube.com)

The 'Superlative' Injunction: India's Pirate Site Blockades Go Next Level (torrentfreak.com)

Show HN: Taiyaki – AI-generated 3D jewelry from photos and sketches (taiyaki.ai)

Show HN: AI Text to Music Generator (text2music.net)

John Carmack at Upper Bound 2025 [video] (youtube.com)

MiniWork – An Intelligent and Efficient AI Tool Platform

Along the Green Line (theguardian.com)

MiniMax-M1, first open-weight, large-scale hybrid-attention reasoning model (github.com)

BMW ConnectedDrive lets me control my returned rental car (Sixt)

Is AI code sabotaging its own ROI case? (leaddev.com)

Do humans pose a threat to artificial intelligence?

Save Our Wallets: Protect Your Right to Transact Freely (saveourwallets.org)

Slint 1.12 Released with WGPU Support, iOS Port, and Figma Variables Integration (slint.dev)

Making Complex Technical Concepts Click (leadinginproduct.com)

The Battle for Starbucks' Coffee-Making Crown (wsj.com)

Show HN: Jan-nano, 4B agentic model that outperforms DeepSeek-v3-671B using MCP (twitter.com)

VibeTunnel: Turn Any Browser into Your Mac's Terminal (steipete.me)

Sound Static Data Race Verification for C: Is the Race Lost? (pldi25.sigplan.org)

People wrecking the climate 140 years ago – we just lacked the tech to spot it (nature.com)

A different approach at liquid glass on the web (specy.app)

Defense Department signs OpenAI for $200M 'frontier AI' pilot project (theregister.com)

Labubu Doll AI Generator (monet.vision)

I Tried Pre-Ordering the Trump Phone. The Page Failed (404media.co)

Timmy Mallet completes cycle around Ireland (rte.ie)

A planetarium show discovered a spiral at the edge of our solar system (fastcompany.com)

Missing the Road for the Trees Safety Anomalies in Tesla's FSD System (drive.google.com)

Tankers Front Eagle and Adalynn Collide East of the Strait of Hormuz [video] (youtube.com)

Michael I. Jordan – scaling excellence in A.I., statistics and beyond [video] (youtube.com)

Genes Shape the Brain – New Study Uncovers Genetic Influences on Brain Structure (helmholtz-munich.de)

Fun with Telnet (brandonrozek.com)

Working Memory, It's More Complicated Than We Thought (nyu.edu)

Poll: Do "tech" companies design, build and distribute products

Polyhedra Viewer (polyhedra.tessera.li)

Startup seeks Trump AI emergency for California tech city (thenerdreich.com)

A Most Important Artifact (2015) (cen.acs.org)

Understanding Assembly Indices (molecular-assembly.com)

Why Kubernetes Throttled My Idle Pods (mattthorne.github.io)

X X^t can be faster

Comments (60)