I'm not an expert in this space, but is this meaningful? I'd assume that it's more common to fuse together transposition with an operation that precedes or follows it (e.g. matmul), which should be far more efficient than materializing the entire transposition in memory if it's just an intermediate value.
musebox35 · 3h ago
Matrix transpose is a canonical example of a memory-bound operation and is often used to showcase optimization in a particular programming language or library. See for example the CUTLASS matrix transpose tutorial by Jay Shah, one of the authors of the Flash Attention 3 paper: https://research.colfax-intl.com/tutorial-matrix-transpose-i...
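To make the memory-bound point concrete, here is a minimal naive CUDA transpose (a generic sketch, not the tutorial's code). It performs zero arithmetic, so its throughput is set entirely by how well its loads and stores coalesce:

    // Naive transpose of an n x n row-major matrix: out = in^T.
    // Consecutive threads read consecutive addresses of `in`
    // (coalesced), but their writes to `out` are n floats apart
    // (uncoalesced), so memory traffic alone dictates the speed.
    __global__ void transpose_naive(const float* in, float* out, int n) {
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        int row = blockIdx.y * blockDim.y + threadIdx.y;
        if (row < n && col < n) {
            out[col * n + row] = in[row * n + col];
        }
    }

Essentially every optimization in tutorials like this one is about fixing that coalescing problem, typically by staging tiles through shared memory.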
londons_explore · 10h ago
Why do we ever need to transpose a matrix?
Isn't it better to simply combine the transposition with whatever next operation one wishes to do with the matrix?
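In case it helps to see what "combining" means in practice: here is a hypothetical fused kernel computing C = A^T * B without ever materializing A^T. The transpose is nothing more than the index order used when reading A (a naive sketch for illustration only):

    // A is k x m, B is k x n, C is m x n, all row-major.
    // Reading A[l * m + row] over l walks column `row` of A,
    // i.e. row `row` of A^T, so no separate transpose pass is needed.
    __global__ void matmul_at_b(const float* A, const float* B, float* C,
                                int m, int n, int k) {
        int row = blockIdx.y * blockDim.y + threadIdx.y;  // row of C
        int col = blockIdx.x * blockDim.x + threadIdx.x;  // col of C
        if (row < m && col < n) {
            float acc = 0.0f;
            for (int l = 0; l < k; ++l)
                acc += A[l * m + row] * B[l * n + col];
            C[row * n + col] = acc;
        }
    }

Whether this beats transposing first depends on the access pattern the fused read produces, which is what the replies get into.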
hogepodge · 9h ago
You're right that a good graph compiler will do this for you. There may still be times, like when you're interfacing with another library, where you'll need to switch a matrix between row-major and column-major layouts.
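The classic instance of this is calling cuBLAS, which assumes column-major storage, from row-major code. Rather than transposing buffers, you can use the identity (A*B)^T = B^T * A^T and just swap the operands. A sketch of that trick (double-check the details against the cuBLAS docs before relying on it):

    #include <cublas_v2.h>

    // Row-major C = A * B (A: m x k, B: k x n, C: m x n) on top of
    // column-major cuBLAS. A row-major matrix is the column-major
    // view of its own transpose, so we ask for C^T = B^T * A^T by
    // swapping the operand order and dimensions.
    void gemm_row_major(cublasHandle_t h, int m, int n, int k,
                        const float* dA, const float* dB, float* dC) {
        const float alpha = 1.0f, beta = 0.0f;
        cublasSgemm(h, CUBLAS_OP_N, CUBLAS_OP_N,
                    n, m, k,   // dimensions of C^T, which is n x m
                    &alpha,
                    dB, n,     // B viewed column-major is B^T (ld = n)
                    dA, k,     // A viewed column-major is A^T (ld = k)
                    &beta,
                    dC, n);    // C^T column-major = C row-major
    }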
throwawayabcdef · 9h ago
The next operation might need the data in column-major order to read it fast, so you might have to transpose first. And these may be concurrent stages of a processing pipeline.
viraptor · 8h ago
Now I'm curious: how many times do you have to fully read the matrix on the GPU before the total cost of reading columns is higher than a one-off actual transpose followed by sequential row reads? I know it depends on lots of things; I'm after a rough estimate.
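Here is one rough way to frame it, with every bandwidth number assumed rather than measured. Let S be the matrix size in bytes and k the number of full reads:

    cost of reading columns k times:   k * S / B_col
    cost of transpose + row reads:     2S / B_t + k * S / B_row
                                       (the transpose itself moves 2S bytes)

    break-even:  k = (2 / B_t) / (1/B_col - 1/B_row)

    e.g. B_row = B_t = 3000 GB/s, B_col = 400 GB/s (~8x strided penalty):
         k = (2/3000) / (1/400 - 1/3000) ≈ 0.3

So with a strided-read penalty that large, the transpose pays for itself before even one full pass; with a milder 2x penalty (B_col = 1500 GB/s) the same formula gives k = 2.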
fulafel · 3h ago
This could make Mojo look even better, as it would be more compute heavy and the last-step thread reduction would be less relevant.
melodyogonna · 2h ago
I wonder if there is a reason for not using the high level abstractions provided by Modular
arjvik · 13h ago
Where's the 14%? Looks like their final kernels show a 0.14% improvement of Mojo over the equivalent CUDA kernel?
Updated the title to the original. I did base the numbers on "This kernel archives 1437.55 GB/s compared to the 1251.76 GB/s we get in CUDA" (14.8%), which is still impressive.
vlan121 · 13h ago
Mojo's compiler is closed source. That's a big no-no.
dgurchenkov · 10h ago
I work on Mojo. The whole compiler, runtime, etc. will get open sourced, most likely within a year. It is just a matter of time and us getting all the required work done. https://docs.modular.com/mojo/faq/#open-source
Are you talking about your libc equivalent or MAX?
colesantiago · 13h ago
Does anyone use Mojo in production at all, or is anyone even hiring for Mojo?
melodyogonna · 9m ago
Modular (the company behind Mojo) uses it in production. I imagine that if they have any clients then those also use Mojo in production - albeit indirectly - since all the GPU kernels used by Modular are written in Mojo.
jsnell · 13h ago
The "Switching to Mojo gave a 14% improvement over CUDA" title is editorialized, the original is "Highly efficient matrix transpose in Mojo".
Also, the improvement is 0.14%, not 14% ((2771.35/2775.49 - 1) * 100 ≈ -0.149), making the editorialized linkbait particularly egregious.
timmyd · 10h ago
[op here] To be clear: yes, there are 3 kernels; you can see them in the GitHub repo linked at the end of the article. These are:
transpose_naive - Basic implementation with TMA transfers
transpose_swizzle - Adds swizzling optimization for better memory access patterns (see the sketch after this list)
transpose_swizzle_batched - Adds thread coarsening (batch processing) on top of swizzling
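For anyone wondering what the swizzle buys: reading a column out of a 32x32 shared-memory tile normally hits the same bank 32 times. XOR-ing the column index with the row index spreads a warp's accesses across all 32 banks. A generic CUDA sketch of the technique (the article's Mojo version uses TMA and differs in detail):

    #define TILE 32  // assumes a 32x32 thread block

    __global__ void transpose_swizzle(const float* in, float* out, int n) {
        __shared__ float tile[TILE][TILE];

        int x = blockIdx.x * TILE + threadIdx.x;
        int y = blockIdx.y * TILE + threadIdx.y;
        if (x < n && y < n)  // store with a swizzled column index
            tile[threadIdx.y][threadIdx.x ^ threadIdx.y] = in[y * n + x];
        __syncthreads();

        // Swap block coordinates so the global store stays coalesced,
        // and undo the swizzle when reading the tile back.
        x = blockIdx.y * TILE + threadIdx.x;
        y = blockIdx.x * TILE + threadIdx.y;
        if (x < n && y < n)
            out[y * n + x] = tile[threadIdx.x][threadIdx.y ^ threadIdx.x];
    }

Thread coarsening (the batched variant) then has each thread move several tile rows instead of one, amortizing index arithmetic and giving the scheduler more independent memory operations per thread.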
Performance comparison with CUDA: The Mojo implementations achieve bandwidths of:
transpose_naive: 1056.08 GB/s (32.0025% of max)
transpose_swizzle: 1437.55 GB/s (43.5622% of max)
transpose_swizzle_batched: 2775.49 GB/s (84.1056% of max)
via the GitHub - simveit/efficient_transpose_mojo
Comparing to the CUDA implementations mentioned in the article:
Naive kernel: Mojo achieves 1056.08 GB/s vs CUDA's 875.46 GB/s
Swizzle kernel: Mojo achieves 1437.55 GB/s vs CUDA's 1251.76 GB/s
Batched swizzle kernel: Mojo achieves 2775.49 GB/s vs CUDA's 2771.35 GB/s
So there is indeed a highly efficient matrix transpose in Mojo.
All three Mojo kernels outperform their CUDA counterparts, with the naive and swizzle kernels showing significant improvements (20.6% and 14.8% faster respectively), while the final optimized kernel achieves essentially identical performance (slightly better by 4.14 GB/s).
The "flag" here seemed innapropriate given that its true this implementation is indeed faster, and certainly the final iteration could be improved on further. It wasn't wrong to say 14% or even 20%.
jsnell · 9h ago
Users of the site only have one control available: the flag. There's no way to object only to the title but not to the post, and despite what you say, that title hit the trifecta: not the original title, factually incorrect, and clickbait. So I'm not that surprised it got flagged (even if I did not flag it myself).
Email the mods at hn@ycombinator.com. There's a chance they'll remove the flag and re-up the post.
timmyd · 9h ago
Thanks jsnell - I did, and they appreciated the comment above and unflagged it. I appreciate it!
atomicapple · 13h ago
I think the OP based the title off of "This kernel archives 1437.55 GB/s compared to the 1251.76 GB/s we get in CUDA" (14.8%) and not the final kernels for whatever reason
No comments yet
jebarker · 13h ago
Yeah, it seems like the blog post is just meant to be an example of how to do something in Mojo and not a dunk on CUDA.
timmyd · 9h ago
FWIW I didn't take the blog as a dunk on CUDA, just as an impressive outcome from the blog writer in Mojo. It's awesome to see this on Hopper - if it makes it go faster, that's awesome.
baal80spam · 13h ago
0.14% is within the limits of statistical error. So this is a nothing-"article".
jsnell · 13h ago
I don't think that's fair. The article promised a highly efficient kernel and seems to have delivered exactly that, which isn't "nothing". My beef is entirely with the submitted title.
voronar · 13h ago
Mr. Mojo Risin'
noracists · 13h ago
slop
almostgotcaught · 9h ago
As someone said below - you'd never write just a transpose kernel - it'll be fused into something else.
htrp · 13h ago
Left unsaid, the 14% improvement in performance came at the cost of increasing dev time by 35%
bravetraveler · 13h ago
Reminds me of this, lol:
> "From the moment I understood the weakness of my flesh, it disgusted me. I craved the strength and certainty of steel."
14% all the time vs 35% some of the time
edit: Closing numbers are far less impressive than those buried in the middle of the post. Confusing; bye everyone