Show HN: I Am 15 and Built a Dual Backend MLP from Scratch Using CUDA C++

muchlakshay · 7/23/2025, 6:59:29 AM · github.com
hii everyone! I'm a 15-year-old and I just completed a dual backend MLP from scratch that supports both CPU and GPU (CUDA) training.

for the CPU backend, I used only Eigen for linear algebra, nothing else.

for the GPU backend, I implemented my own custom matrix library in CUDA C++. The CUDA kernels aren’t optimized with shared memory, tiling, or fused ops (so there’s some kernel launch overhead), but I chose clarity, modularity, and reusability over a few milliseconds of speedup.

that said, I've taken care to ensure coalesced memory access, and it gives pretty solid performance, around 0.4 ms per epoch on MNIST (batch size = 1000) using an RTX 3060.
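to give a sense of the style, here's a simplified sketch of the kind of kernel i mean (illustrative only, not the exact code from the repo; the kernel and variable names are made up): one thread per element, so consecutive threads in a warp touch consecutive addresses and the global loads/stores coalesce.

  // illustrative sketch, not the repo code: element-wise bias + ReLU where
  // thread i handles element i of a row-major [rows x cols] matrix, so
  // neighboring threads access neighboring addresses (coalesced)
  __global__ void bias_relu_kernel(const float* __restrict__ z,
                                   const float* __restrict__ bias,
                                   float* __restrict__ out,
                                   int rows, int cols) {
      int idx = blockIdx.x * blockDim.x + threadIdx.x;  // one thread per element
      if (idx < rows * cols) {
          int col = idx % cols;                         // column -> which bias to add
          float v = z[idx] + bias[col];
          out[idx] = v > 0.0f ? v : 0.0f;               // ReLU
      }
  }

  // launch with enough blocks to cover the whole matrix, e.g.:
  // int threads = 256;
  // int blocks  = (rows * cols + threads - 1) / threads;
  // bias_relu_kernel<<<blocks, threads>>>(d_z, d_bias, d_out, rows, cols);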

This project is a big step up from my previous one. It's cleaner, well-documented, and more modular.

I’m fully aware of areas that can be improved, and I’ll be working on them in future projects. My long-term goal is to get into Harvard or MIT, and this is part of that journey.

would love to hear your thoughts, suggestions, or feedback

I've attached the link to my GitHub repo.

Comments (2)

onelli · 6h ago
Love seeing young devs shipping real projects! Out of curiosity, have you tried benchmarking your MLP on any real-world data sets, or was this mainly about learning CUDA/C++? (And what’s the biggest gotcha you ran into?)
muchlakshay · 5h ago
thanks!!!! appreciate that a lot. i've mainly tested it on MNIST for now; the CUDA backend trains one epoch in ~0.4 ms (batch size 1000, RTX 3060, as i mentioned in the post). It was primarily a deep dive into CUDA/C++, manual memory management, and building a dual backend architecture with a custom matrix lib (the GPU side completely from scratch).

this was actually my 4th serious attempt at building a GPU-based MLP from scratch. I failed multiple times, sometimes due to a single line of code. in earlier attempts, i had this optimization idea: store both the weights and their transposes in GPU memory so i wouldn't have to recompute the transpose each epoch. Seemed clever, until training started failing badly. Turned out I was only updating the original weights matrix after backprop, while the transposed copy was still holding stale values from earlier updates. that broke training completely, and I spent weeks trying to debug it; i couldn't figure it out until this current version.
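roughly, the idea looked like this (a simplified illustration, not the actual repo code; i'm showing it with Eigen on the CPU just to keep it short, but on the GPU it was the same story with two device buffers and only one of them getting updated):

  #include <Eigen/Dense>
  #include <iostream>

  // simplified illustration of the stale-transpose bug (not the actual repo code)
  int main() {
      const int in_dim = 4, out_dim = 3;
      const float lr = 0.1f;

      Eigen::MatrixXf W  = Eigen::MatrixXf::Random(out_dim, in_dim);
      Eigen::MatrixXf WT = W.transpose();        // the "clever" cached transpose

      for (int epoch = 0; epoch < 5; ++epoch) {
          // stand-in for backprop: pretend the gradient is just W itself
          Eigen::MatrixXf gradW = W;
          W -= lr * gradW;                       // only W gets the update
          // WT = W.transpose();                 // <-- the missing line: without it,
                                                 //     WT keeps serving stale weights
      }

      // WT no longer matches W^T, so a backward pass reading WT is
      // working with weights from several updates ago
      std::cout << "max drift between WT and W^T: "
                << (WT - W.transpose()).cwiseAbs().maxCoeff() << "\n";
  }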

honestly, the biggest gotchas were:

- memory coherence issues like the one above (esp. when trying to cache 'smartly')

- launching kernels in the right order while keeping data in sync (see the small example after this list)

- maintaining modularity without sacrificing too much performance
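on the kernel-ordering one: what helps is that kernels launched on the same (default) stream already execute in launch order, so the main thing is making sure the host waits before it reads device results back. a tiny self-contained example (illustrative, not from the repo):

  #include <cuda_runtime.h>

  __global__ void scale(float* v, int n, float s) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n) v[i] *= s;
  }
  __global__ void add_one(float* v, int n) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n) v[i] += 1.0f;
  }

  int main() {
      const int n = 1000;
      float h[n];
      for (int i = 0; i < n; ++i) h[i] = 1.0f;

      float* d;
      cudaMalloc(&d, n * sizeof(float));
      cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);

      // same default stream -> these run in the order they were launched,
      // so scale() is guaranteed to finish before add_one() starts
      scale<<<(n + 255) / 256, 256>>>(d, n, 2.0f);
      add_one<<<(n + 255) / 256, 256>>>(d, n);

      // the host, however, has to wait before reading the result back; a blocking
      // cudaMemcpy does that implicitly (cudaDeviceSynchronize() also works)
      cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);
      cudaFree(d);
  }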

I avoided fused kernels/shared memory in this version to keep things clean and reusable, but now that the core works, I plan to start optimizing that layer too.
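for reference, the shared-memory/tiling idea i mean is the standard tiled matmul pattern, roughly like this (a sketch, not code from the repo; TILE and the names are placeholders):

  // rough sketch of the standard shared-memory tiled matmul (not repo code):
  // each block stages TILE x TILE sub-tiles of A and B in shared memory, so
  // every global value is read once per tile instead of once per output element
  #define TILE 16

  __global__ void matmul_tiled(const float* A, const float* B, float* C,
                               int M, int N, int K) {   // C[MxN] = A[MxK] * B[KxN]
      __shared__ float As[TILE][TILE];
      __shared__ float Bs[TILE][TILE];

      int row = blockIdx.y * TILE + threadIdx.y;
      int col = blockIdx.x * TILE + threadIdx.x;
      float acc = 0.0f;

      for (int t = 0; t < (K + TILE - 1) / TILE; ++t) {
          int a_col = t * TILE + threadIdx.x;
          int b_row = t * TILE + threadIdx.y;
          As[threadIdx.y][threadIdx.x] = (row < M && a_col < K) ? A[row * K + a_col] : 0.0f;
          Bs[threadIdx.y][threadIdx.x] = (b_row < K && col < N) ? B[b_row * N + col] : 0.0f;
          __syncthreads();                              // tile fully loaded

          for (int k = 0; k < TILE; ++k)
              acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
          __syncthreads();                              // done with this tile
      }
      if (row < M && col < N)
          C[row * N + col] = acc;
  }

  // launch with blockDim = (TILE, TILE) and
  // gridDim = ((N + TILE - 1) / TILE, (M + TILE - 1) / TILE)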