Show HN: I Am 15 and Built a Dual Backend MLP from Scratch Using CUDA C++

muchlakshay · 7/23/2025, 6:59:29 AM · github.com
hii everyone! I'm a 15-year-old and I just completed a dual backend MLP from scratch that supports both CPU and GPU (CUDA) training.

for the CPU backend, I used only Eigen for linear algebra, nothing else.

for the GPU backend, I implemented my own custom matrix library in CUDA C++. The CUDA kernels aren’t optimized with shared memory, tiling, or fused ops (so there’s some kernel launch overhead), but I chose clarity, modularity, and reusability over a few milliseconds of speedup.

that said, I've taken care to ensure coalesced memory access, and it gives pretty solid performance, around 0.4 ms per epoch on MNIST (batch size = 1000) using an RTX 3060.
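to give a sense of the style, here's a simplified sketch of the kind of kernel i mean (illustrative only, not the exact code from the repo; the kernel and variable names are made up): one thread per element, so consecutive threads in a warp touch consecutive addresses and the global loads/stores coalesce.

  // illustrative sketch, not the repo code: element-wise bias + ReLU where
  // thread i handles element i of a row-major [rows x cols] matrix, so
  // neighboring threads access neighboring addresses (coalesced)
  __global__ void bias_relu_kernel(const float* __restrict__ z,
                                   const float* __restrict__ bias,
                                   float* __restrict__ out,
                                   int rows, int cols) {
      int idx = blockIdx.x * blockDim.x + threadIdx.x;  // one thread per element
      if (idx < rows * cols) {
          int col = idx % cols;                         // column -> which bias to add
          float v = z[idx] + bias[col];
          out[idx] = v > 0.0f ? v : 0.0f;               // ReLU
      }
  }

  // launch with enough blocks to cover the whole matrix, e.g.:
  // int threads = 256;
  // int blocks  = (rows * cols + threads - 1) / threads;
  // bias_relu_kernel<<<blocks, threads>>>(d_z, d_bias, d_out, rows, cols);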

This project is a big step up from my previous one. It's cleaner, well-documented, and more modular.

I’m fully aware of areas that can be improved, and I’ll be working on them in future projects. My long-term goal is to get into Harvard or MIT, and this is part of that journey.

would love to hear your thoughts, suggestions, or feedback

I've attached the link to my GitHub repo.

Comments (2)

onelli · 6h ago
Love seeing young devs shipping real projects! Out of curiosity, have you tried benchmarking your MLP on any real-world data sets, or was this mainly about learning CUDA/C++? (And what’s the biggest gotcha you ran into?)
muchlakshay · 5h ago
thanks!!!! appreciate that a lot. i've mainly tested it on MNIST for now; the CUDA backend trains one epoch in ~0.4 ms (batch size 1000, RTX 3060, as i mentioned in the post). It was primarily a deep dive into CUDA/C++, manual memory management, and building a dual backend architecture with a custom matrix lib (the GPU side completely from scratch).

this was actually my 4th serious attempt at building a GPU-based MLP from scratch. I failed multiple times, sometimes due to a single line of code. in earlier attempts, i had this optimization idea: store both the weights and their transposes in GPU memory so i wouldn't have to recompute the transpose each epoch. Seemed clever, until training started failing badly. Turned out I was only updating the original weights matrix after backprop, while the transposed copy was still holding stale values from earlier updates. that broke training completely, and I spent weeks trying to debug it; i couldn't figure it out until this current version.
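roughly, the idea looked like this (a simplified illustration, not the actual repo code; i'm showing it with Eigen on the CPU just to keep it short, but on the GPU it was the same story with two device buffers and only one of them getting updated):

  #include <Eigen/Dense>
  #include <iostream>

  // simplified illustration of the stale-transpose bug (not the actual repo code)
  int main() {
      const int in_dim = 4, out_dim = 3;
      const float lr = 0.1f;

      Eigen::MatrixXf W  = Eigen::MatrixXf::Random(out_dim, in_dim);
      Eigen::MatrixXf WT = W.transpose();        // the "clever" cached transpose

      for (int epoch = 0; epoch < 5; ++epoch) {
          // stand-in for backprop: pretend the gradient is just W itself
          Eigen::MatrixXf gradW = W;
          W -= lr * gradW;                       // only W gets the update
          // WT = W.transpose();                 // <-- the missing line: without it,
                                                 //     WT keeps serving stale weights
      }

      // WT no longer matches W^T, so a backward pass reading WT is
      // working with weights from several updates ago
      std::cout << "max drift between WT and W^T: "
                << (WT - W.transpose()).cwiseAbs().maxCoeff() << "\n";
  }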

honestly, the biggest gotchas were:

- memory coherence issues like the one above (esp. when trying to cache 'smartly')

- launching kernels in the right order while keeping data in sync (see the small example after this list)

- maintaining modularity without sacrificing too much performance
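on the kernel-ordering one: what helps is that kernels launched on the same (default) stream already execute in launch order, so the main thing is making sure the host waits before it reads device results back. a tiny self-contained example (illustrative, not from the repo):

  #include <cuda_runtime.h>

  __global__ void scale(float* v, int n, float s) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n) v[i] *= s;
  }
  __global__ void add_one(float* v, int n) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n) v[i] += 1.0f;
  }

  int main() {
      const int n = 1000;
      float h[n];
      for (int i = 0; i < n; ++i) h[i] = 1.0f;

      float* d;
      cudaMalloc(&d, n * sizeof(float));
      cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);

      // same default stream -> these run in the order they were launched,
      // so scale() is guaranteed to finish before add_one() starts
      scale<<<(n + 255) / 256, 256>>>(d, n, 2.0f);
      add_one<<<(n + 255) / 256, 256>>>(d, n);

      // the host, however, has to wait before reading the result back; a blocking
      // cudaMemcpy does that implicitly (cudaDeviceSynchronize() also works)
      cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);
      cudaFree(d);
  }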

I avoided fused kernels/shared memory in this version to keep things clean and reusable, but now that the core works, I plan to start optimizing that layer too.
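for reference, the shared-memory/tiling idea i mean is the standard tiled matmul pattern, roughly like this (a sketch, not code from the repo; TILE and the names are placeholders):

  // rough sketch of the standard shared-memory tiled matmul (not repo code):
  // each block stages TILE x TILE sub-tiles of A and B in shared memory, so
  // every global value is read once per tile instead of once per output element
  #define TILE 16

  __global__ void matmul_tiled(const float* A, const float* B, float* C,
                               int M, int N, int K) {   // C[MxN] = A[MxK] * B[KxN]
      __shared__ float As[TILE][TILE];
      __shared__ float Bs[TILE][TILE];

      int row = blockIdx.y * TILE + threadIdx.y;
      int col = blockIdx.x * TILE + threadIdx.x;
      float acc = 0.0f;

      for (int t = 0; t < (K + TILE - 1) / TILE; ++t) {
          int a_col = t * TILE + threadIdx.x;
          int b_row = t * TILE + threadIdx.y;
          As[threadIdx.y][threadIdx.x] = (row < M && a_col < K) ? A[row * K + a_col] : 0.0f;
          Bs[threadIdx.y][threadIdx.x] = (b_row < K && col < N) ? B[b_row * N + col] : 0.0f;
          __syncthreads();                              // tile fully loaded

          for (int k = 0; k < TILE; ++k)
              acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
          __syncthreads();                              // done with this tile
      }
      if (row < M && col < N)
          C[row * N + col] = acc;
  }

  // launch with blockDim = (TILE, TILE) and
  // gridDim = ((N + TILE - 1) / TILE, (M + TILE - 1) / TILE)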