Show HN: Run 30B model in 4GB Active Memory
The result? We're seeing up to 5× faster MLP-layer performance in transformers, with roughly 50% lower memory consumption, by skipping the sleeping (inactive) neurons on every token prediction. For Llama 3.2, the feed-forward layers account for about 30% of total weights and forward-pass computation, which translates into a 1.6-1.8× increase in throughput:
Sparse LLaMA 3.2 3B vs. LLaMA 3.2 3B (HuggingFace implementation):
- Time to First Token (TTFT): 1.51× faster (1.209 s → 0.803 s)
- Output Generation Speed: 1.79× faster (0.7 → 1.2 tokens/sec)
- Total Throughput: 1.78× faster (0.7 → 1.3 tokens/sec)
- Memory Usage: 26.4% reduction (6.125 GB → 4.15 GB)
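
If you're wondering what "skipping the sleeping neurons" means in practice, here is a minimal Python sketch of the idea (not the repo's actual kernels): a small predictor guesses which intermediate neurons will fire for the current token, and the SwiGLU MLP is evaluated only over that slice. The names (gate_proj, up_proj, down_proj, predictor, top_frac) are my assumptions, following common LLaMA-style conventions.

    import torch
    import torch.nn.functional as F

    def sparse_mlp_forward(x, gate_proj, up_proj, down_proj, predictor, top_frac=0.1):
        # x: [1, hidden] activation for the current token.
        # Predict which intermediate neurons are likely to fire for this token.
        scores = predictor(x)                              # [1, intermediate]
        k = max(1, int(top_frac * scores.shape[-1]))
        idx = scores.topk(k, dim=-1).indices.squeeze(0)    # indices of "awake" neurons

        # Slice out only the active rows/columns of the MLP weight matrices.
        W_gate = gate_proj.weight[idx]                     # [k, hidden]
        W_up   = up_proj.weight[idx]                       # [k, hidden]
        W_down = down_proj.weight[:, idx]                  # [hidden, k]

        # Standard LLaMA-style SwiGLU MLP, restricted to the active slice.
        h = F.silu(x @ W_gate.T) * (x @ W_up.T)            # [1, k]
        return h @ W_down.T                                # [1, hidden]

The dense path would compute all of the intermediate neurons; here only the top fraction predicted to be active is ever touched, which is where the compute and memory savings come from.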
The operator kernels with differential weight caching are open-sourced at github.com/NimbleEdge/sparse_transformers. Let's get LLMs sprinting!
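
And here is a hedged sketch of how I'd describe "differential weight caching" in Python (the repo implements this at the kernel level; the class and method names below are mine, not its API). The idea, as I read it, is that consecutive tokens activate overlapping neuron sets, so the cache only moves the rows whose active/inactive status changed rather than re-gathering everything per token.

    import torch

    class DifferentialWeightCache:
        """Keep only the currently active rows of a large weight matrix in fast
        memory, updating the cache by the difference in active sets between tokens."""

        def __init__(self, full_weight, device="cuda"):
            self.full_weight = full_weight      # full matrix stays in slow memory (CPU RAM)
            self.device = device
            self.rows = {}                      # neuron index -> cached row on `device`

        def fetch(self, active_idx):
            new = set(active_idx.tolist())
            old = set(self.rows.keys())
            # Copy in only the rows that just became active...
            for i in new - old:
                self.rows[i] = self.full_weight[i].to(self.device, non_blocking=True)
            # ...and drop the rows that went back to sleep.
            for i in old - new:
                del self.rows[i]
            return torch.stack([self.rows[i] for i in active_idx.tolist()])

Under this reading, the per-token transfer cost scales with how much the active set changes between tokens rather than with its full size, which is what keeps the resident working set small.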