Templar AI has developed SparseLoCo, a distributed training algorithm that achieves extreme compression (TOP-k sparsification keeping only 1-3% of entries, plus 2-bit quantization) while outperforming existing methods like DiLoCo and DeMo on both loss and communication efficiency.
The Core Problem
Training LLMs across data centers or over the internet is bottlenecked by communication: as model scale grows, each synchronization can require transferring hundreds of gigabytes of pseudo-gradients. DiLoCo reduces the frequency of synchronizations, but the communication remains dense and large. This makes distributed training impractical for many scenarios, especially internet-scale collaboration.
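For a sense of scale (assuming uncompressed 32-bit pseudo-gradients), a 70B-parameter model implies roughly 70 × 10⁹ × 4 bytes ≈ 280 GB per replica per synchronization, and even an 8B model is about 32 GB, which is far beyond what commodity internet links can move quickly.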
Technical Approach
Our key insight: DiLoCo's infrequent communication can be aggressively compressed with TOP-k sparsification, and doing so can actually improve performance.
Algorithm highlights (a minimal sketch of one communication round follows the list):
* Replace global momentum with per-replica error feedback
* Apply TOP-k magnitude compression (1-3% density) + 2-bit quantization to pseudo-gradients
* Maintain infrequent communication (H=15-250 steps) like DiLoCo
* Use chunked TOP-k for better parallelism and reduced index overhead
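To make the flow concrete, here is a minimal sketch of what one replica does at a communication round, assuming PyTorch. The names (outer_step, two_bit_quantize, error_buffer, density) and the simple mean-magnitude 2-bit quantizer are illustrative assumptions, not Templar's actual implementation.

```python
import torch

def two_bit_quantize(values: torch.Tensor):
    """Toy 2-bit quantizer: snap values to 4 levels scaled by the tensor's
    mean magnitude. A stand-in for the codec described in the paper."""
    scale = values.abs().mean() + 1e-12
    levels = torch.tensor([-1.5, -0.5, 0.5, 1.5], device=values.device) * scale
    boundaries = (levels[:-1] + levels[1:]) / 2
    codes = torch.bucketize(values, boundaries)      # 2-bit codes in {0, 1, 2, 3}
    return levels[codes], codes

def outer_step(global_params, local_params, error_buffer, density=0.02):
    """One communication round for a single replica (sketch, flat 1-D tensors).

    Returns the sparse, quantized message this replica would transmit and the
    updated error-feedback buffer it keeps locally.
    """
    # Pseudo-gradient accumulated over the H inner optimization steps.
    delta = global_params - local_params

    # Error feedback: re-inject whatever previous rounds failed to transmit.
    corrected = delta + error_buffer

    # TOP-k by magnitude at 1-3% density, then 2-bit quantization.
    k = max(1, int(density * corrected.numel()))
    _, top_idx = torch.topk(corrected.abs(), k)
    quantized, codes = two_bit_quantize(corrected[top_idx])

    # Locally reconstruct what was actually sent; the remainder (including the
    # quantization error) stays in the buffer and acts like a slow momentum.
    sent = torch.zeros_like(corrected)
    sent[top_idx] = quantized
    new_error_buffer = corrected - sent

    return top_idx, codes, new_error_buffer
```

The sparse messages from all replicas would then be aggregated and applied as the outer update, in the same place DiLoCo's dense all-reduce would normally sit.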
Results
Communication reduction: With >97× compression, SparseLoCo outperforms DiLoCo across all benchmarks. Sparse aggregation appears to provide regularization benefits beyond just compression.
Communication infrequency: Consistently outperforms DiLoCo across communication periods H ∈ {15, 30, 50, 100, 250} on 512M-parameter models.
Real deployment: Currently running on Bittensor with a 70B-parameter model and 20 participants in the gather operation (out of many more total participants), communication takes roughly 70 seconds at under 500 Mbps of bandwidth. A previous successful deployment, a medium-sized run of an 8B-parameter model over 200B tokens with 20 gather participants, averaged 12 seconds of communication against 4.5 minutes of compute time.
Key Technical Contributions
1. Local momentum approximation: We show that DiLoCo's global outer momentum can be well approximated by local accumulators (>90% cosine similarity)
2. Error feedback as momentum: We demonstrate that TOP-k + error feedback naturally provides benefits similar to outer momentum (the update is written out after this list)
3. Sparse aggregation benefits: We find that sparse aggregation actually improves performance over dense methods, likely because it emphasizes high-saliency components
4. Extreme quantization: Error feedback enables 2-bit quantization without additional accumulators or performance drops
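To make contributions 1 and 2 concrete, the per-replica update behind error feedback can be written as follows (our notation, not the paper's), where Δ_i is replica i's pseudo-gradient, e_i its residual buffer, and C(·) the TOP-k + 2-bit compressor:

    a_i ← e_i + Δ_i        (fold the residual into the new pseudo-gradient)
    m_i ← C(a_i)           (transmit only the compressed part)
    e_i ← a_i − m_i        (keep everything that was not transmitted)

Coordinates too small to be selected keep accumulating in e_i until they cross the TOP-k threshold, so the buffer integrates the pseudo-gradient over rounds much like a momentum term, which is why dropping DiLoCo's explicit outer momentum costs little.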
Implementation Details
* Chunked TOP-k (4096 elements/chunk) reduces index transmission overhead (a sketch follows this list)
* Custom index compression: 8.9, 6.6, 5.6 bits per value for different sparsity levels
* Drop-in replacement for DiLoCo all-reduce operations
* Compatible with existing distributed training frameworks
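As a rough illustration of the chunked selection (the function name, the uniform per-chunk budget, and the padding scheme are our assumptions, not the production code):

```python
import torch
import torch.nn.functional as F

def chunked_topk(x: torch.Tensor, density: float = 0.01, chunk_size: int = 4096):
    """Select the largest-magnitude entries independently within each chunk.

    With a fixed 4096-element chunk, a local index fits in log2(4096) = 12 bits
    before any entropy coding, instead of log2(numel) bits for a global TOP-k,
    and every chunk can be processed in parallel.
    """
    n = x.numel()
    pad = (-n) % chunk_size
    flat = F.pad(x.flatten(), (0, pad))            # pad to a multiple of the chunk size
    chunks = flat.view(-1, chunk_size)             # shape: (num_chunks, 4096)

    k = max(1, int(density * chunk_size))          # uniform per-chunk budget (our assumption)
    _, local_idx = torch.topk(chunks.abs(), k, dim=1)
    values = torch.gather(chunks, 1, local_idx)    # signed values to be 2-bit quantized

    # The receiver reconstructs global positions as chunk_id * chunk_size + local_idx.
    return local_idx, values
```

Entropy-coding those 12-bit local indices is presumably where the 8.9/6.6/5.6 bits-per-value figures above come from; that codec is not reproduced here.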
Limitations & Future Work
* Tested on 512M-parameter models (though deployed at 8B-70B scale)
* Chunk size optimization could be further explored
* Random-k performs significantly worse than TOP-k
This work makes distributed training viable over commodity internet connections and opens possibilities for global AI training collaborations that bandwidth constraints previously ruled out.