Ask HN: How are you managing LLM inference at the edge?

4 points by gray_amps on 5/8/2025, 5:06:08 PM · 1 comment
I’m building a system to run small LLMs on-device (mobile, IoT, on-prem servers) and would love to hear how others have tackled the challenges.

Context:

Use cases: offline chatbots, smart cameras, local data privacy

Models: 7–13B parameter quantized models (e.g. Llama 2, Vicuna)

Constraints: limited RAM/flash, CPU-only or tiny GPU, intermittent connectivity

Questions:

What runtimes or frameworks are you using (ONNX Runtime, TVM, custom C++)?

How do you handle model loading, eviction, and batching under tight memory?

Any clever tricks for quantization, pruning, or kernel fusion that boost performance?

How do you monitor and update models securely in the field?

Looking forward to your benchmarks, war stories, and code pointers!

Comments (1)

byte-bolter · 5h ago
I’m using ONNX Runtime with 4-bit quantization on a Raspberry Pi 4. I preload the quantized model into shared memory so multiple processes can reuse it, and I evict old sessions by LRU when I hit a 1 GB RAM cap. For batching, I accumulate inputs over 50 ms to boost throughput without hurting latency. So far I get ~15 RPS on a 7B Llama 2 model.
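
For anyone who wants a concrete shape for the 50 ms accumulation window and LRU eviction described above, here is a minimal sketch. It is not byte-bolter's actual code: SessionCache, MicroBatcher, and the loader/size_of/run_batch callables are hypothetical stand-ins for whatever loads an ONNX Runtime session and runs inference on a batch.

    # Sketch of 50 ms micro-batching + LRU session eviction under a RAM cap.
    # All names here are illustrative, not from the commenter's setup.
    import collections
    import queue
    import threading
    import time

    BATCH_WINDOW_S = 0.05      # accumulate requests for 50 ms before running inference
    RAM_CAP_BYTES = 1 << 30    # ~1 GB budget for loaded sessions

    class SessionCache:
        """Keeps loaded model sessions under a RAM cap, evicting least-recently-used."""
        def __init__(self, loader, size_of):
            self._loader = loader          # callable: model_id -> session object
            self._size_of = size_of        # callable: session -> approximate bytes
            self._sessions = collections.OrderedDict()
            self._bytes = 0
            self._lock = threading.Lock()

        def get(self, model_id):
            with self._lock:
                if model_id in self._sessions:
                    self._sessions.move_to_end(model_id)   # mark as recently used
                    return self._sessions[model_id]
                session = self._loader(model_id)
                self._bytes += self._size_of(session)
                self._sessions[model_id] = session
                while self._bytes > RAM_CAP_BYTES and len(self._sessions) > 1:
                    _, evicted = self._sessions.popitem(last=False)  # drop LRU session
                    self._bytes -= self._size_of(evicted)
                return session

    class MicroBatcher:
        """Collects requests for BATCH_WINDOW_S, then runs them as one batch."""
        def __init__(self, run_batch):
            self._run_batch = run_batch    # callable: list of inputs -> list of outputs
            self._pending = []             # list of (input, result_queue) pairs
            self._lock = threading.Lock()
            threading.Thread(target=self._loop, daemon=True).start()

        def submit(self, item):
            done = queue.Queue(maxsize=1)
            with self._lock:
                self._pending.append((item, done))
            return done.get()              # block until the batch containing item runs

        def _loop(self):
            while True:
                time.sleep(BATCH_WINDOW_S)
                with self._lock:
                    batch, self._pending = self._pending, []
                if not batch:
                    continue
                outputs = self._run_batch([item for item, _ in batch])
                for (_, done), out in zip(batch, outputs):
                    done.put(out)

The trade-off is the one the comment points at: a fixed 50 ms window adds at most ~50 ms of queueing latency per request, but lets the runtime amortize one forward pass over several inputs, which matters a lot on CPU-only boards.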