Ask HN: How are you managing LLM inference at the edge?

4 points by gray_amps on 5/8/2025, 5:06:08 PM · 1 comment
I’m building a system to run small LLMs on-device (mobile, IoT, on-prem servers) and would love to hear how others have tackled the challenges.

Context:

Use cases: offline chatbots, smart cameras, local data privacy

Models: 7–13B parameter quantized models (e.g. Llama 2, Vicuna)

Constraints: limited RAM/flash, CPU-only or tiny GPU, intermittent connectivity

Questions:

What runtimes or frameworks are you using (ONNX Runtime, TVM, custom C++)?

How do you handle model loading, eviction, and batching under tight memory?

Any clever tricks for quantization, pruning, or kernel fusion that boost performance?

How do you monitor and update models securely in the field?

Looking forward to your benchmarks, war stories, and code pointers!

Comments (1)

byte-bolter · 5h ago
I’m using ONNX Runtime with 4-bit quantization on a Raspberry Pi 4. I preload the quantized model into shared memory so multiple processes can reuse it, and I evict old sessions by LRU when I hit a 1 GB RAM cap. For batching, I accumulate inputs over 50 ms to boost throughput without hurting latency. So far I get ~15 RPS on a 7B Llama 2 model.
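
For anyone who wants a concrete shape for the 50 ms accumulation window and LRU eviction described above, here is a minimal sketch. It is not byte-bolter's actual code: SessionCache, MicroBatcher, and the loader/size_of/run_batch callables are hypothetical stand-ins for whatever loads an ONNX Runtime session and runs inference on a batch.

    # Sketch of 50 ms micro-batching + LRU session eviction under a RAM cap.
    # All names here are illustrative, not from the commenter's setup.
    import collections
    import queue
    import threading
    import time

    BATCH_WINDOW_S = 0.05      # accumulate requests for 50 ms before running inference
    RAM_CAP_BYTES = 1 << 30    # ~1 GB budget for loaded sessions

    class SessionCache:
        """Keeps loaded model sessions under a RAM cap, evicting least-recently-used."""
        def __init__(self, loader, size_of):
            self._loader = loader          # callable: model_id -> session object
            self._size_of = size_of        # callable: session -> approximate bytes
            self._sessions = collections.OrderedDict()
            self._bytes = 0
            self._lock = threading.Lock()

        def get(self, model_id):
            with self._lock:
                if model_id in self._sessions:
                    self._sessions.move_to_end(model_id)   # mark as recently used
                    return self._sessions[model_id]
                session = self._loader(model_id)
                self._bytes += self._size_of(session)
                self._sessions[model_id] = session
                while self._bytes > RAM_CAP_BYTES and len(self._sessions) > 1:
                    _, evicted = self._sessions.popitem(last=False)  # drop LRU session
                    self._bytes -= self._size_of(evicted)
                return session

    class MicroBatcher:
        """Collects requests for BATCH_WINDOW_S, then runs them as one batch."""
        def __init__(self, run_batch):
            self._run_batch = run_batch    # callable: list of inputs -> list of outputs
            self._pending = []             # list of (input, result_queue) pairs
            self._lock = threading.Lock()
            threading.Thread(target=self._loop, daemon=True).start()

        def submit(self, item):
            done = queue.Queue(maxsize=1)
            with self._lock:
                self._pending.append((item, done))
            return done.get()              # block until the batch containing item runs

        def _loop(self):
            while True:
                time.sleep(BATCH_WINDOW_S)
                with self._lock:
                    batch, self._pending = self._pending, []
                if not batch:
                    continue
                outputs = self._run_batch([item for item, _ in batch])
                for (_, done), out in zip(batch, outputs):
                    done.put(out)

The trade-off is the one the comment points at: a fixed 50 ms window adds at most ~50 ms of queueing latency per request, but lets the runtime amortize one forward pass over several inputs, which matters a lot on CPU-only boards.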