Show HN: InferX – an AI-native OS for running 50 LLMs per GPU with hot swapping
3 points by pveldandi · 2 comments · 4/17/2025, 2:51:53 PM
Hey folks, we’ve been building InferX, an AI-native runtime that snapshots the full GPU execution state of an LLM (weights, KV cache, CUDA context) and restores it in under 2 seconds. This lets us hot-swap models like threads: no reloading, no cold starts.
We treat each model as a lightweight, resumable process, like an OS for LLM inference.
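To make the process analogy concrete, here's a rough sketch in Python. GPU state is simulated with plain objects, and the names (Snapshot, ModelProcess, Scheduler) are made up for illustration, not our actual API. The point is that a swap restores a saved snapshot instead of reloading weights and rebuilding the KV cache:

    # Toy sketch only: GPU state is simulated in memory, all names are hypothetical.
    from dataclasses import dataclass

    @dataclass
    class Snapshot:
        weights: bytes       # serialized weight pages
        kv_cache: bytes      # live KV cache, so context survives a swap
        cuda_context: bytes  # captured CUDA/allocator state

    class ModelProcess:
        """A loaded model treated like a suspendable OS process."""
        def __init__(self, model_id: str) -> None:
            self.model_id = model_id
            self.on_gpu = False
            self.snapshot = Snapshot(b"", b"", b"")

        def suspend(self) -> Snapshot:
            # Capture full execution state and release GPU memory.
            self.on_gpu = False
            return self.snapshot

        def resume(self, snapshot: Snapshot) -> None:
            # Restore state directly onto the GPU: no weight load, no cold start.
            self.snapshot = snapshot
            self.on_gpu = True

    class Scheduler:
        """Keeps a bounded set of models resident and hot-swaps the rest."""
        def __init__(self, max_resident: int) -> None:
            self.max_resident = max_resident
            self.procs: dict[str, ModelProcess] = {}

        def run(self, model_id: str, prompt: str) -> str:
            proc = self.procs.setdefault(model_id, ModelProcess(model_id))
            if not proc.on_gpu:
                self._evict_if_full()
                proc.resume(proc.snapshot)   # the "under 2s" restore, in spirit
            return f"[{model_id}] response to: {prompt}"

        def _evict_if_full(self) -> None:
            resident = [p for p in self.procs.values() if p.on_gpu]
            if len(resident) >= self.max_resident:
                resident[0].suspend()        # simplified eviction policy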
Why it matters:
• Run 50+ LLMs per GPU (7B–13B range)
• 90% GPU utilization (vs ~30–40% with conventional setups)
• Avoids cold starts by snapshotting and restoring directly on GPU
• Designed for agentic workflows, toolchains, and multi-tenant use cases
• Helpful for Codex CLI-style orchestration or bursty multi-model apps (see the usage sketch after this list)
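For the orchestration case, usage of the toy Scheduler sketched above (again, purely illustrative model names) would look like an agent loop fanning out across specialist models while the runtime swaps snapshots underneath:

    # Hypothetical usage: several specialists share one GPU, swapped on demand.
    sched = Scheduler(max_resident=4)
    plan = sched.run("planner-7b", "break the task into steps")
    patch = sched.run("coder-13b", "draft a patch for step 1")
    note = sched.run("summarizer-7b", "summarize the resulting diff")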
Still early, but we’re seeing strong interest from builders and infra folks. Would love thoughts, feedback, or edge cases you’d want to see tested.
Demo: https://inferx.net
X: @InferXai
Comments (2)
sauravt · 12d ago
Very interesting. How would memory (or previous chat context awareness) work in the case of hot swapping, when multiple users are hot-swapping models like threads?
precompute · 12d ago
Wow, that's really cool!