Show HN: FlexLLama – Run multiple local LLMs at once with a simple dashboard
After playing around with local AI setups for a while, I kept getting annoyed at having to juggle a separate llama.cpp server for each model. Switching between them was a pain, and I always had to restart things just to load a new model.
So I ended up building something to fix that. It's called FlexLLama - https://github.com/yazon/flexllama
Basically, it's a tool that lets you run multiple llama.cpp instances easily, spread across your CPU and GPUs if you've got them. Everything sits behind a single OpenAI-compatible API.
You can run chat models, embeddings, and rerankers all at once, and the models assigned to each runner are (re)loaded on the fly.
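To give a flavor of the single-endpoint part, here's a minimal sketch of talking to it with the openai Python client. The port and model names below are assumptions from my own setup, not anything baked into FlexLLama; use whatever you've configured:

    # Minimal sketch: one OpenAI-compatible endpoint, multiple model types.
    # Assumptions: FlexLLama listens on http://localhost:8080, and models
    # named "qwen2.5-7b-instruct" and "nomic-embed-text" exist in your config.
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

    # Chat completion against one runner...
    chat = client.chat.completions.create(
        model="qwen2.5-7b-instruct",  # hypothetical chat model name
        messages=[{"role": "user", "content": "Say hi in five words."}],
    )
    print(chat.choices[0].message.content)

    # ...and embeddings against another, through the same endpoint.
    emb = client.embeddings.create(
        model="nomic-embed-text",  # hypothetical embedding model name
        input=["local models, one API"],
    )
    print(len(emb.data[0].embedding))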
There's a little web dashboard to monitor and manage runners.
It's super easy to get started: just pip install from the repo, or grab the Docker image for a speedy setup.
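Concretely, both paths look roughly like this (the Docker bits are from memory, so double-check the README for the exact image or compose setup):

    # Install from the repo:
    git clone https://github.com/yazon/flexllama.git
    cd flexllama
    pip install .

    # Or run it with Docker (compose file assumed, verify against the repo):
    docker compose up -d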
I've been using it myself with things like OpenWebUI and some VS Code extensions (Roo Code, Cline, Continue.dev), and it works flawlessly.
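For the VS Code side, it's just a matter of pointing the extension's OpenAI-compatible provider at FlexLLama. As a sketch, a Continue.dev config.json entry might look like this (field names per Continue's docs; the port and model name are whatever is in your FlexLLama config):

    {
      "models": [
        {
          "title": "FlexLLama local",
          "provider": "openai",
          "model": "qwen2.5-7b-instruct",
          "apiBase": "http://localhost:8080/v1",
          "apiKey": "not-needed"
        }
      ]
    }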