Quick demo - working VSCode + local AI in 30 seconds:

  curl -L https://github.com/Michael-A-Kuykendall/shimmy/releases/late...
  ./shimmy serve
  # Point VSCode/Cursor to localhost:11435

The technical achievement: Got it down to 5.1MB by stripping everything
except pure inference. Written in Rust, uses llama.cpp's engine.
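For the Rust-curious, most of the size win comes from the usual release-profile knobs for small binaries. A generic sketch (standard Cargo settings, not necessarily shimmy's exact profile):

  # Typical size-focused release build: optimize for size, enable LTO,
  # use a single codegen unit, strip symbols. All plain Cargo options.
  CARGO_PROFILE_RELEASE_OPT_LEVEL="z" \
  CARGO_PROFILE_RELEASE_LTO="true" \
  CARGO_PROFILE_RELEASE_CODEGEN_UNITS="1" \
  CARGO_PROFILE_RELEASE_STRIP="symbols" \
  cargo build --release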
One feature I'm excited about: You can use LoRA adapters directly without
converting them. Just point to your .gguf base model and .gguf LoRA -
it handles the merge at runtime. Makes iterating on fine-tuned models
much faster since there's no conversion step.
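For comparison, this is conceptually the same base-model-plus-adapter mechanism that llama.cpp itself exposes through its --lora flag - the flags below are llama-server's, not shimmy's, shown only to illustrate the idea:

  # llama.cpp's own server, loading a GGUF base model plus a GGUF LoRA adapter.
  llama-server -m base-model.gguf --lora my-adapter.gguf --port 8080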
Your data never leaves your machine. No telemetry. No accounts. Just a
tiny binary that makes GGUF models work with your AI coding tools.
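For a quick smoke test from the command line, something like this should work (a sketch: the route shown is the standard OpenAI-style /v1/chat/completions, and "my-model" is a placeholder for whatever model name gets picked up):

  # Plain OpenAI-style chat completion request against the local server.
  curl -s http://localhost:11435/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "my-model", "messages": [{"role": "user", "content": "Hello"}]}'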
Would love feedback on the auto-discovery feature - it finds your models
automatically so you don't need any configuration.
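One way to sanity-check what was discovered before pointing an editor at it (this assumes the standard OpenAI-style model-listing route is exposed):

  # List the models the server knows about.
  curl -s http://localhost:11435/v1/models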
What's your local LLM setup? Are you using LoRA adapters for anything specific?
carlos_rpn · 1d ago
You may have noticed already, but the link to the binary is throwing a 404.
MKuykendall · 1d ago
This should be fixed now!
stupidgeek314 · 18h ago
Windows Defender tripped on this for me, flagging it as the Bearfoos trojan. Most likely a false positive, but jfyi.
MKuykendall · 8h ago
Try cargo install, or add an exclusion for it intentionally - unsigned Rust binaries will trigger this.
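The from-source route is roughly this (assuming a Rust toolchain, and assuming the crate name is "shimmy"):

  # Build locally with cargo; a locally built binary usually avoids the
  # unsigned-binary heuristics. Crate name "shimmy" is assumed here.
  cargo install shimmy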
homarp · 1d ago
Nice, a rust tool wrapping llama.cpp
how does it differ from llama-server?
and from llama-swap?
MKuykendall · 1d ago
Shimmy is designed to be "invisible infrastructure" - the simplest possible way to get local inference working with your existing AI tools. llama-server gives you more control; llama-swap gives you multi-model management.
Key differences:
- Architecture: llama-swap = proxy + multiple servers, Shimmy = single server
- Resource usage: llama-swap runs multiple processes, Shimmy = one 50MB process
- Use case: llama-swap for managing many models, Shimmy for simplicity
MKuykendall · 1d ago
Shimmy is for when you want the absolute minimum footprint - CI/CD pipelines, quick local testing, or systems where you can't install 680MB of dependencies.
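A rough sketch of the CI case - the release URL is whatever you grab from the GitHub releases page (elided here as $SHIMMY_RELEASE_URL), and the request again assumes the OpenAI-style routes:

  # Download a release binary, start the server, run one request, shut down.
  curl -L -o shimmy "$SHIMMY_RELEASE_URL" && chmod +x shimmy
  ./shimmy serve &
  SERVER_PID=$!
  sleep 2   # give the server a moment to come up
  curl -s http://localhost:11435/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "my-model", "messages": [{"role": "user", "content": "smoke test"}]}'
  kill "$SERVER_PID"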
cat-turner · 10h ago
looks cool, ty! really great project, will try this out.