Show HN: LoFT CLI – Fine-tune and run LLMs (1–3B) on an 8 GB MacBook Air, no GPU
I built *LoFT*, a lightweight CLI that turns any 8 GB laptop into a tiny LLM training and inference rig — no GPU, no cloud.
5 commands:
1. `loft finetune` → Train LoRA adapters on CPU
2. `loft merge` → Merge adapters into the model
3. `loft export` → Convert to GGUF (FP16)
4. `loft quantize` → Apply Q4_0 (4-bit) quantization
5. `loft chat` → llama.cpp CPU chat @ ~7 tok/s
Benchmarks on an 8 GB MacBook Air:

| Step      | Time / Speed        | Peak RAM |
|-----------|---------------------|----------|
| Finetune  | 23 min (sample run) | 308 MB   |
| Merge     | 4.7 min             | 322 MB   |
| Quantize  | 21 sec              | 322 MB   |
| Inference | 6.9 tok/s           | 322 MB   |
I also ran a full 300-row Dolly finetune (2 epochs) in *~1.5 hours*, reaching *sub-1 training loss* on a CPU-only setup. No crashes, no swap kills, no GPU needed.
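For context, here is a minimal sketch of what a CPU-only LoRA pass over a small Dolly slice can look like with Hugging Face transformers + peft. This illustrates the general technique, not LoFT's actual internals; the base model, hyperparameters, and prompt format are assumptions:

```python
# Illustrative CPU-only LoRA fine-tune on a 300-row Dolly slice (not LoFT's code).
# Assumptions: TinyLlama base model, r=8 adapters on q/v projections, 2 epochs.
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"   # assumed base model
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)  # runs on CPU if no GPU is present

# Wrap the base model with small LoRA adapters; only the adapter weights are trained.
model = get_peft_model(model, LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05,
                                         target_modules=["q_proj", "v_proj"],
                                         task_type="CAUSAL_LM"))

# 300-row Dolly slice, formatted as plain instruction/response text.
data = load_dataset("databricks/databricks-dolly-15k", split="train[:300]")

def to_features(row):
    text = f"### Instruction:\n{row['instruction']}\n\n### Response:\n{row['response']}"
    return tokenizer(text, truncation=True, max_length=512)

data = data.map(to_features, remove_columns=data.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="loft-adapter", num_train_epochs=2,
                           per_device_train_batch_size=1, gradient_accumulation_steps=8,
                           learning_rate=2e-4, logging_steps=10),
    train_dataset=data,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
model.save_pretrained("loft-adapter")   # saves only the LoRA adapter weights
```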
Why this matters:
- Makes local LLM customization accessible to devs without GPU access
- Enables domain-specific agents (summarizer, support bot, Q&A) on commodity laptops
- Everything runs on the CPU (no CUDA, no cloud)
Would love feedback on:
- UX improvements or edge cases
- Adapter recipes you’d want (legal, summarization, customer support, etc.)
- Cool things you’d build with low-RAM LLMs
MIT-licensed, 100% local. Feedback is very welcome.
– Diptanshu
One thing that surprised me: on an 8 GB M2 Air, peak RAM never exceeded 330 MB during a full 300-sample finetune (2 epochs), thanks to gradient checkpointing, which reduces memory usage by recomputing activations instead of storing them.
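If you want to reproduce that effect outside LoFT, gradient checkpointing is a one-line toggle in Hugging Face transformers. A minimal sketch (the model name is just an example, and this is not necessarily how LoFT wires it up):

```python
# Illustrative only: enabling gradient checkpointing on a Hugging Face causal LM.
# Activations are recomputed during the backward pass instead of being stored,
# trading extra compute for a much smaller peak-memory footprint.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")
model.gradient_checkpointing_enable()   # recompute activations during backward
model.config.use_cache = False          # the KV cache is incompatible with checkpointing
```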
If anyone tries LoFT on Windows or Linux, I’d love to hear your first-token latency with `loft chat`. On macOS I see ~145 ms/token with TinyLlama + GGUF.
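For anyone comparing numbers, here is one generic way to time the first token against a quantized GGUF file using llama-cpp-python. This is a hedged sketch for benchmarking, not how `loft chat` reports its numbers, and the model path is a placeholder for your own Q4_0 export:

```python
# Rough time-to-first-token measurement against a quantized GGUF model.
# Uses llama-cpp-python; replace the model path with your own Q4_0 export.
import time
from llama_cpp import Llama

llm = Llama(model_path="tinyllama-q4_0.gguf", n_ctx=2048, verbose=False)

start = time.perf_counter()
first_token_at = None
for chunk in llm("Summarize LoRA in one sentence.", max_tokens=64, stream=True):
    if first_token_at is None:
        first_token_at = time.perf_counter()   # wall-clock time when the first token arrived

print(f"first token: {(first_token_at - start) * 1000:.0f} ms")
```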