Teaching a small embedding model with LLMs to deliver GPT-like semantics in 10ms

Submitted by mattmmm on 9/8/2025, 5:35:40 PM · instantdomainsearch.com

Comments (1)

mattmmm · 13h ago
Hey folks! I’m Matt, CEO at Instant Domain Search. Quick summary: we distilled LLM judgments into a 22.7M-parameter embedding model and optimized CPU inference to deliver sub-10ms latency for semantic domain matches (correlation ≈0.87 with GPT-4).

The post walks through our training signal, distillation choices, quantization, index layout, and what we learned about latency and CPU usage in production.
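On the quantization piece: the simplest version (a guess at the general technique, not the post's exact scheme) is symmetric per-vector int8 quantization, which lets you score candidates with cheap integer dot products on CPU.

```python
import numpy as np

def quantize_int8(emb: np.ndarray):
    """Symmetric per-vector int8 quantization: store one float scale
    per embedding plus an int8 array (4x smaller than float32)."""
    scale = float(np.abs(emb).max()) / 127.0  # assumes a nonzero vector
    q = np.round(emb / scale).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an approximate float32 embedding."""
    return q.astype(np.float32) * scale

def int8_dot(q1, s1, q2, s2) -> float:
    """Approximate float dot product from int8 codes: accumulate in
    int32, then rescale by the two per-vector scales."""
    return int(q1.astype(np.int32) @ q2.astype(np.int32)) * s1 * s2

emb = np.linspace(-1.0, 1.0, 16).astype(np.float32)
q, s = quantize_int8(emb)
approx = dequantize(q, s)
```

The per-element error is bounded by half the scale step, which is usually negligible for ranking.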

We’re a small team of 4 engineers building free, wicked fast search tools. AMA or feedback welcome!