Ask HN: How to increase LLM inference speed?
1 point by InkCanon on 6/15/2025, 10:08:28 AM
Hi HN,
I'm building software with a very tight feedback loop with the user. One part involves a short (a few hundred tokens) response from an LLM, and this is by far the biggest UX problem: DeepSeek's total response time can currently reach 10 seconds, which is horrific. Is it practical to get the latency down to roughly 2 seconds? The task is just to rephrase a short text while preserving its meaning, so the model doesn't need to be SOTA; faster inference matters far more than output quality here.
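For context, here is a minimal sketch of the kind of call I'm timing (not my exact code). It assumes DeepSeek's OpenAI-compatible chat endpoint; the API key, model choice, system prompt, and user text are placeholders:

    # Sketch of the call being timed; key, model, and prompts are placeholders.
    # DeepSeek exposes an OpenAI-compatible chat API at api.deepseek.com.
    import time
    from openai import OpenAI

    client = OpenAI(
        api_key="sk-...",  # placeholder
        base_url="https://api.deepseek.com",
    )

    start = time.perf_counter()
    resp = client.chat.completions.create(
        model="deepseek-chat",
        messages=[
            {"role": "system",
             "content": "Rephrase the user's text while preserving its meaning."},
            {"role": "user", "content": "<a few hundred tokens of text>"},
        ],
        max_tokens=512,  # the response is short, so cap it
    )
    print(f"total latency: {time.perf_counter() - start:.2f}s")
    print(resp.choices[0].message.content)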