Lossless LLM 3x Throughput Increase by LMCache

Comments (2)

lihanc111 · 4h ago

Our team built LMCache, an open source project that reduces repetitive computation in LLM inference so that systems can serve more users (3x more throughput in chat applications). It has been used in IBM's open source LLM inference stack.

In LLM serving, the input is first computed into intermediate states called the KV cache, which are then used to generate answers. This data is relatively large (~1-2GB for long context) and is often evicted when GPU memory runs short. When that happens and a user asks a follow-up question, the software has to recompute the same KV cache. LMCache is designed to combat that by efficiently offloading KV caches to DRAM and disk and loading them back when needed.
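
To make the offload-instead-of-evict idea concrete, here is a minimal toy sketch in Python. It is not LMCache's API: the names `TieredKVStore`, `prefix_key`, and `compute_kv` are hypothetical, the "GPU" and "DRAM" tiers are plain dictionaries, and the KV cache is any opaque Python object.

```python
from collections import OrderedDict
import hashlib


def prefix_key(token_ids):
    """Hash a token prefix so identical prefixes map to the same entry."""
    return hashlib.sha256(repr(tuple(token_ids)).encode()).hexdigest()


class TieredKVStore:
    """Toy two-tier store: a small 'GPU' tier backed by a 'DRAM' tier.

    Hypothetical illustration only; real systems move tensors, not
    Python objects, and also bound the size of the slower tier.
    """

    def __init__(self, gpu_capacity):
        self.gpu_capacity = gpu_capacity
        self.gpu = OrderedDict()  # LRU order: least recently used first
        self.dram = {}            # offload target, unbounded in this sketch
        self.recomputes = 0       # how many times prefill had to run

    def _put_gpu(self, key, kv):
        self.gpu[key] = kv
        self.gpu.move_to_end(key)
        while len(self.gpu) > self.gpu_capacity:
            old_key, old_kv = self.gpu.popitem(last=False)
            self.dram[old_key] = old_kv  # offload instead of discarding

    def get_kv(self, token_ids, compute_kv):
        key = prefix_key(token_ids)
        if key in self.gpu:
            self.gpu.move_to_end(key)  # refresh LRU position
            return self.gpu[key]
        if key in self.dram:
            kv = self.dram.pop(key)    # load back from the slower tier
        else:
            self.recomputes += 1
            kv = compute_kv(token_ids)  # expensive prefill
        self._put_gpu(key, kv)
        return kv


# Usage: with room for one entry on the "GPU", the first conversation is
# offloaded when the second arrives, then reloaded without recomputation.
store = TieredKVStore(gpu_capacity=1)
fake_prefill = lambda ids: ("kv", len(ids))
store.get_kv([1, 2, 3], fake_prefill)
store.get_kv([4, 5], fake_prefill)
store.get_kv([1, 2, 3], fake_prefill)
```

In this sketch the follow-up request for `[1, 2, 3]` is served from the slower tier, so the prefill function runs only once per distinct prefix.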

Ask us anything!

0xjunhao · 4h ago

Hi, I had a quick question. Would it be correct to say the following?

1. For long inputs and short outputs, inference can be arbitrarily many times faster, as it avoids repeated KV computation.

2. Conversely, for short inputs and long outputs, it might be slightly slower, since loading and storing the KV cache are on the critical path of execution.
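
One rough way to frame both cases is a back-of-envelope latency model: a request costs time to process the input (either recomputing the KV cache or loading it) plus time to decode the output. The sketch below assumes simple per-token rates; the function names and all numbers are made up for illustration and are not measurements from LMCache.

```python
def request_latency(n_in, n_out, decode_tok_s, input_tok_s):
    """Estimated seconds for one request.

    input_tok_s is the rate at which input tokens become usable KV cache:
    the prefill rate when recomputing, or the load rate when reusing.
    """
    return n_in / input_tok_s + n_out / decode_tok_s


def speedup(n_in, n_out, prefill_tok_s, load_tok_s, decode_tok_s):
    """Ratio of recompute latency to cache-load latency (>1 means faster)."""
    recompute = request_latency(n_in, n_out, decode_tok_s, prefill_tok_s)
    reuse = request_latency(n_in, n_out, decode_tok_s, load_tok_s)
    return recompute / reuse


# Assumed rates (illustrative only): prefill 10k tok/s, load 100k tok/s,
# decode 50 tok/s.
RATES = dict(prefill_tok_s=10_000, load_tok_s=100_000, decode_tok_s=50)

long_in_short_out = speedup(n_in=32_000, n_out=50, **RATES)
short_in_long_out = speedup(n_in=100, n_out=2_000, **RATES)
```

Under this model the gain for long inputs is bounded by the ratio of load rate to prefill rate and shrinks as the output grows, while for short inputs and long outputs the ratio stays close to 1. It drops below 1 only if loading is slower than recomputing.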

