We explored a network-attached KV-cache for consumer GPUs to offset their limited VRAM. It doesn’t make RTX cards run giant models efficiently. Still, for workloads that repeatedly reuse lengthy prefixes—such as chatbots, coding assistants, and multi-turn threads—it delivers a 2–4× speedup in RPS and time-to-first-token on 7B and 70B models.
How it works:
On return visits, instead of re-running the prompt through the model, we fetch previously computed KV blocks from network storage and skip re-computing those tokens (i.e., we avoid re-running prefill on repeated prefixes). This helps whenever VRAM can’t hold every active session and users pause between messages, which is almost always the case.
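To make the flow concrete, here’s a minimal sketch of the lookup path. The store interface (`kv_store.get`), the `model.prefill`/`model.decode` calls, and the 256-token block size are illustrative assumptions, not our actual API:

```python
import hashlib

BLOCK_TOKENS = 256  # illustrative block granularity

def block_keys(token_ids):
    """Hash full prefix blocks; each key also covers all preceding blocks."""
    keys, h = [], hashlib.sha256()
    usable = len(token_ids) - len(token_ids) % BLOCK_TOKENS
    for start in range(0, usable, BLOCK_TOKENS):
        h.update(str(token_ids[start:start + BLOCK_TOKENS]).encode())
        keys.append(h.hexdigest())
    return keys

def serve(prompt_ids, kv_store, model):
    # Walk the prefix blocks and keep fetching until the first miss.
    cached = []
    for key in block_keys(prompt_ids):
        block = kv_store.get(key)          # network fetch of a stored KV block
        if block is None:
            break
        cached.append(block)
    reused = len(cached) * BLOCK_TOKENS
    # Prefill only the tokens not covered by reused KV, then decode as usual.
    kv = model.prefill(prompt_ids[reused:], past_kv=cached)
    return model.decode(kv)
```

On a repeat visit with an unchanged system prompt and history, `reused` covers nearly the whole prompt, so prefill shrinks to just the new user message.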
Why RTX benefits:
Prefill is the computationally intensive part (quadratic attention, numerous reductions, and inter-GPU traffic). Without NVLink, PCIe becomes the choke point in multi-GPU setups. KV-caching cuts repeated prefill, leaving mostly the lighter decoding step—something PCIe-only RTX nodes handle well.
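A back-of-envelope comparison shows why skipping repeated prefill matters. Counting only attention FLOPs per layer (QK^T plus the attention-weighted sum over V, ignoring projections and MLPs), prefilling an n-token prompt costs roughly n times as much as generating one token against a cached context; the dimensions below are illustrative 70B-class values, not measured numbers:

```python
def attn_flops_prefill(n_tokens, d_model, n_layers):
    # QK^T (~2*n^2*d) + attention-weighted V (~2*n^2*d) per layer
    return 4 * n_tokens**2 * d_model * n_layers

def attn_flops_decode_step(ctx_len, d_model, n_layers):
    # one new query attends over the cached context: ~4*ctx*d per layer
    return 4 * ctx_len * d_model * n_layers

n, d, layers = 4096, 8192, 80   # rough 70B-class shape, 4k-token prompt
ratio = attn_flops_prefill(n, d, layers) / attn_flops_decode_step(n, d, layers)
print(f"prefill ~ {ratio:.0f}x one decode step")   # ~4096x
```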
Results & endpoint:
- 2–4× speedup on multi-turn benchmarks (RPS & TTFT) with RTX 4090.
- We’ve opened one free public endpoint for demos, not production grade (https://console.cloudrift.ai/inference?modelId=meta-llama%2F...). Ping us at hello@cloudrift.ai if you need a reliable setup.
Technical Notes:
- Works with consumer and data-center GPUs. In theory, you can even split roles: NVLink boxes do prefill, while cheaper RTX pods serve as decoders using stored KV.
- We use special hardware to reduce fetch overhead and take load off the CPU, but you can reproduce this at home with a regular NAS at lower peak performance (see the sketch after these notes).
- For a more in-depth walkthrough of the math and architecture, watch this video from the vendor of the KV-cache solution we use (https://www.youtube.com/watch?si=T69vxku8xPr6p7I0&v=CV4FYMTF...)
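For the at-home variant from the second note, the storage side can be as simple as serializing KV tensors to an NFS mount. This is a hedged sketch, not our production path: the mount point, file layout, and use of torch serialization are assumptions for illustration:

```python
import hashlib
import os
import torch

NAS_DIR = "/mnt/nas/kv-cache"   # hypothetical NFS/SMB mount from a regular NAS

def key_for(prefix_token_ids):
    return hashlib.sha256(str(prefix_token_ids).encode()).hexdigest()

def save_kv(prefix_token_ids, kv_layers):
    """kv_layers: list of (key, value) tensors, one pair per transformer layer."""
    path = os.path.join(NAS_DIR, key_for(prefix_token_ids) + ".pt")
    tmp = path + ".tmp"
    torch.save([(k.cpu(), v.cpu()) for k, v in kv_layers], tmp)
    os.replace(tmp, path)        # atomic rename so readers never see partial files

def load_kv(prefix_token_ids, device="cuda"):
    path = os.path.join(NAS_DIR, key_for(prefix_token_ids) + ".pt")
    if not os.path.exists(path):
        return None              # cache miss: fall back to full prefill
    layers = torch.load(path, map_location="cpu")
    return [(k.to(device, non_blocking=True), v.to(device, non_blocking=True))
            for k, v in layers]
```

The gap versus dedicated hardware is mostly fetch bandwidth and the extra CPU copies on the load path, which is why peak performance is lower but the approach still works.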