Tokasaurus: An LLM Inference Engine for High-Throughput Workloads

113 points by rsehrlich on 6/5/2025, 9:27:07 PM | scalingintelligence.stanford.edu

Comments (9)

refibrillator · 51m ago
The code has few comments but gotta love when you can tell someone was having fun!

https://github.com/ScalingIntelligence/tokasaurus/blob/65efb...

I’m honestly impressed that a pure Python implementation can beat out vLLM and SGLang. Granted, they lean on FlashInfer, and of course torch.compile has gotten incredibly powerful in the last few years. Dynamic shapes have still been a huge thorn in my side, though, so I’ll need to look closer at how they pulled it off…
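
For what it's worth, one common workaround for the dynamic-shape problem (just a sketch of general practice, not necessarily what Tokasaurus does) is to bucket and pad the token count so the compiled graph only ever sees a handful of static shapes:

    import torch

    @torch.compile(dynamic=False)  # static shapes; compiles once per bucket size
    def fused_step(hidden: torch.Tensor, weight: torch.Tensor) -> torch.Tensor:
        return torch.nn.functional.silu(hidden @ weight)

    def round_up_to_bucket(n: int, buckets=(512, 1024, 2048, 4096, 8192)) -> int:
        # Pick the smallest bucket that fits the current number of tokens.
        for b in buckets:
            if n <= b:
                return b
        return buckets[-1]

    def run(hidden: torch.Tensor, weight: torch.Tensor) -> torch.Tensor:
        n = hidden.shape[0]
        padded = round_up_to_bucket(n)
        # Pad to the bucketed length so the compiled graph is reused across batches
        # instead of triggering a recompile for every new sequence length.
        buf = torch.zeros(padded, hidden.shape[1], device=hidden.device, dtype=hidden.dtype)
        buf[:n] = hidden
        return fused_step(buf, weight)[:n]

The bucket sizes here are made up; the point is just trading a bit of padded compute for a bounded number of compilations.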

symbolicAGI · 4h ago
Given chat and API needs for low latency, llama.cpp is probably still the best choice for self-hosted models, with or without GPU support. And Ollama is the leader for wrapping llama.cpp.

Because Tokasaurus was mentioned as better than Ollama for running Darwinian Gödel Machine operations (self-improvement), I looked for the linked repo on GitHub and it was 404. So glad it is back: https://github.com/ScalingIntelligence/tokasaurus.

nabakin · 4h ago
> On throughput-focused benchmarks, Tokasaurus can outperform vLLM and SGLang by up to 3x+.

Looks like they don't compare to TensorRT-LLM throughput numbers which, last I checked, are SOTA in open source.

andersa · 2h ago
TensorRT-LLM being open source is a lie; all the important kernels are loaded from cubins.
radq · 1h ago
Cool project! The codebase is simple and well documented, a good starting point for anyone interested in how to implement a high-performance inference engine. The prefix sharing is very relevant for anyone running batch inference to generate RL rollouts.
behnamoh · 5h ago
While Tokasaurus’s Async-TP shows impressive throughput gains, it seems over-engineered for common use cases. The CPU overhead from async tensor parallelism only pays off at 6k+ token batches, and you need NVLink-connected GPUs to see real benefits. Most prod deployments don’t need this complexity — you’re better off with simpler approaches unless you’re specifically optimizing for massive batch throughput. The adaptive manager skipping “optional” tasks under load also feels concerning from a reliability perspective.
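
Concretely, the kind of dispatch heuristic I'd expect looks something like this (the threshold is just the figure quoted above and the function is hypothetical, not a real Tokasaurus config knob):

    def choose_tp_mode(batch_tokens: int, has_nvlink: bool,
                       async_tp_threshold: int = 6144) -> str:
        # Only pay the extra CPU overhead of async tensor parallelism when the
        # batch is large enough and the GPUs are NVLink-connected.
        if has_nvlink and batch_tokens >= async_tp_threshold:
            return "async_tp"   # overlap communication with compute
        return "plain_tp"       # simpler, lower overhead, fine for latency-focused serving

If your deployment rarely crosses that threshold, the simpler path wins.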
bjt12345 · 4h ago
But surely next year's production deployments will be very different from today's, with different use cases, etc.
jdiff · 3h ago
Sure. Things change over time. Is there a reason to believe they'd be different in such a way that this would be more useful than in today's landscape? I haven't seen such a forecast myself.
YetAnotherNick · 4h ago
Depends on what production means for you. This is useful for batch production jobs.

Also, this seems very useful for generating synthetic data or labelling a bunch of data. A 6k batch size is small for data labelling.