So sonnet-4 is faster than gemini-2.5-flash at long context. That is surprising, especially since Gemini runs on those fast TPUs.
curl-up · 2h ago
Note that (in the first test, the only one where output length is reported) Gemini Pro returned more than 3x the amount of text in less than 2x the time. From my experience with Gemini, that time was probably mainly spent on thinking, the length of which is not reported here. So looking at pure output TPS, Gemini is faster, but without clear info on the thinking time/length it's impossible to judge.
jbellis · 2h ago
If they left both on defaults, Flash is thinking-by-default and Sonnet 4 is no-thinking-by-default.
bitpush · 2h ago
> Claude’s overall response was consistently around 500 words—Flash and Pro delivered 3,372 and 1,591 words by contrast.
It isn't clear from the article whether the time they quote is time-to-first-token or time to completion. If it is the latter, then it makes sense why gemini* would take longer even with similar token throughput.
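For anyone wanting to check, a minimal sketch of the distinction, assuming a hypothetical stream_tokens(prompt) generator that yields output tokens as the model produces them:

    import time

    def measure_latency(stream_tokens, prompt):
        # Separate time-to-first-token (TTFT) from time-to-completion.
        # stream_tokens is assumed: any generator that yields output
        # tokens as the model produces them (hypothetical API).
        start = time.monotonic()
        ttft = None
        n_tokens = 0
        for _ in stream_tokens(prompt):
            if ttft is None:
                # First token arrives only after any hidden "thinking" ends.
                ttft = time.monotonic() - start
            n_tokens += 1
        total = time.monotonic() - start
        # Throughput over the generation phase only, not the wait.
        tps = n_tokens / (total - ttft) if ttft is not None and total > ttft else 0.0
        return ttft, total, tps

A model that thinks before answering shows a long time to completion even if its TPS is high once tokens start flowing.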
lugao · 1h ago
Anthropic also uses TPUs for inference.
irthomasthomas · 56m ago
Do they rent them from Google? Or are they a different brand?
I'm really curious how well they perform with a long chat history. I find that Gemini often gets confused when the context is long enough and starts responding to prior prompts, whether via the CLI or its Gem chat window.
XenophileJKO · 1h ago
From my experience, Gemini is REALLY bad about context blending. It can't keep track of what I said and what it said, even in a conversation under 200K tokens. It blends concepts and statements together, then refers to some fabricated hybrid fact or comment.
Gemini has done this in ways that I haven't seen in the recent or current generation models from OpenAI or Anthropic.
It really surprised me that Gemini performs so well in multi-turn benchmarks, given that tendency.
IanCal · 32m ago
I've not experimented with the recent models for this, but older Gemini models were awful at it - they'd lie about what I'd said or what was in their system prompt, even in short conversations.
akomtu · 1h ago
IMO, a good contest between LLMs would be data compression. Each LLM is given the same pile of text, and then asked to create compact notes that fit into N pages of text. Then the original text is replaced with their notes and they need to answer a bunch of questions about the original text using the notes alone.
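Easy to sketch, too. Something like this, where ask_model() is a stand-in for whatever LLM API you'd use (the name, the 500-words-per-page budget, and the prompts are all my assumptions):

    WORDS_PER_PAGE = 500  # assumption: one "page" of notes ~= 500 words

    def ask_model(prompt: str) -> str:
        # Stand-in for a real LLM API call.
        raise NotImplementedError

    def compression_contest(corpus: str, questions: list[str], n_pages: int):
        budget = n_pages * WORDS_PER_PAGE
        notes = ask_model(
            f"Compress the following text into at most {budget} words "
            f"of notes, keeping as much factual detail as possible:\n\n{corpus}"
        )
        answers = []
        for q in questions:
            # Answer from the notes alone; the original corpus is gone.
            answers.append(ask_model(
                f"Using only these notes, answer the question.\n\n"
                f"Notes:\n{notes}\n\nQuestion: {q}"
            ))
        return notes, answers

Score the answers against ones produced with the full text available and you get a direct measure of how lossy each model's note-taking is.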
koakuma-chan · 2h ago
I really doubt you can fit all Harry Potter books in 1M tokens.
PeterStuer · 1h ago
The series is 1,084,170 words. At, let's say, 1.4 tokens per word, this would not fit, but it is getting close.
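(Working it out: 1,084,170 × 1.4 ≈ 1.52M tokens, so about 50% over a 1M window.)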
koakuma-chan · 48m ago
It's 2M tokens for Gemini.
gcr · 2h ago
The entire HP series is about one million words.
koakuma-chan · 1h ago
Harry Potter and the Order of the Phoenix alone is 400K tokens.