Claude Sonnet 4 now supports 1M tokens of context

455 points | adocomplete | 8/12/2025, 4:02:23 PM | anthropic.com ↗

Comments (223)

aliljet · 2h ago
This is definitely one of my CORE problem as I use these tools for "professional software engineering." I really desperately need LLMs to maintain extremely effective context and it's not actually that interesting to see a new model that's marginally better than the next one (for my day-to-day).

However, price is king. Allowing me to flood the context window with my code base is great, but given that the price has substantially increased, it makes more sense to manage the context window carefully in the current situation. The value I'm getting by flooding their context window is great for them, but short of evals that look into how effectively Sonnet stays on track, it's not clear the value actually exists here.

benterix · 2h ago
> it's not clear if the value actually exists here.

Having spent a couple of weeks on Claude Code recently, I arrived at the conclusion that the net value for me from agentic AI is actually negative.

I will give it another run in 6-8 months though.

cambaceres · 1h ago
For me it’s meant a huge increase in productivity, at least 3X.

Since so many claim the opposite, I'm curious what you do, more specifically? I guess different roles/technologies benefit more from agents than others.

I build full stack web applications in node/.net/react; more importantly (I think), I work at a small startup and manage 3 applications myself.

wiremine · 1h ago
> Having spent a couple of weeks on Claude Code recently, I arrived to the conclusion that the net value for me from agentic AI is actually negative.

> For me it’s meant a huge increase in productivity, at least 3X.

How do we reconcile these two comments? I think that's a core question of the industry right now.

My take, as a CTO, is this: we're giving people new tools, and very little training on the techniques that make those tools effective.

It's sort of like we're dropping trucks and airplanes on a generation that only knows walking and bicycles.

If you've never driven a truck before, you're going to crash a few times. Then it's easy to say "See, I told you, this new fangled truck is rubbish."

Those who practice with the truck are going to get the hang of it, and figure out two things:

1. How to drive the truck effectively, and

2. When NOT to use the truck... when walking or the bike is actually the better way to go.

We need to shift the conversation to techniques, and away from the tools. Until we do that, we're going to be forever comparing apples to oranges and talking around each other.

weego · 26m ago
In a similar role and place with this.

My biggest take so far: If you're a disciplined coder who can handle 20% of an entire project's time (project being anything from a bug fix through to an entire app) being spent on research, planning and breaking those plans into phases and tasks, then augmenting your workflow with AI appears to have large gains in productivity.

Even then you need to learn a new version of explaining it 'out loud' to get proper results.

If you're more inclined to dive in and plan as you go, and store the scope of the plan in your head because "it's easier that way" then AI 'help' will just fundamentally end up in a mess of frustration.

jeremy_k · 1h ago
Well put. It really does come down to nuance. I find Claude is amazing at writing React / Typescript. I mostly let it do its own thing and skim the results after. I have it write Storybook components so I can visually confirm things look how I want. If something isn't quite right I'll take a look, and if I can spot the problem and fix it myself, I'll do that. If I can't quickly spot it, I'll write up a prompt describing what is going on and work through it with AI assistance.

Overall, React / Typescript I heavily let Claude write the code.

The flip side of this is my server code is Ruby on Rails. Claude helps me a lot less here because this is my primary coding background. I also have a certain way I like to write Ruby. In these scenarios I'm usually asking Claude to generate tests for code I've already written and supplying lots of examples in context so the coding style matches. If I ask Claude to write something novel in Ruby I tend to use it as more of a jumping off point. It generates, I read, I refactor to my liking. Claude is still very helpful, but I tend to do more of the code writing for Ruby.

Overall, helpful for Ruby, I still write most of the code.

These are the nuances I've come to find and what works best for my coding patterns. But to your point, if you tell someone "go use Claude" and they have a preference in how to write Ruby and they see Claude generate a bunch of Ruby they don't like, they'll likely dismiss it as "This isn't useful. It took me longer to rewrite everything than just doing it myself". Which all goes to say, time using the tools, whether it's Cursor, Claude Code, etc (I use OpenCode), is the biggest key, but figuring out how to get over the initial hump is probably the biggest hurdle.

k9294 · 43m ago
For this very reason I switched to TS for the backend as well. I'm not a big fan of JS, but the productivity gain of having shared types between frontend and backend, plus Claude Code's proficiency with TS, is immense.
jeremy_k · 1m ago
I considered this, but I'm just too comfortable writing my server logic in Ruby on Rails (as I do that for my day job and side project). I'm super comfortable writing client side React / Typescript but whenever I look at server side Typescript code I'm like "I should understand what this is doing but I don't" haha.
croes · 47m ago
Do you only skim the results or do you audit them at some point to prevent security issues?
jeremy_k · 5m ago
What kind of security issues are you thinking about? I'm generating UI components like Selects for certain data types or Charts of data.
quikoa · 48m ago
It's not just about the programmer and his experience with AI tools. The problem domain and programming language(s) used for a particular project may have a large impact on how effective the AI can be.
jdgoesmarching · 1h ago
Agreed, and it drives me bonkers when people talk about AI coding as if it represents a single technique, process, or tool.

Makes me wonder if people spoke this way about “using computers” or “using the internet” in the olden days.

We don’t even fully agree on the best practices for writing code without AI.

mh- · 15m ago
> Makes me wonder if people spoke this way about “using computers” or “using the internet” in the olden days.

Older person here: they absolutely did, all over the place in the early 90s. I remember people decrying projects that moved them to computers everywhere I went. Doctors offices, auto mechanics, etc.

Then later, people did the same thing about the Internet (which by 2000 was written as a single word with a capital I, having previously been written as two separate words).

https://i.imgur.com/vApWP6l.png

jacquesm · 1m ago
And not all of those people were wrong.
moregrist · 18m ago
> Makes me wonder if people spoke this way about “using computers” or “using the internet” in the olden days.

There were gobs of terrible road metaphors that spun out of calling the Internet the “Information Superhighway.”

Gobs and gobs of them. All self-parody to anyone who knew anything.

I hesitate to relate this to anything in the current AI era, but maybe the closest (and in a gallows humor/doomer kind of way) is the amount of exec speak on how many jobs will be replaced.

ath3nd · 16m ago
> How do we reconcile these two comments? I think that's a core question of the industry right now.

The most recent study focusing on experienced developers showed a net negative in productivity when using an LLM solution in their flow:

https://metr.org/blog/2025-07-10-early-2025-ai-experienced-o...

My conclusion on this, as an ex VP of Engineering, is that good senior developers find little utility in LLMs and may even find them to be a nuisance/detriment, while for juniors they can be a godsend, as they help them with syntax and coax the solution out of them.

It's like training wheels to a bike. A toddler might find 3x utility, while a person who actually can ride a bike well will find themselves restricted by training wheels.

nabla9 · 55m ago
I agree.

I experience a productivity boost, and I believe it's because I prevent LLMs from making design choices or handling creative tasks. They're best used as a "code monkey": filling in function bodies once I've defined them. I design the data structures, functions, and classes myself. LLMs also help with learning new libraries by providing examples, and they can even write unit tests that I manually check. Importantly, no code I haven't read and accepted ever gets committed.
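
To make the split concrete, here's a minimal hypothetical sketch of that kind of workflow: the data structure, signature, and docstring are designed and reviewed by a human, and only the function body is the part handed off to the model (all names below are made up).

    from dataclasses import dataclass

    # Hypothetical example: the human designs the data structure and the
    # function signature/contract; the LLM only fills in the body.
    @dataclass
    class Order:
        symbol: str
        quantity: int
        price: float

    def total_exposure(orders: list[Order]) -> float:
        """Return the sum of quantity * price across all orders."""
        # This body is the part the "code monkey" writes, then it gets read
        # and accepted (or rejected) by the human before anything is committed.
        return sum(o.quantity * o.price for o in orders)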

Then I see people doing things like "write an app for ....", run, hey it works! WTF?

rs186 · 1h ago
3X if not 10X if you are starting a new project with Next.js, React, Tailwind CSS for fullstack website development, solving an everyday problem. Yeah I just witnessed that yesterday when creating a toy project.

For my company's codebase, where we use internal tools and proprietary technology, solving a problem that does not exist outside the specific domain, on a codebase of over 1000 files? No way. Even locating the correct file to edit is non trivial for a new (human) developer.

tptacek · 1m ago
That's an interesting comment, because "locating the correct file to edit" was the very first thing LLMs did that was valuable to me as a developer.
mike_hearn · 1h ago
My codebase has about 1500 files and is highly domain specific: it's a tool for shipping desktop apps[1] that handles all the building, packaging, signing, uploading etc for every platform on every OS simultaneously. It's written mostly in Kotlin, and to some extent uses a custom in-house build system. The rest of the build is Gradle, which is a notoriously confusing tool. The source tree also contains servers, command line tools and a custom scripting language which is used for all the scripting needs of the project [2].

The code itself is quite complex and there's lots of unusual code for munging undocumented formats, speaking undocumented protocols, doing cryptography, Mac/Windows specific APIs, and it's all built on a foundation of a custom parallel incremental build system.

In other words: a nightmare codebase for an LLM. Nothing like other codebases. Yet Claude Code demolishes problems in it without breaking a sweat.

I don't know why people have different experiences but speculating a bit:

1. I wrote most of it myself and this codebase is unusually well documented and structured compared to most. All the internal APIs have full JavaDocs/KDocs, there are extensive design notes in Markdown in the source tree, the user guide is also part of the source tree. Files, classes and modules are logically named. Files are relatively small. All this means Claude can often find the right parts of the source within just a few tool uses.

2. I invested in making a good CLAUDE.md and also wrote a script to generate "map.md" files that are at the top of every module. These map files contain one-liners of what every source file contains. I used Gemini to make these due to its cheap 1M context window. If Claude does struggle to find the right code by just reading the context files or guessing, it can consult the maps to locate the right place quickly. (A sketch of what such a script could look like follows this list.)

3. I've developed a good intuition for what it can and cannot do well.

4. I don't ask it to do big refactorings that would stress the context window. IntelliJ is for refactorings. AI is for writing code.
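
A rough sketch of the map.md idea from point 2; the file paths, glob patterns, prompt wording, model id, and the use of the Anthropic SDK here (rather than the Gemini setup actually described above) are all illustrative assumptions, not the real script.

    # Sketch: for each module, ask an LLM for a one-line summary of every
    # source file and write the results to <module>/map.md.
    from pathlib import Path

    import anthropic

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

    def one_liner(source: Path) -> str:
        """Ask the model for a single-sentence description of a source file."""
        resp = client.messages.create(
            model="claude-sonnet-4-20250514",  # assumed model id
            max_tokens=100,
            messages=[{
                "role": "user",
                "content": "In one sentence, describe what this file contains:\n\n"
                           + source.read_text(errors="ignore")[:20_000],
            }],
        )
        return resp.content[0].text.strip()

    def write_map(module_dir: Path) -> None:
        """Write map.md with one line per Kotlin source file in the module."""
        lines = [f"- `{f.relative_to(module_dir)}`: {one_liner(f)}"
                 for f in sorted(module_dir.rglob("*.kt"))]
        (module_dir / "map.md").write_text("\n".join(lines) + "\n")

    if __name__ == "__main__":
        for src in Path(".").glob("*/src"):
            write_map(src.parent)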

[1] https://hydraulic.dev

[2] https://hshell.hydraulic.dev/

GenerocUsername · 1h ago
Your first week of AI usage should be crawling your codebase and generating context.md docs that can then be fed back into future prompts so that AI understands your project space, packages, apis, and code philosophy.

I guarantee your internal tools are not revolutionary, they are just unrepresented in the ML model out of the box

orra · 1h ago
That sounds incredibly boring.

Is it effective? If so I'm sure we'll see models to generate those context.md files.

cpursley · 35m ago
Yes. And way less boring than manually reading a section of a codebase to understand what is going on after being away from it for 8 months. Claude's docs and git commit writing skills are worth it for that alone.
blitztime · 58m ago
How do you keep the context.md updated as the code changes?
shmoogy · 26m ago
I tell Claude to update it generally but you can probably use a hook
tombot · 10m ago
This, while it has context of the current problem, just ask Claude to re-read its own documentation and think of things to add that will help it in the future
nicce · 1h ago
Even then, are you even allowed to use AI on such a codebase? Is some part of the code "bought", e.g. generated by a commercial compiler with a specific license? Is a pinky promise from the LLM provider enough?
MattGaiser · 1h ago
Yeah, anecdotally it is heavily dependent on:

1. Using a common tech. It is not as good at Vue as it is at React.

2. Using it in a standard way. To get AI to really work well, I have had to change my typical naming conventions (or specify them in detail in the instructions).

nicce · 1h ago
React also seems to effectively be an alias for Next.js. Models have a hard time telling the difference.
elevatortrim · 1h ago
I think there are two broad cases where ai coding is beneficial:

1. You are a good coder but working on a project that is new to you, building a new project, or working with a technology you are not familiar with. This is where AI is hugely beneficial. It does not only accelerate you, it lets you do things you could not do otherwise.

2. You have spent a lot of time on engineering your context and learning what AI is good at, and using it very strategically where you know it will save time and not bother otherwise.

If you are a really good coder, really familiar with the project, and mostly changing its bits and pieces rather than building new functionality, AI won’t accelerate you much. Especially if you did not invest the time to make it work well.

bcrosby95 · 23m ago
My current guess is it's how the programmer solves problems in their head. This isn't something we talk about much.

People seem to find LLMs do well with well-spec'd features. But for me, creating a good spec doesn't take any less time than creating the code. The problem for me is the translation layer that turns the model in my head into something more concrete. As such, creating a spec for the LLM doesn't save me any time over writing the code myself.

So if it's a one shot with a vague spec and that works that's cool. But if it's well spec'd to the point the LLM won't fuck it up then I may as well write it myself.

acedTrex · 1h ago
I have yet to get it to generate code past 10ish lines that I am willing to accept. I read stuff like this and wonder how low yall's standards are, or if you are working on projects that just do not matter in any real world sense.
dillydogg · 1h ago
Whenever I read comments from the people singing their praises of the technology, it's hard not to think of the study that found AI tools made developers slower in early 2025.

>When developers are allowed to use AI tools, they take 19% longer to complete issues—a significant slowdown that goes against developer beliefs and expert forecasts. This gap between perception and reality is striking: developers expected AI to speed them up by 24%, and even after experiencing the slowdown, they still believed AI had sped them up by 20%.

https://metr.org/blog/2025-07-10-early-2025-ai-experienced-o...

mstkllah · 20m ago
Ah, the very extensive study with 16 developers. Bulletproof results.
djeastm · 30m ago
Standards are going to be as low as the market allows, I think. In some industries code quality is paramount; other times it's negligible, and perhaps speed of development is a higher priority and the code is mostly disposable.
spicyusername · 1h ago
4/5 times I can easily get 100s of lines of output that only need a quick once-over.

1/5 times, I spend an extra hour tangled in code it outputs that I eventually just rewrite from scratch.

Definitely a massive net positive, but that 20% is extremely frustrating.

acedTrex · 1h ago
That is fascinating to me, I've never seen it generate that much code that is actually something I would consider correct. It's always wrong in some way.
byryan · 17m ago
That makes sense, especially if you're building web applications that are primarily "just" CRUD operations. If a lot of the API calls follow the same pattern and the application is just a series of API calls + React UI, then that seems like something an LLM would excel at. LLMs are also more proficient in TypeScript/JS/Python compared to other languages, so that helps as well.
nicce · 1h ago
> I build full stack web applications in node/.net/react, more importantly (I think) is that I work on a small startup and manage 3 applications myself.

I think this is your answer. For example, React and JavaScript are extremely popular and have been around a long time. Are you using TypeScript and trying to get the most out of the types, or are you accepting everything the LLM gives you as JavaScript? How interested are you in whether the code is using "soon to be deprecated" functions or the most optimized loop/implementation? How about the project structure?

In other cases, the more precision you need, the less effective the LLM is.

evantbyrne · 1h ago
The problem with these discussions is that almost nobody outside of the agency/contracting world seems to track their time. Self-reported data is already sketchy enough without layering on the issue of relying on distant memory of fine details.
thanhhaimai · 1h ago
I work across the stack (frontend, backend, ML)

- For FrontEnd or easy code, it's a speed up. I think it's more like 2x instead of 3x.

- For my backend (hard trading algo), it has like 90% failure rate so far. There is just so much for it to reason through (balance sheet, lots, wash, etc). All agents I have tried, even on Max mode, couldn't reason through all the cases correctly. They end up thrashing back and forth. Gemini most of the time will go into the "depressed" mode on the code base.

One thing I notice is that the Max mode on Cursor is not worth it for my particular use case. The problem is either easy (frontend), which means any agent can solve it, or it's hard, and Max mode can't solve it. I tend to pick the fast model over strong model.

squeaky-clean · 45m ago
I just want to point out that they only said agentic models were a negative, not AI in general. I don't know if this is what they meant, but I personally prefer to use a web or IDE AI tool and don't really like the agentic stuff compared to those. For me agentic AI would be a net positive against no-AI, but it's a net negative compared to other AI interfaces
andrepd · 1h ago
Self-reports are notoriously overexcited, real results are, let's say, not so stellar.

https://metr.org/blog/2025-07-10-early-2025-ai-experienced-o...

dingnuts · 1h ago
You have small applications following extremely common patterns and using common libraries. Models are good at regurgitating patterns they've seen many times, with fuzzy find/replace translations applied.

Try to build something like Kubernetes from the ground up and let us know how it goes. Or try writing a custom firmware for a device you just designed. Something like that.

dmitrygr · 28m ago
> For me it’s meant a huge increase in productivity, at least 3X.

Quite possibly you are doing very common things that are often done and thus are in the training set a lot; the parent poster is doing something more novel that forces the model to extrapolate, which they suck at.

datadrivenangel · 1h ago
How do you structure your applications for maintainability?
flowerthoughts · 1h ago
What type of work do you do? And how do you measure value?

Last week I was using Claude Code for web development. This week, I used it to write ESP32 firmware and a Linux kernel driver. Sure, it made mistakes, but the net was still very positive in terms of efficiency.

verall · 52m ago
> This week, I used it to write ESP32 firmware and a Linux kernel driver.

I'm not meaning to be negative at all, but was this for a toy/hobby or for a commercial project?

I find that LLMs do very well on small greenfield toy/hobby projects but basically fall over when brought into commercial projects that often have bespoke requirements and standards (i.e. has to cross compile on qcc, comply with autosar, in-house build system, tons of legacy code lying around that may or may not be used).

So no shade - I'm just really curious what kind of project you were able get such good results writing ESP32 FW and kernel drivers for :)

lukebechtel · 47m ago
Maintaining project documentation is:

(1) Easier with AI

(2) Critical for letting AI work effectively in your codebase.

Try creating well structured rules for working in your codebase, put in .cursorrules or Claude equivalent... let AI help you... see if that helps.

GodelNumbering · 20m ago
This is my experience too. Also, their propensity to jump into code without necessarily understanding the requirement is annoying to say the least. As the project complexity grows, you find yourself writing longer and longer instructions just to guardrail.

Another rather interesting thing is that they tend to gravitate towards sweep-the-errors-under-the-rug kind of coding, which is disastrous. e.g. "return X if we don't find the value so downstream doesn't crash". These are the kind of errors that no human, even a beginner on their first day learning to code, would make, and they are extremely annoying to debug.

Tl;dr: LLMs' tendency to treat every single thing you give it as a demo homework project

tombot · 11m ago
> their propensity to jump into code without necessarily understanding the requirement is annoying to say the least.

Then don't let it, collaborate on the spec, ask Claude to make a plan. You'll get far better results

https://www.anthropic.com/engineering/claude-code-best-pract...

oceanplexian · 45m ago
I work in FAANG, have been for over a decade. These tools are creating a huge amount of value, starting with Copilot but now with tools like Claude Code and Cursor. The people doing so don’t have a lot of time to comment about it on HN since we’re busy building commercial products used by millions of people.
nme01 · 6m ago
I also work for a FAANG company and so far most employees agree that while LLMs are good for writing docs, presentations or emails, they still lack a lot when it comes to writing maintainable code (especially in Java; they supposedly do better in Go, don't know why, not my opinion). Even simple refactorings need to be carefully checked. I really like them for doing stuff that I know nothing about though (eg write a script using a certain tool, tell me how to rewrite my code to use a certain library etc) or for reviewing changes
GodelNumbering · 9m ago
I don't see how FAANG is relevant here. But the 'FAANG' I used to work at had an emergent problem of people throwing a lot of half baked 'AI-powered' code over the wall and let reviewers deal with it (due to incentives, not that they were malicious). In orgs like infra where everything needs to be reviewed carefully, this is purely a burden
jpc0 · 18m ago
> These tools are creating a huge amount of value...

> The people doing so don’t have a lot of time to comment about it on HN since we’re busy building…

“We’re so much more productive that we don’t have time to tell you how much more productive we are”

Do you see how that sounds?

wijwp · 8m ago
To be fair, AI isn't going to give us more time outside work. It'll just increase expectations from leadership.
nomel · 36m ago
What are the AI usage policies like at your org? Where I am, we’re severely limited.
greenie_beans · 26m ago
same. agents are good with easy stuff and debugging but extremely bad with complexity. has no clue about Chesterton's fence, and it's hard to parse the results especially when it creates massive diffs. creates a ton of abandoned/cargo-cult code. lots of misdirection with OOP.

chatting with claude and copy/pasting code between my IDE and claude is still effective for more complex stuff.

mikepurvis · 1h ago
For a bit more nuance, I think my overall net is about break even. But I don't take that as "it's not worth it at all, abandon ship" but rather that I need to hone my instinct of what is and is not a good task for AI involvement, and what that involvement should look like.

Throwing together a GHA workflow? Sure, make a ticket, assign it to copilot, check in later to give a little feedback and we're golden. Half a day of labour turned into fifteen minutes.

But there are a lot of tasks that are far too nuanced where trying to take that approach just results in frustration and wasted time. There it's better to rely on editor completion or maybe the chat interface, like "hey I want to do X and Y, what approach makes sense for this?" and treat it like a rubber duck session with a junior colleague.

mark_l_watson · 1h ago
I am sort of with you. I am down to asking Gemini Pro a couple of questions a day, use ChatGPT just a few times a week, and about once a week use gemini-cli (either a short free session, or a longer session where I provide my API key.)

That said I spend (waste?) an absurdly large amount of time each week experimenting with local models (sometimes practical applications, sometimes ‘research’).

revskill · 1h ago
Truth. To some extent, the agent doesn't know what it's doing at all; it lacks a real brain. Maybe we should just treat it as a hard worker.
wahnfrieden · 1h ago
Did you try with using Opus exclusively?
freedomben · 1h ago
Do you know if there's a way to force Claude code to do that exclusively? I've found a few env vars online but they don't seem to actually work
atonse · 1h ago
You can type /config and then go to the setting to pick a model.
gdudeman · 1h ago
Yes: type /model and then pick Opus 4.1.
artursapek · 1h ago
You can "force" it by just paying them $200 (which is nothing compared to the value)
epiccoleman · 56m ago
is Opus that much better than Sonnet? My sub is $20 a month, so I guess I'd have to buy that I'm going to get a 10x boost, which seems dubious
parineum · 1h ago
Value is irrelevant. What's the return on investment you get from spending $200?

Collecting value doesn't really get you anywhere if nobody is compensating you for it. Unless someone is going to either pay for it for you or give you $200/mo post-tax dollars, it's costing you money.

wahnfrieden · 1h ago
The return for me is faster output of features, fixes, and polish for my products which increases revenue above the cost of the tool. Did you need to ask this?
parineum · 33m ago
Yes, I did. Not everybody has their own product that might benefit from a $200 subscription. Most of us work for someone else and, unless that person is paying for the subscription, the _value_ it adds is irrelevant unless it results in better compensation.

Furthermore, the advice was given to upgrade to a $200 subscription from the $20 subscription. The difference in value that might translate into income between the $20 option and the $200 option is very unclear.

wahnfrieden · 20m ago
If you are employed you should petition your employer for tools you want. Maybe you can use it to take the day off earlier or spend more time socializing. Or to get a promotion or performance bonus. Hopefully not just to meet rising productivity expectations without being handed the tools needed to achieve that. Having full-time access to these tools can also improve your own skills in using them, to profit from in a later career move or from contributing toward your own ends.
wahnfrieden · 1h ago
Peter Steinberger has been documenting his workflows and he relies exclusively on Opus at least until recently. (He also pays for a few Max 20x subscriptions at once to avoid rate limits.)
rootnod3 · 2h ago
Flooding the context also means increasing the likelihood of the LLM confusing itself. Mainly because of the longer context. It derails along the way without a reset.
aliljet · 2h ago
How do you know that?
giancarlostoro · 1h ago
Here's a paper from MIT that covers how this could be resolved in an interesting fashion:

https://hanlab.mit.edu/blog/streamingllm

The AI field is reusing existing CS concepts for AI that we never had hardware for, and now these people are learning how applied Software Engineering can make their theoretical models more efficient. It's kind of funny, I've seen this in tech over and over. People discover new thing, then optimize using known thing.

mamp · 14m ago
Unfortunately, I think the context rot paper [1] found that the performance degradation when context increased still occurred in models using attention sinks.

1. https://research.trychroma.com/context-rot

bigmadshoe · 2h ago
joenot443 · 8m ago
This is a good piece. Clearly it's a pretty complex problem and the intuitive result a layman engineer like myself might expect doesn't reflect the reality of LLMs. Regex works as reliably on 20 characters as it does on 2M characters; the only difference is speed. I've learned this will probably _never_ be the case with LLMs, there will forever exist some level of epistemic doubt in its result.

When they announced Big Contexts in 2023, they referenced being able to find a single changed sentence in the context's copy of Great Gatsby[1]. This example seemed _incredible_ to me at the time but now two years later I'm feeling like it was pretty cherry-picked. What does everyone else think? Could you feed a novel into an LLM and expect it to find the single change?

[1] https://news.ycombinator.com/item?id=35941920

rootnod3 · 1h ago
The longer the context and the discussion goes on, the more it can get confused, especially if you have to refine the conversation or code you are building on.

Remember, at its core it's basically a text prediction engine. So the more varying context there is, the more likely it is to make a mess of it.

Short context: the conversation leaves the context window and it loses context. Long context: it can mess with the model. So the trick is to strike a balance. But if it's an online model, you have fuck all to control. If it's a local model, you have some say in the parameters.

anonz4FWNqnX · 1h ago
I've had similar experiences. I've gone back and forth between running models locally and using the commercial models. The local models can be incredibly useful (gemma, qwen), but they need more patience and effort to get them working.

One advantage to running locally[1] is that you can set the context length manually and see how well the llm uses it. I don't have an exact experience to relay, but it's not unusual for models to allow longer contexts but ignore that context.

Just making the context big doesn't mean the LLM is going to use it well.

[1] I've been using LM Studio on both a MacBook Air and a MacBook Pro. Even a MacBook Air with 16G can run pretty decent models.

nomel · 23m ago
A good example of this was the first Gemini model that allowed 1 million tokens, but would lose track of the conversation after a couple paragraphs.
fkyoureadthedoc · 1h ago
F7F7F7 · 2h ago
What do you think happens when things start falling outside of its context window? It loses access to parts of your conversation.

And that’s why it will gladly rebuild the same feature over and over again.

jacquesm · 2m ago
Prediction: the complexity of the prompt will approach the complexity of the work as the work gets more complex. So for simple stuff you'll be ahead, and for more complex stuff you'll end up in an endless cycle of refinement to get what you want.

The reason why I'm predicting this: I've been following this saga since 1985 or so when 'Mimer' was touted to be the next best thing since sliced bread. It would make programmers obsolete because now the managers could just input the spec and a fully debugged perfect program would roll out. So called 'RAD' tools were all the rage back then. Much like 'COBOL' in the 60's (which, you've guessed it, was so powerful managers could now create software themselves).

Both of these failed.

Then we got ever more frequent re-runs of this movie, but it always ended the same: toy projects, proofs of concept, those were easy. But any attempt to expand the project would sooner or later run into the limitations of the tooling. And then the spec would start to balloon. Up to the point where the managers would throw their hands up and call the 'real' programmers asking them to fix their little issue. Which almost always ended up with the problem not being fixed or a complete rewrite in a proper programming language. COBOL really was a step up from assembly, so it earned its place, even if managers couldn't express their wishes in that English like programming language. But MIMER wasn't a step up as much as it was a step sideways, just another indirection layer and one that made debugging much harder.

So now we are in the AI assisted programmer era. And we're doing the same thing, again. This time the tools are a lot smarter. But the symptoms are there already: do something simple and the tool is amazing: it almost reads your mind. But you have to keep the scope small and show it only the bit that it needs to know about or you will have a lot of work that you don't need (efficiency, right?). If you start expanding the scope, you'll find your specs (we call them prompts now, but they're the same thing to me) start to grow at a much faster rate than the scope of the work. Because suddenly you are not just trying to control the local scope, but also the much larger global scope and all of the side effects. In a regular programming language such scope creep can be fought. Some ways to do this are to use side effect free code and so called 'pure functions'. But your AI isn't quite clever enough to do this where it is needed and to drop it where you actually want those side effects.

Endless prompt-revisions ensue until you think it does what it should. Or does it? What about bad inputs? What about race conditions? What about malicious users? What about security? Never mind, ship it.

I think - and that's possibly delusional - that there will still be a market for well crafted software, and that AI may be able to help an expert, for instance with providing guardrails and boilerplate. But I've never found my typing speed to be much of a limitation and the guardrails that an AI can provide seem rather superficial to me. The 1x programmer that suddenly becomes a 10x programmer on a major codebase I've yet to meet, though I'm sure that if they exist - or at least, believe that they exist - they'll correct me shortly.

alexchamberlain · 2h ago
I'm not sure how, and maybe some of the coding agents are doing this, but we need to teach the AI to use abstractions, rather than the whole code base for context. We as humans don't hold the whole codebase in our heads, and we shouldn't expect the AI to either.
anthonypasq · 1h ago
The fact we can't keep the repo in our working memory is a flaw of our brains. I can't see how you could possibly make the argument that if you were somehow able to keep the entire codebase in your head that it would be a disadvantage.
SkyBelow · 1h ago
Information tradeoff. Even if you could keep the entire code base in memory, if something else has to be left out of memory, then you have to consider the value of an abstraction versus whatever other information is lost. Abstractions also apply to the business domain and work the same way.

You also have time tradeoffs. Like time to access memory and time to process that memory to achieve some outcome.

There is also quality. If you can keep the entire code base in memory but with some chance of confusion, while abstractions will allow less chance of confusion, then the tradeoff of abstractions might be worth it still.

Even if we assume a memory that has no limits, can access and process all information at constant speed, and no quality loss, there is still communication limitations to worry about. Energy consumption is yet another.

sdesol · 1h ago
LLMs (in their current implementation) are probabilistic, so they really need the actual code to predict the most likely next tokens. Now, loading the whole code base can be a problem in itself, since other files may negatively affect the next token.
nomel · 16m ago
No, it doesn’t, nor do we. It’s why abstractions and documentations exist.

If you know what a function achieves, and you trust it to do that, you don’t need to see/hold its exact implementation in your head.

sdesol · 9m ago
But documentation doesn't include styling or preferred patterns, which is why I think a lot of people complain that the LLM will just produce garbage. Also, documentation is not guaranteed to be correct or up to date. To be able to produce the best code based on what you are hoping for, I do think having the actual code is necessary, unless styling/design patterns are not important; in that case, yes, documentation will suffice, provided it is accurate and up to date.
photon_lines · 1h ago
Sorry -- I keep seeing this being used but I'm not entirely sure how it differs from most of human thinking. Most human 'reasoning' is probabilistic as well, and we rely on 'associative' networks to ingest information. In a similar manner, LLMs use association as well -- and not only that, but they are capable of figuring out patterns based on examples (just like humans are) -- read this paper for context: https://arxiv.org/pdf/2005.14165. In other words, they are capable of grokking patterns from simple data (just like humans are).

I've given various LLMs my requirements and they produced working solutions for me by simply 1) including all of the requirements in my prompt and 2) asking them to think through and 'reason' through their suggestions, and the products have always been superior to what most humans have produced.

The 'LLMs are probabilistic predictors' comments though keep appearing on threads and I'm not quite sure I understand them -- yes, LLMs don't have 'human context', i.e. the data needed to understand human beings, since they have not directly been fed human experiences, but for the most part LLMs are not simple 'statistical predictors' as everyone brands them to be. You can see a thorough write-up I did of what GPT is / was here if you're interested: https://photonlines.substack.com/p/intuitive-and-visual-guid...
didibus · 1h ago
You seem possibly more knowledgeable then me on the matter.

My impression is that LLMs predict the next token based on the prior context. They do that by having learned a probability distribution from tokens -> next-token.

Then as I understand, the models are never reasoning about the problem, but always about what the next token should be given the context.

The chain of thought is just rewarding them so that the next token isn't predicting the token of the final answer directly, but instead predicting the token of the reasoning to the solution.

Since human language in the dataset contains text that describes many concepts and offers many solutions to problems, it turns out that predicting the text that describes the solution to a problem often ends up being the correct solution to the problem. That this was true was kind of a lucky accident and is where all the "intelligence" comes from.

photon_lines · 17m ago
So - in the pre-training step you are right -- they are simple 'statistical' predictors, but there are more steps involved in their training which turn them from simple predictors into something able to capture patterns and reason -- I tried to come up with an intuitive overview of how they do this in the write-up and I'm not sure I can give you a simple explanation here, but I would recommend you play around with Deep-Seek and other more advanced 'reasoning' or 'chain-of-reason' models and ask them to perform tasks for you: they are not simply statistically combining information together. Many times they are able to reason through and come up with extremely advanced working solutions. To me this indicates that they are not 'accidentally' stumbling upon solutions based on statistics -- they actually are able to 'understand' what you are asking them to do and to produce valid results.
sdesol · 1h ago
I'm not sure if I would say human reasoning is 'probabilistic', unless you are taking a very far step back and saying that based on how a person lived, they have ingrained biases (weights) that dictate how they reason. I don't know if LLMs have a built-in scepticism like humans do, which plays a significant role in reasoning.

Regardless of whether you believe LLMs are probabilistic or not, I think what we are both saying is that context is king, and what it (the LLM) says is dictated by the context (either from training or introduced by the user).

photon_lines · 13m ago
'I don't know if LLMs have a built in scepticism like humans do' - humans don't have an 'in-built skepticism' -- we learn it through experience and through being taught how to 'reason' in school (and it takes a very long time to do this). You believe that this is ingrained, but you may have forgotten having to slog through most of how the world works and being tested when you went to school and when your parents taught you these things. On the context component: yes, context is vitally important (just as it is with humans) -- you can't produce a great solution unless you understand the 'why' behind it and how the current solution works, so I 100% agree with that.
Workaccount2 · 16m ago
Humans have a neuro-chemical system that performs operations with electrical signals.

That's the level to look at, unless you have a dualist view of the brain (i.e. we are channeling supernatural forces).

siwatanejo · 2h ago
I do think AIs are already using abstractions, otherwise you would be submitting all the source code of your dependencies into the context.
F7F7F7 · 2h ago
There are a billion and one repos that claim to help do this. Let us know when you find one.
throwaway314155 · 1h ago
/compact in Claude Code is effectively this.
seanmmward · 1h ago
The primary use case isn't just about shoving more code in context, although depending on the task there is an irreducible minimum context needed for it to capture all the needed understanding. The 1M context model is a unique beast in terms of how you need to feed it, and its real power is being able to tackle long-horizon tasks which require iterative exploration, in-context learning, and resynthesis. I.e., some problems are breadth (go fix an api change in 100 files), others however require depth (go learn from trying 15 different ways to solve this problem). 1M Sonnet is unique in its capabilities for the latter in particular.
hinkley · 46m ago
Sounds to me like your problem has shifted from how much the AI tool costs per hour to how much it costs per token because resetting a model happens often enough that the price doesn't amortize out per hour. That giant spike every ?? months overshadows the average cost per day.

I wonder if this will become more universal, and if we won't see a 'tick-tock' pattern like Intel used, where they tweak the existing architecture one or more times between major design work. The 'tick' is about keeping you competitive and the 'tock' is about keeping you relevant.

sdesol · 2h ago
> I really desperately need LLMs to maintain extremely effective context

I actually built this. I'm still not ready to say "use the tool yet" but you can learn more about it at https://github.com/gitsense/chat.

The demo link is not up yet as I need to finalize an admin tool, but you should be able to follow the npm instructions to play around with it.

The basic idea is, you should be able to load your entire repo or repos and use the context builder to help you refine it. Or you can create custom analyzers that you can do 'AI Assisted' searches with, like executing `!ask find all frontend code that does [this]`, and because the analyzer knows how to extract the correct metadata to support that query, you'll be able to easily build the context using it.

hirako2000 · 1h ago
Not clear how it gets around what is, ultimately, a context limit.

I've been fiddling with some process too, would be good if you shared the how. The readme looks like yet another full fledged app.

sdesol · 1h ago
Yes there is a context window limit, but I've found for most frontier models, you can generate very effective code if the context window is under 75,000 tokens provided the context is consistent. You have to think of everything from a probability point of view and the more logical the context, the greater the chances of better code.

For example, if the frontend doesn't need to know the backend code (other than the interface), not including the backend code when solving a frontend problem can reduce context size and improve the chances of expected output. You just need to ensure you include the necessary interface documentation.

As for the full fledged app, I think you raised a good point and I should add a 'No lock in' section for why to use it. The app has a message tool that lets you pick and choose what messages to copy. Once you've copied the context (including any conversation messages that can help the LLM), you can use the context wherever you want.

My strategy with the app is to be the first place you go to start a conversation before you even generate code, so my focus is helping you construct contexts (the smaller the better) to feed into LLMs.

handfuloflight · 1h ago
Doesn't Claude Code do all of this automatically?
sdesol · 1h ago
I haven't looked at Claude Code, so I don't know if they have analyzers or not that understand how to extract any type of data other than the specific coding data it is trained on. Based on the runtime for some tasks, I would not be surprised if it is going through all the files and asking "is this relevant".

My tool is mainly targeted at massive code bases and enterprise as I still believe the most efficient way to build accurate context is by domain experts.

Right now, I would say 95% of my code is AI generated (98% human architectured) and I am spending about $2 a day on LLM costs and the code generation part usually never runs more than 30 seconds for most tasks.

handfuloflight · 1h ago
Well you should look at it, because it's not going through all files. I looked at your product and the workflow is essentially asking me to do manually what Claude Code does auto. Granted, manually selecting the context will probably lead to lower costs in any case because Claude Code invokes tool calls like grep to do its search, so I do see merit in your product in that respect.
sdesol · 45m ago
Looking at the code, it does have some sort of automatic discovery. I also don't know how scalable Claude Code is. I've spent over a decade thinking about code search, so I know what the limitations are for enterprise code.

One of the neat tricks that I've developed is, I would load all my backend code for my search component and then ask the LLM to trace a query and create a context bundle for only the files that are affected. Once the LLM has finished, I just need a few clicks to refine an 80,000-token window down to about 20,000 tokens.

I would not be surprised if this is one of the tricks that it does, as it is highly effective. Also, yes, my tool is manual, but I treat conversations as durable assets, so in the future you should be able to say "last week I did this, load the same files" and the LLM will know what files to bring into context.

pacoWebConsult · 35m ago
FWIW Claude code conversations are also durable. You can resume any past conversation in your project. They're stored as jsonl files within your `$HOME/.claude` directory. This retains the actual context (including your prompts, assistant responses, tool usages, etc) from that conversation, not just the files you're affecting as context.
sdesol · 25m ago
Thanks for the info. I actually want to make it easy for people to review aider, plandex, claude code, etc. conversations so I will probably look at importing them.

My goal isn't to replace the other tools, but to make them work smarter and more efficiently. I also think we will in a year or two, start measuring performance based on how developers interact with LLMs (so management will want to see the conversations). Instead of looking at code generated, the question is going to be, if this person is let go, what is the impact based on how they are contributing via their conversations.

handfuloflight · 39m ago
Excellent, I look forward to trying it out, at minimum to wean off my dependency on Claude Code and its likely current state of overspending on context. I agree with looking at conversations as durable assets.
sdesol · 22m ago
> current state of overspending on context

The thing that is killing me when I hear about Claude Code and other agent tools is the amount of energy they must be using. People say they let a task run for an hour and I can't help but think how much energy is being used, and whether Claude Code is being upfront with how much things will actually cost in the future.

kvirani · 1h ago
Wait that's not how Cursor etc work? (I made assumptions)
sdesol · 1h ago
I don't use Cursor so I can't say, but based on what I've read, they optimize for smaller context to reduce cost and probably for performance. The issue is, I think this is severely flawed as LLMs are insanely context sensitive and forgetting to include a reference file can lead to undesirable code.

I am obviously biased, but I still think to get the best results, the context needs to be human curated to ensure everything the LLM needs will be present. LLMs are probabilistic, so the more relevant context, the greater the chances the final output is the most desired.

trenchpilgrim · 1h ago
Dunno about Cursor but this is exactly how I use Zed to navigate groups of projects
gdudeman · 1h ago
A tip for those who both use Claude Code and are worried about token use (which you should be if you're stuffing 400k tokens into context even if you're on 20x Max):

  1. Build context for the work you're doing. Put lots of your codebase into the context window.
  2. Do work, but at each logical stopping point hit double escape to rewind to the context-filled checkpoint. You do not spend those tokens to rewind to that point.
  3. Tell Claude your developer finished XYZ, have it read it into context and give high level and low level feedback (Claude will find more problems with your developer's work than with yours).
If you want to have multiple chats running, use /resume and pull up the same thread. Hit double escape to the point where Claude has rich context, but has not started down a specific rabbit hole.
rvnx · 1h ago
Thank you for the tips. Do you know how to roll back the latest changes? Trying very hard to do it, but it seems like Git is the only way?
SparkyMcUnicorn · 17m ago
I haven't used it, but saw this the other day: https://github.com/RonitSachdev/ccundo
gdudeman · 45m ago
Git or my favorite "Undo all of those changes."
spike021 · 40s ago
this usually gets the job done for me as well
tankenmate · 2h ago
It's definitely good to have this as an option, but at the same time having more context reduces the quality of the output because it's easier for the LLM to get "distracted". So I wonder what will happen to the quality of code produced by tools like Claude Code if users don't properly understand the trade-off being made (if they leave it in auto mode of coding right up to the auto compact).
bachittle · 1h ago
As of now it's not integrated into Claude Code. "We’re also exploring how to bring long context to other Claude products". I'm sure they already know about this issue and are trying to think of solutions before letting users incur more costs on their monthly plans.
PickledJesus · 1h ago
Seems to be for me, I came to look at HN because I saw it was the default in CC
novaleaf · 54m ago
where do you see it in CC?
PickledJesus · 29m ago
I got a notification when I opened it, indicating that the default had changed, and I can see it on /model.

Only on a max (20x) account, not there on a Pro one.

novaleaf · 8m ago
thanks, FYI I'm on a max 20x also and I don't see it!
jasonthorsness · 2h ago
What do you recommend doing instead? I've been using Claude Code a lot but am still pretty novice at the best practices around this.
TheDong · 2h ago
Have the AI produce a plan that spans multiple files (like "01 create frontend.md", "02 create backend.md", "03 test frontend and backend running together.md"), and then create a fresh context for each step if it looks like re-using the same context is leading it to confusion.

Also, commit frequently, and if the AI constantly goes down the wrong path ("I can't create X so I'll stub it out with Y, we'll fix it later"), you can update the original plan with wording to tell it not to take that path ("Do not ever stub out X, we must make X work"), and then start a fresh session with an older and simpler version of the code and see if that fresh context ends up down a better path.

You can also run multiple attempts in parallel if you use tooling that supports that (containers + git worktrees is one way)

wongarsu · 1h ago
Changing the prompt and rerunning is something where Cursor still has a clear edge over Claude Code. It's such a powerful technique for keeping the context small because it keeps the context clear of back-and-forths and dead ends. I wish it was more universally supported
abound · 35m ago
I do this all the time in Claude Code, you hit Escape twice and select the conversation point to 'branch' from.
F7F7F7 · 1h ago
Inevitably the files become a mess of their own. Changes and learnings from one part of the plan often don't result in adaptation of impacted plans down the chain.

In the end you have a mish mash of half implemented plans and now you’ve lost context too. Which leads to blowing tokens on trying to figure out what’s been implemented, what’s half baked, and what was completely ignored.

Any links to anyone who’s built something at scale using this method? It always sounds good on paper.

I’d love to find a system that works.

nzach · 1h ago
In my experience it works better if you create one plan at a time. Create a prompt, have claude implement it, and then make sure it is working as expected. Only then do you ask for something new.

I've created an agent to help me create the prompts, it goes something like this: "You are an Expert Software Architect specializing in creating comprehensive, well-researched feature implementation prompts. Your sole purpose is to analyze existing codebases and documentation to craft detailed prompts for new features. You always think deeply before giving an answer...."

My workflow is: 1) use this agent to create a prompt for my feature; 2) ask claude to create a plan for the just created prompt; 3) ask claude to implement said plan if it looks good.

cube00 · 37m ago
>You always think deeply before giving an answer...

Nice try but they're not giving you the "think deeper" level just because you asked.

dpe82 · 26m ago
Actually that's exactly how you do it.
brandall10 · 1h ago
My system is to create detailed feature files up to a few hundred lines in size that are immutable, and then have a status.md file (preferably kept to about 50 lines) that links to a current feature that is used as a way to keep track of the progress on that feature.

Additionally I have a Claude Code command with instructions referencing the status.md, how to select the next task, how to compact status.md, etc.

Every time I'm done with a unit of work from that feature - always triggered w/ ultrathink - I'll put up a PR and go through the motions of extra refactors/testing. For more complex PRs that require many extra commits to get prod ready I just let the sessions auto-compact.

After merging I'll clear the context and call the CC command to progress to the next unit of work.

This allows me to put up to around 4-5 meaningful PRs per feature if it's reasonably complex while keeping the context relatively tight. The current project I'm focused on is just over 16k LOC in swift (25k total w/ tests) and it seems to work pretty well - it rarely gets off track, does unnecessary refactors, destroys working features, etc.

nzach · 30m ago
Care to elaborate on how you use the status.md file? What exactly you put in there, and what value does it bring?
agotterer · 1h ago
I use the main Claude code thread (I don’t know what to call it) for planning and then explicitly tell Claude to delegate certain standalone tasks out to subagents. The subagents don’t consume the main threads context window. Even just delegating testing, debugging, and building will save a ton context.
joduplessis · 9m ago
As far as coding goes, Claude seems to be the most competent right now; I like it. GPT-5 is abysmal - I'm not sure if it's bugs or what, but the new release takes a good few steps back. Gemini is still hit and miss - and Grok seems to be a poor man's Claude (the code is kind of okay, a bit buggy, and somehow similar to Claude's).
firasd · 1h ago
A big problem with the chat apps (ChatGPT, Claude.ai) is the weird context window hijinks. ChatGPT especially does wild stuff: sudden truncation, summarization, reinjecting 'ghost snippets', etc.

I was thinking this should be up to the user (do you want to continue this conversation with context rolling out of the window, or start a new chat?), but now I realize this is inevitable given the way pricing tiers and limited compute work. The only way to get full context is to use developer tools like Google AI Studio or a chat app that wraps the API.

With a custom chat app that wraps the API you can even inject the current timestamp into each message and ask the LLM to add a new row to a markdown table every 10 minutes, summarizing that 10-minute chunk.
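
Something like this is what I mean - a minimal sketch assuming the Anthropic Python SDK, with a placeholder model id and system prompt:

    # Minimal sketch of a wrapper that stamps each user turn with the current time,
    # so the model can notice elapsed time and summarize 10-minute chunks itself.
    from datetime import datetime, timezone
    import anthropic

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
    history = []

    def send(user_text: str) -> str:
        stamped = f"[{datetime.now(timezone.utc).isoformat()}] {user_text}"
        history.append({"role": "user", "content": stamped})
        response = client.messages.create(
            model="claude-sonnet-4-20250514",  # assumed model id, check the docs
            max_tokens=1024,
            system=("Each message is prefixed with a UTC timestamp. Every 10 minutes "
                    "of elapsed time, add a new row to a markdown table summarizing "
                    "that 10-minute chunk of the conversation."),
            messages=history,
        )
        reply = response.content[0].text
        history.append({"role": "assistant", "content": reply})
        return reply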

isoprophlex · 2h ago
1M of input... at $6/1M input tokens. Better hope it can one-shot your answer.
elitan · 11m ago
have you ever hired humans?
simianwords · 1h ago
How does "supporting 1M tokens" really work in practice? Is it a new model? Or did they just remove some hard coded constraint?
eldenring · 30m ago
Serving a model efficiently at 1M context is difficult and could be much more expensive/numerically tricky. I'm guessing they were working on serving it properly, since it's the same "model" in scores and such.
simianwords · 28m ago
Thanks - still not clear what they did really. Some inference time hacks?
xnx · 1h ago
1M context windows are not created equal. I doubt Claude's recall is as good as Gemini's 1M context recall. https://cloud.google.com/blog/products/ai-machine-learning/t...
xnx · 1h ago
Good analysis here: https://news.ycombinator.com/item?id=44878999

> the model that’s best at details in long context text and code analysis is still Gemini.

> Gemini Pro and Flash, by comparison, are far cheaper

dang · 40m ago
Related ongoing thread:

Claude vs. Gemini: Testing on 1M Tokens of Context - https://news.ycombinator.com/item?id=44878999 - Aug 2025 (9 comments)

ffitch · 41m ago
I wonder how modern models fare on NovelQA and FLenQA (benchmarks that test the ability to understand long context beyond needle-in-a-haystack retrieval). The only such test on a reasoning model that I found was done on o3-mini-high (https://arxiv.org/abs/2504.21318); it suggests that reasoning noticeably improves FLenQA performance, but this test only explored context up to 3,000 tokens.
siva7 · 11m ago
Ah, so Claude Code on subscription will become a crippled version
falcor84 · 2h ago
Strange that they don't mention whether that's enabled or configurable in Claude Code.
CharlesW · 2h ago
From a co-marketing POV, it's considered best practice to not discuss home-grown offerings in the same or similar category as products from the partners you're featuring.

It's likely they'll announce this week, albeit possibly just within the "what's new" notes that you see when Claude Code is updated.

csunoser · 2h ago
They don't say it outright. But I think it is not in Claude Code yet.

> We’re also exploring how to bring long context to other Claude products. - Anthropic

That is, any other product that is not Anthropic API tier 4 or Amazon Bedrock.

farslan · 2h ago
Yeah same, I'm curious about this. I would guess it's by default enabled with Claude Code.
pmxi · 35m ago
The reason I initially got interested in Claude was because they were the first to offer a 200K token context window. That was massive in 2023. However, they didn't keep up once Gemini offered a 1M token window last year.

I'm glad to see an attempt to return to having a competitive context window.

qsort · 2h ago
I won't complain about a strict upgrade, but that's a pricy boi. Interesting to see differential pricing based on size of input, which is understandable given the O(n^2) nature of attention.
lherron · 1h ago
Wow, I thought they would feel some pricing pressure from GPT5 API costs, but they are doubling down on their API being more expensive than everyone else.
sebzim4500 · 1h ago
I think it's the right approach; the cost of running these things as coding assistants is negligible compared to the benefit of even a slight model improvement.
AtNightWeCode · 24m ago
GPT5 API uses more tokens for answers of the same quality as previous versions. Fell into that trap myself. I use both Claude and OpenAI right now. Will probably drop OpenAI since they are obviously not to be trusted considering the way they do changes.
varyherb · 1h ago
I believe this can be configured in Claude Code via the following environment variable:

ANTHROPIC_BETAS="context-1m-2025-08-07" claude
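
For the raw API, the same flag seems to go through as a beta header - a minimal sketch assuming the Anthropic Python SDK and the flag name above:

    import anthropic

    client = anthropic.Anthropic()
    response = client.messages.create(
        model="claude-sonnet-4-20250514",  # assumed model id
        max_tokens=1024,
        messages=[{"role": "user", "content": "..."}],
        # pass the 1M-context beta flag from above as an anthropic-beta header
        extra_headers={"anthropic-beta": "context-1m-2025-08-07"},
    )
    print(response.usage.input_tokens)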

Someone1234 · 1h ago
Before this they supposedly had a longer context window than ChatGPT, but I have workloads that abuse the heck out of context windows (100-120K tokens). ChatGPT genuinely seems to have a 32K context window, in the sense that it legitimately remembers/can utilize everything within that window.

Claude previously had "200K" context windows, but during testing it wouldn't even hit a full 32K before hitting a wall/it forgetting earlier parts of the context. They also have extremely short prompt limits relative to the other services around, making it hard to utilize their supposedly larger context windows (which is suspicious).

I guess my point is that with Anthropic specifically, I don't trust their claims because that has been my personal experience. It would be nice if this "1M" context window now allows you to actually use 200K though, but it remains to be seen if it can even do that. As I said with Anthropic you need to verify everything they claim.
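
One crude way to verify it yourself is a needle-in-a-haystack test - a sketch assuming the Anthropic Python SDK, with filler text and a planted fact:

    import anthropic

    client = anthropic.Anthropic()
    filler = "The sky was grey and the road was long. " * 4000  # tens of thousands of tokens
    needle = "The secret launch code is PINEAPPLE-42."
    depth = len(filler) // 3  # bury the fact a third of the way in
    haystack = filler[:depth] + needle + filler[depth:]

    response = client.messages.create(
        model="claude-sonnet-4-20250514",  # assumed model id
        max_tokens=100,
        messages=[{"role": "user",
                   "content": haystack + "\n\nWhat is the secret launch code?"}],
    )
    print(response.content[0].text)  # should mention PINEAPPLE-42 if recall holds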

Etheryte · 1h ago
Strong agree, Claude is very quick to forget things like "don't do this", "never do this" or things it tried that were wrong. It will happily keep looping even in very short conversations, completely defeating the purpose of using it. It's easy to game the numbers, but it falls apart in the real world.
chrisweekly · 53m ago
Peer of this post currently also on HN front page, comparing perf for Claude vs Gemini, w/ 1M tokens: https://news.ycombinator.com/item?id=44878999
film42 · 1h ago
The 1M token context was Gemini's headlining feature. Now the only thing I'd like Claude to work on is how many tokens it bills for document processing. Gemini will often bill 1/10th the tokens Anthropic does for the same document.
jbellis · 1h ago
Just completed a new benchmark that sheds some light on whether Anthropic's premium is worth it.

(Short answer: not unless your top priority is speed.)

https://brokk.ai/power-rankings

rcanepa · 48m ago
I recently switched to the $200 CC subscription and I think I will stay with it for a while. I briefly tested whatever version of ChatGPT 5 comes with the free Cursor plan and it was unbearably slow. I could not really code with it as I was constantly getting distracted while waiting for a response. So, speed matters a lot for some people.
24xpossible · 1h ago
Why no Grok 4?
mettamage · 2h ago
Shame it's only the API. Would've loved to see it via the web interface on claude.ai itself.
minimaxir · 2h ago
Can you even fit 200+k tokens worth of context in the web interface? IMO Claude's API workbench is the worst of the three major providers.
data-ottawa · 1h ago
When working on artifacts, it definitely can after a few change requests.
mettamage · 2h ago
Via text files right? Just drag and drop.
77pt77 · 1h ago
Even if you can't, a conversation can easily get larger than that.
fblp · 1h ago
I assume this will mean that long chats continue to get the "prompt is too long" error?
greenfish6 · 2h ago
Yes, but if you look in the rate limit notes, the rate limit is 500k tokens/minute for tier 4, which we are on. Given how stingy Anthropic has been with rate limit increases, this is for very few people right now.
alienbaby · 1h ago
The fracturing of all the models offered across providers is annoying. The number of different models, and the fact that a given model will have different capabilities from different providers, is ridiculous.
thimabi · 1h ago
Oh, well, ChatGPT is being left in the dust…

When done correctly, having one million tokens of context window is amazing for all sorts of tasks: understanding large codebases, summarizing books, finding information on many documents, etc.

Existing RAG solutions fill a void up to a point, but they lack the precision that large context windows offer.

I’m excited for this release and hope to see it soon on the UI as well.

OutOfHere · 1h ago
Fwiw, OpenAI does have a decent active API model family of GPT-4.1 with a 1M context. But yes, the context of the GPT-5 models is terrible in comparison, and it's altogether atrocious for the GPT-5-Chat model.

The biggest issue in ChatGPT right now is a very inconsistent experience, presumably due to smaller models getting used even for paid users with complex questions.

pupppet · 2h ago
How does anyone send these models that much context without them tripping over themselves? I can't get anywhere near that much before it starts losing track of instructions.
9wzYQbTYsAIc · 2h ago
I’ve been having decent luck telling it to keep track of itself in a .plan file, not foolproof, of course, but it has some ability to “preserve context” between contexts.

Right now I’m experimenting with using separate .plan files for tracking key instructions across domains like architecture and feature decisions.

CharlesW · 1h ago
> I’ve been having decent luck telling it to keep track of itself in a .plan file, not foolproof, of course, but it has some ability to “preserve context” between contexts.

This is the way. Not only have I had good luck with both a TASKS.md and TASKS-COMPLETE.md (for history), but I have an .llm/arch full of AI-assisted, for-LLM .md files (auth.md, data-access.md, etc.) that document architecture decisions made along the way. They're invaluable for effectively and efficiently crossing context chasms.

olddustytrail · 1h ago
I think it's key to not give it contradictory instructions, which is an easy mistake to make if you forget where you started.

As an example, I know of an instance where the LLM claimed it had tried a test on its laptop. This obviously isn't true so the user argued with it. But they'd originally told it that it was a Senior Software Engineer so playing that role, saying you tested locally is fine.

As soon as you start arguing with those minor points you break the context; now it's both a Software Engineer and an LLM. Of course you get confused responses if you do that.

pupppet · 1h ago
The problem I often have is instructions like:

General instruction: - Do "ABC"

If condition == whatever: - Do "XYZ" instead

I have a hard time making the AI obey the cases where I want it to override my own instruction, and without full control of the input context I can't just modify my 'General instruction' on a case-by-case basis to avoid contradicting myself.

ramoz · 1h ago
Awesome addition to a great model.

The best interface for long context reasoning has been AIStudio by Google. Exceptional experience.

I use Prompt Tower to create long context payloads.

shamano · 1h ago
1M tokens is impressive, but the real gains will come from how we curate context—compact summaries, per-repo indexes, and phase resets. Bigger windows help; guardrails keep models focused and costs predictable.
DiabloD3 · 50m ago
Neat. I do 1M-token context locally, entirely with a single GPU and FOSS software, and have access to a wide range of models of equivalent or better quality.

Explain to me, again, how Anthropic's flawed business model works?

ZeroCool2u · 1h ago
It's great they've finally caught up, but unfortunate it's on their mid-tier model only and it's laughably expensive.
faangguyindia · 2h ago
In my testing the gap between Claude and Gemini Pro 2.5 is small. My company is in Asia Pacific and we can't get access to Claude via Vertex for some stupid reason.

But I tested it via other providers; the gap used to be huge, but not anymore.

film42 · 1h ago
Agreed, but pricing-wise Gemini 2.5 Pro wins. Gemini input tokens are half the cost of Claude 4's, and output is $5/million cheaper than Claude's. On top of that, document processing is significantly cheaper: a 5MB PDF (customer invoice) with Gemini is like 5k tokens vs 56k with Claude.

The only downside with Gemini (and it's a big one) is availability. We get rate limited by their dynamic QoS all the time, even if we haven't reached our quota. Our GCP sales rep keeps recommending "provisioned throughput," but it's both expensive and doesn't fit our workload type. Plus, the Vertex AI SDK is kind of a PITA compared to Anthropic's.

penguin202 · 2h ago
Claude doesn't have a mid-life crisis and try to `rm -rf /` or delete your project.
Tostino · 2h ago
For me the gap is pretty large (in Gemini Pro 2.5's favor).

For reference, the code I am working on is a Spring Boot / (Vaadin) Hilla multi-module project with helm charts for deployment and a separate Python based module for ancillary tasks that were appropriate for it.

I've not been able to get any good use out of Sonnet in months now, whereas Gemini Pro 2.5 has (still) been able to grok the project well enough to help out.

jona777than · 2h ago
I initially found Gemini Pro 2.5 to work well for coding. Over time, I found Claude to be more consistently productive. Gemini Pro 2.5 became my go-to for use cases benefitting from larger context windows. Claude seemed to be the safer daily driver (if I needed to get something done.)

All that being said, Gemini has been consistently dependable when I had asks that involved large amounts of code and data. Claude and the OpenAI models struggled with some tasks that Gemini responsively satisfied seemingly without "breaking a sweat."

Lately, it's been GPT-5 for brainstorming/planning, Claude for hammering out some code, Gemini when there is huge data/code requirements. I'm curious if the widened Sonnet 4 context window will change things.

faangguyindia · 1h ago
When Gemini 2.5 Pro gets stuck, I often use DeepSeek R1 in architect mode and Qwen3 in coder mode in Aider, and that solves the problem.

Last month I ran into a wicked dependency bug and only ChatGPT could solve it, which I'm guessing is because it has fresh data from GitHub?

On the other hand, I really need a tool like Aider where I can use various models in "architect" and "coder" mode.

What I've found is that better reasoning models tend to be bad at writing actual code, while models like Qwen3 Coder seem better at it.

DeepSeek R1 will not write reliable code, but it will reason well and map out the path forward.

I wouldn't be surprised if Sonnet's success came from doing EXACTLY this behind the scenes.

But now I am looking for pure models that don't use this black-magic hack behind the API.

I want more control at the tool end, where I can alter the prompts and achieve the results I want.

This is one reason I don't use Claude Code, etc.

Aider is 80% of what I want; I wish it had more of what I want, though.

I just don't know why no one has built a perfect solution to this yet.

Here are the things I am missing in Aider:

1. Automatic model switching: use different models for asking questions about the code, planning a feature, and writing the actual code.

2. Determining on its own whether a feature needs a "reasoning" model or whether a coding model will suffice.

3. Being able to manage context more selectively: send only what's needed and drop the files we don't need. Intelligently add the files a feature will touch up front, instead of doing all the code planning, then being asked to add files, and then doing it all over again with more context available.

llm_nerd · 2h ago
Opus 4.1 is a much better model for coding than Sonnet. The latter is good for general queries / investigations or to draw up some heuristics.

I have paid subscriptions to both Gemini Pro and Claude. Hugely worthwhile expense professionally.

kotaKat · 1h ago
A million tokens? Damn, I’m gonna need a lot of quarters to play this game at Chuck-E-Cheese.
penguin202 · 2h ago
But will it remember any of it, and stop creating new redundant files when it can't find or understand what it's looking for?
logicchains · 25m ago
With that pricing I can't imagine why anyone would use Claude Sonnet through the API when Gemini 2.5 Pro is both better and cheaper (especially at long-context understanding).
lvl155 · 1h ago
The only time this is useful is doing an init on a sizable code base or dumping a "big" CSV.
tosh · 1h ago
How did they do the 1M context window?

Same technique as Qwen? As Gemini?

rootnod3 · 2h ago
So more tokens means better, but at the same time more tokens means the model distracts itself too much along the way. It's an improvement, but also potentially detrimental. How is that beneficial in any capacity? What was said last week? Embrace AI or leave?

All I see so far is: don't embrace and stay.

rootnod3 · 1h ago
So, I see this got downvoted. Instead of just downvoting, I would prefer to have a counter-argument. Honestly. I am on the skeptic side of LLM, but would not mind being turned to the other side with some solid arguments.
henriquegodoy · 1h ago
It's incredible to see how AI models are improving; I'm really happy with this news. (IMO it's more impactful than the release of GPT-5.) Now we need more tokens per second, and then the self-improvement of the models will accelerate.
alvis · 2h ago
A context window beyond a certain size doesn't bring much benefit, just a higher bill. If the model still keeps forgetting instructions, it's too easy to end up with long conversations with higher context consumption and hence a higher bill.

I'd rather have an option to limit the context size.

EcommerceFlow · 2h ago
It does if you're working with bigger codebases. I've found copy/pasting my entire codebase + adding a <task> works significantly better than Cursor.
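
Roughly, a small script that concatenates the repo into one blob does the job - a sketch, assuming Python and a hard-coded ignore list:

    import pathlib

    IGNORE_DIRS = {".git", "node_modules", "dist", "build", "__pycache__", ".venv"}
    EXTENSIONS = {".py", ".ts", ".tsx", ".js", ".md", ".json", ".yaml"}

    def dump_repo(root: str) -> str:
        # concatenate every source file, with a header line so the model knows paths
        parts = []
        for path in sorted(pathlib.Path(root).rglob("*")):
            if (path.is_file()
                    and path.suffix in EXTENSIONS
                    and not any(d in path.parts for d in IGNORE_DIRS)):
                parts.append(f"--- {path} ---\n{path.read_text(errors='ignore')}")
        return "\n\n".join(parts)

    if __name__ == "__main__":
        blob = dump_repo(".")
        pathlib.Path("codebase.txt").write_text(blob)  # paste this plus a <task> block
        print(f"{len(blob)} chars (~{len(blob) // 4} tokens)")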
spiderice · 23m ago
How does one even copy their entire codebase? Are you saying you attach all the files? Or you use some script to copy all the text to your clipboard? Or something else?
whalesalad · 40m ago
My first thought was "gg no re". Can't wait to see how this changes compaction requirements in Claude Code.
nickphx · 1h ago
Yay, more room for stray cats.
andrewstuart · 2h ago
Oh man finally. This has been such a HUGE advantage for Gemini.

Could we please have zip files too? ChatGPT and Gemini both unpack zip files via the chat window.

Now how about a button to download all files?

deadbabe · 1h ago
Unfortunately, larger context isn't really the answer after a certain point. Small, focused context is better; lazily throwing a bunch of tokens in as context is going to yield bad results.
rafaelero · 2h ago
god they keep raising prices
revskill · 1h ago
The critical issue where LLMs never beat humans: they break what already worked.
artursapek · 2h ago
Eagerly waiting for them to do this with Opus
irthomasthomas · 2h ago
Imagine paying $20 a prompt?
datadrivenangel · 1h ago
Depending on how many prompts per hour you're looking at, that's probably the same order of magnitude as expensive SaaS. A fancy CRM seat can be ~$2000 per month (or more), which, assuming 50 hours per week x 4 weeks per month, is $10 per hour ($2000/200 hours). A lot of money, but if it makes your sales people more productive, it's a good investment. Assuming you're paying your sales people say $240K per year ($20,000 per month), then the SaaS cost is 10% of their salary.

This explains Datadog pricing. Maybe it gives a preview of future AI pricing.

artursapek · 1h ago
If I can give it a detailed spec, walk away and do something else for 20 minutes, and come back to work that would have taken me 2 hours, then that's a steal.
throwaway888abc · 2h ago
holy moly! awesome
1xer · 2h ago
moaaaaarrrr
markb139 · 33m ago
I've tried two AI tools recently. Neither could produce correct code to calculate the CPU temperature on a Raspberry Pi RP2040. The code ran, looked OK, and even produced reasonable-looking results - until I put a finger on the chip and thus raised the temp: the calculated temperature went down. As an aside, the free version of ChatGPT didn't know about anything newer than 2023, so it couldn't tell me about the RP2350.
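
For reference, the datasheet approach is only a few lines - a minimal MicroPython sketch (ADC channel 4 is the on-die sensor, and the slope is negative, so the sign is easy to get backwards):

    import machine, time

    sensor = machine.ADC(4)          # RP2040 ADC channel 4 = internal temperature sensor
    CONVERSION = 3.3 / 65535         # read_u16() maps 0..65535 onto 0..3.3 V

    while True:
        voltage = sensor.read_u16() * CONVERSION
        # Datasheet formula: the sensor voltage *falls* as the die heats up,
        # so the term is subtracted; flipping the sign makes readings move backwards.
        temp_c = 27 - (voltage - 0.706) / 0.001721
        print("{:.1f} C".format(temp_c))
        time.sleep(1)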
anvuong · 30m ago
How can you be sure putting a finger on the chip raises the temp? If it feels hot, that means heat from the chip is being transferred to your finger; that may decrease the temp, no?
broshtush · 27m ago
From my understanding, putting your finger on an uncooled CPU acts like a passive cooler, thus actually decreasing the temperature.
ghjv · 22m ago
Wouldn't your finger have acted as a heat sink, lowering the temp? Sounds like the program may have worked correctly. Could be worth trying again with a hot enough piece of metal instead of your finger.
fwip · 27m ago
I don't think a larger context window would help with that.