I made 4000 agent calls in Cursor last month. Each model has a personality

The lazy architect (OpenAI’s o3). o3 is incredibly lazy at writing code, but very good at planning. It will happily read tens of files and do deep analysis, but it often struggles when it needs to edit more than one file.

The over-eager child (Claude Sonnet 3.7 Thinking). Claude Sonnet is eager to just get going, man! It’s not the most careful, and in longer strings of tool calls it may start editing something completely unrelated to what you asked for.

Pretty balanced? (Gemini 2.5 Pro). Gemini 2.5 is a little more intelligent than Sonnet 3.7, and significantly faster and more reserved. Usually the best choice for writing code in multiple files.

I’ve found o4-mini to be incredibly slow and fairly mediocre, and GPT 4.1 useful only in specific situations. My tips:

- Use o3 to plan and/or write code in one or at most two files. If you ask for more, it may openly revolt and just refuse to write any longer.

- Always make sure Sonnet 3.7 is following a tightly scoped plan on a relatively small section of the product, and supervise it. If you have an easy change to make in many areas of your codebase, for example, letting Sonnet run, still supervised, is a perfect use of the model’s persona.

Generally what I do:

- Medium complexity: editing one file: o3. Editing multiple files: plan with o3, write with gemini-2.5

- Simple complexity: editing many files, very simple: plan with o3 if needed, write with claude-3.7. Editing many files, simple, needs formulaic approach: write a detailed prompt for GPT 4.1

- High complexity: plan with o3, separate into multiple chunks, write small chunks at a time with gemini-2.5, and be very careful with each section. If I'm super lazy, sometimes I just YOLO all of the sections and then fix all the bugs at the end, but this probably leads to code issues later down the line.
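For anyone who wants the routing above as a lookup, here is a rough Python sketch. It only encodes the rules of thumb from this post; the `Complexity` enum, the `pick_models` function, and its parameters are hypothetical names made up for illustration, and the simple single-file case is not covered by the post, so the fallback there is an assumption.

```python
from enum import Enum
from typing import Optional, Tuple


class Complexity(Enum):
    SIMPLE = "simple"
    MEDIUM = "medium"
    HIGH = "high"


def pick_models(
    complexity: Complexity,
    files_touched: int,
    formulaic: bool = False,
) -> Tuple[Optional[str], str]:
    """Return (planner, writer) following the rules of thumb in the post.

    A planner of None means no separate planning step is suggested.
    """
    if complexity is Complexity.HIGH:
        # Plan with o3, then write small chunks one at a time.
        return ("o3", "gemini-2.5")
    if complexity is Complexity.MEDIUM:
        if files_touched <= 1:
            # Single file: o3 does the whole job.
            return (None, "o3")
        return ("o3", "gemini-2.5")
    # Simple changes. The post only describes the many-files case;
    # using the same routing for a single file is an assumption.
    if formulaic:
        # Detailed prompt, no planner.
        return (None, "gpt-4.1")
    # Planning with o3 is optional here ("if needed").
    return ("o3", "claude-3.7")


if __name__ == "__main__":
    print(pick_models(Complexity.MEDIUM, files_touched=3))
```

The model names are just the labels used in the post, not real API model identifiers.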

Would love to hear how other people are using the different models!

Comments (5)

joegibbs · 32d ago

I really like the code that Gemini 2.5 Pro writes but it tends to stop for no reason and needs to be reprompted to start again. I'm not sure why this is. Also, what's the difference between 2.5 Pro and 2.5 Pro Max? Or Claude 3.7 and 3.7 Max?

Aside: it would be good for Cursor to add something to tell their agents not to run tool calls that run forever (like test watchers). I add this in my .mdc files but I think it would be a good default so that it can run tests, update the code, run them again until it works.

mike210 · 32d ago

I sometimes, but rarely, turn on Max for Gemini for more context in long conversations. The tool use (5 cents per tool call) can get pretty ridiculous on Claude 3.7 Sonnet Max, and I've had calls that cost ~$2.

muzani · 31d ago

Sonnet 3.5 has a very different personality. It's less skilled, but often I opt for it because of the personality.

DeepSeek is actually pretty good and underappreciated too. It feels unreliable, though. The downside is tool use, but I prefer it over o3.

mike210 · 29d ago

Interesting - what kinds of tasks do you reach for 3.5 for?

muzani · 28d ago

Pretty much whatever you're using 3.7 for. You don't need as tight a scope. It does easy things well.

A situation I had yesterday: we had two dropdowns. For simplicity, let's say the first is country. When you pick a country, the second shows its states. When you select a state, then change country, it crashes because the state doesn't exist in the new country.

The standard solution is simple – just make it reset to null when switching countries, or better yet, check whether the selected state exists in the new country. But the thinking models will overengineer the hell out of this. They'll check from the deep service level when these checks can be made just below the view layer.
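To make the "just below the view layer" fix concrete, here is a minimal Python sketch of the check. The `STATES_BY_COUNTRY` data and the `state_after_country_change` helper are hypothetical, made up for this example; the real app's data and framework aren't shown here.

```python
from typing import Dict, List, Optional

# Hypothetical sample data standing in for whatever the app loads.
STATES_BY_COUNTRY: Dict[str, List[str]] = {
    "US": ["California", "Texas"],
    "AU": ["Queensland", "Victoria"],
}


def state_after_country_change(
    new_country: str,
    selected_state: Optional[str],
    states_by_country: Dict[str, List[str]] = STATES_BY_COUNTRY,
) -> Optional[str]:
    """Keep the selected state only if it exists in the new country.

    Otherwise reset the selection to None, so the state dropdown never
    holds a value that is missing from its own option list.
    """
    valid_states = states_by_country.get(new_country, [])
    if selected_state in valid_states:
        return selected_state
    return None


if __name__ == "__main__":
    # Texas does not exist in AU, so the selection is cleared.
    print(state_after_country_change("AU", "Texas"))
```

The country-change handler would call this once and assign the result to the state selection; nothing at the service level needs to know about it.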
