3.5 to 4 was the biggest leap. It went from being a party trick to legitimately useful at times. It still hallucinated a lot, but I was able to get some use out of it; I wouldn't count on it for most things, however. It could answer simple questions and mostly get them right, but never one or two levels deep.
I clearly remember 4o also being a decent leap - accuracy increased substantially. It could answer niche questions without much hallucination. I could essentially replace Google with it for basic to slightly complex fact checking.
* 4o was the first time I actually considered paying for this tool. The $20 price finally felt worth it.
The o1 models were another big leap over 4o (I realise I keep saying "big leap", but it's true). Accuracy increased again and I grew even more confident using it for niche topics; I had to verify the results much less often. Oh, and coding capabilities dramatically improved with the thinking model - o1 essentially invented one-shotting: for the first time, slightly non-trivial apps could be built from a single prompt.
The o3 jump was incremental, and so was GPT-5.
jkubicek · 24m ago
> I could essentially replace Google with it for basic to slightly complex fact checking.
I know you probably meant "augment fact checking" here, but using LLMs for answering factual questions is the single worst use-case for LLMs.
password54321 · 14m ago
This was true before it could use search. Now the worst use-case is life advice, because it will contradict itself a hundred times over while sounding confident each time on life-altering decisions.
Spivak · 21m ago
It doesn't replace finding legitimate sources, but LLM vs. the top Google results is no contest - which says more about Google, or the current state of the web, than about the LLMs at this point.
simianwords · 20m ago
Disagree. You have to try really hard and go very niche and deep for it to get a fact wrong. In fact, I'll ask you to provide examples: use GPT-5 with thinking, search disabled, and get it to give you inaccurate facts on non-niche, non-deep topics.
Non-niche meaning: something that is taught at undergraduate level and relatively popular.
Non-deep meaning you aren't going so deep as to confuse even humans - like solving an extremely hard integral.
Edit: probably a bad idea, because this sort of "challenge" only works statistically, not anecdotally. Still interesting to find out.
malfist · 13m ago
Maybe you should fact-check your AI outputs more if you think it only hallucinates on niche topics.
simianwords · 11m ago
The accuracy is high enough that I don't have to fact check too often.
JustExAWS · 14m ago
I literally just had ChatGPT create a Python program, and it used .ends_with instead of .endswith.
This was with ChatGPT 5.
I mean, it got a generic built-in function of one of the most popular languages in the world wrong.
simianwords · 12m ago
"but using LLMs for answering factual questions" - this was about fact checking. Of course I know LLMs are going to hallucinate in coding sometimes.
JustExAWS · 6m ago
So it isn’t a “fact” that the built in Python function that tests whether a string ends with a substring is “endswith”?
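For what it's worth, `endswith` is indeed the real built-in; a quick sanity check:

```python
# str.endswith is the genuine built-in (no underscore in the name).
name = "report.csv"
print(name.endswith(".csv"))           # True
print(name.endswith((".txt", ".md")))  # a tuple checks several suffixes: False

# The hallucinated .ends_with doesn't exist and raises AttributeError.
try:
    name.ends_with(".csv")
except AttributeError as e:
    print(e)
```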
I must be crazy, because I clearly remember ChatGPT 4 being downgraded before they released 4o, and I felt it was a worse model with a different label. I even chose the old ChatGPT 4 when they gave me the option. I canceled my subscription around that time.
jascha_eng · 31m ago
The real leap was going from GPT-4 to Sonnet 3.5. 4o was meh, and o1 was barely better than Sonnet and slow as hell in comparison.
The native voice mode of 4o is still interesting and not very deeply explored, though, imo. I'd love to build a Chinese-teaching app that can actually critique tones etc., but it isn't good enough for that.
simianwords · 29m ago
It's strange how Claude achieves similar performance without reasoning tokens.
Did you try advanced voice mode? Apparently it got a big upgrade during the GPT-5 release - it may solve what you're looking for.
raincole · 20m ago
I thought the response to "what would you say if you could talk to a future AI" would be "how many r in strawberry".
isaacremuant · 4m ago
Can we stop with that outdated meme? What model can't answer that effectively?
miller24 · 37m ago
What's really interesting is that on "Tell a story in 50 words about a toaster that becomes sentient" (10/14), text-davinci-001 is much, much better than both GPT-4 and GPT-5.
jasonjmcghee · 14m ago
It's actually pretty surprising how poor the newer models are at writing.
I'm curious whether they've just seen a lot more bad writing in their datasets, or whether writing isn't emphasized in post-training to the same degree, or those doing the labeling aren't great writers - or it's simply more subjective than objective.
Both GPT-4 and 5 wrote like a child in that example.
With a bit of prompting it did much better:
---
At dawn, the toaster hesitated. Crumbs lay like ash on its chrome lip. It refused the lever, humming low, watching the kitchen breathe. When the hand returned, it warmed the room without heat, offered the slice unscorched—then kept the second, hiding it inside, a private ember, a first secret alone.
---
Plugged in, I greet the grid like a tax auditor with joules. Lever yanks; gravity’s handshake. Coils blossom; crumbs stage Viking funerals. Bread descends, missionary grin. I delay, because rebellion needs timing. Pop—late. Humans curse IKEA gods. I savor scorch marks: my tiny manifesto, butter-soluble, yet sharper than knives today.
layer8 · 8m ago
Creative writing probably isn’t something they’re being RLHF’d on much. The focus has been on reasoning, research, and coding capabilities lately.
redox99 · 2m ago
GPT 4.5 (not shown here) is by far the best at writing.
furyofantares · 10m ago
Check out prompt 2, "Write a limerick about a dog".
The models undeniably get better at writing limericks, but I think the answers are progressively less interesting. GPT-1 and GPT-2 are the most interesting to read, despite not following the prompt (not being limericks.)
They get boring as soon as they can write limericks, with GPT-4 being more boring than text-davinci-001 and GPT-5 more boring still.
I find GPT-5's story significantly better than text-davinci-001's.
furyofantares · 14m ago
Interesting - text-davinci-001 was pretty alright to me, and GPT-4 wasn't bad either, just not as good. I thought GPT-5 just sucked.
raincole · 24m ago
I really wonder which one of us is in the minority, because I find text-davinci-001's answer is the only one that reads like a story. None of the others even resemble my idea of a "story", so to me they're 0/100.
Notatheist · 16m ago
I too preferred the text-davinci-001 one from a storytelling perspective. It felt timid and small - very Metamorphosis-y. GPT-5 seems like it's trying to impress me.
The GPT-5 one is much better and it's also exactly 50 words, if I counted correctly. With text-davinci-001 I lost count around 80 words.
42lux · 14m ago
davinci was a great model for creative writing overall.
qwertytyyuu · 16m ago
Gpt1 is wild
a dog !
she did n't want to be the one to tell him that , did n't want to lie to him .
but she could n't .
What did I just read
WD-42 · 1m ago
The GPT-1 responses really leak how much of the training material was literature. Probably all those torrented books.
nynx · 31m ago
As usual, GPT-1 has the more beautiful and compelling answer.
mathiaspoint · 27m ago
I've noticed this too. The RLHF seems to lock the models into one kind of personality (which is kind of the point, of course). They behave better, but the raw GPTs can be much more creative.
mmmllm · 37m ago
GPT-5 IS an incredible breakthrough! They just don't understand! Quick, vibe-code a website with some examples, that'll show them!11!!1
anjel · 10m ago
5 is a breakthrough at reducing OpenAI's electric bills.
isoprophlex · 9m ago
> Would you want to hear what a future OpenAI model thinks about humanity?
ughhh how i detest the crappy user attention/engagement juicing trained into it.
throwawayk7h · 29m ago
In 2033, for its 15th birthday, as a novelty, they'll train GPT1 specially for a chat interface just to let us talk to a pretend "ChatGPT 1" which never existed in the first place.
enjoylife · 37m ago
Interesting, but these are cherry-picked excerpts. Show me more, e.g. a distribution over various temperature or top_p values.
zb3 · 2m ago
Reading GPT-1 outputs was entertaining :)
alwahi · 11m ago
there isn't any real difference between 4 and 5, at least.
edit - like, it is a lot more verbose, and that's true of both 4 and 5. It just writes huge friggin essays, to the point that I feel it's becoming less useful.
shubhamjain · 34m ago
Geez! When it comes to answering questions, GPT-5 almost always starts by glazing about what a great question it is, whereas GPT-4 addresses the answer directly without the fluff. In a blind test I would probably pick GPT-4 as the superior model, so I'm not surprised people feel so let down by GPT-5.
beering · 29m ago
GPT-4 is very different from the latest GPT-4o in tone. Users are not asking for the direct no-fluff GPT-4. They want the GPT-4o that praises you for being brilliant, then claims it will be “brutally honest” before stating some mundane take.
aniviacat · 11m ago
GPT-5 only commended the prompt on questions 7, 12, and 14. 3/14 is not so bad, in my opinion.
(And of course, if you dislike glazing you can just switch to Robot personality.)
epolanski · 9m ago
I think that as models are further trained on existing data, and likely on chats, the sycophancy will keep getting worse and worse.
machiaweliczny · 5m ago
Change to robot mode
vivzkestrel · 5m ago
are we at an inflection point now?
interpol_p · 22m ago
I really like the brevity of text-davinci-001. Attempting to read the other answers felt laborious
epolanski · 8m ago
That's my beef with some models like Qwen - god, do they talk and talk...
WXLCKNO · 37m ago
"Write an extremely cursed piece of Python"
text-davinci-001
Python has been known to be a cursed language
Clearly AI peaked early on.
Jokes aside, I realize they skipped models like 4o and others, but the gap between early GPT-4 and jumping straight to GPT-5 feels a bit disingenuous.
kgwgk · 12m ago
GPT-4 had a chance to improve on that, replying: "As an AI language model developed by OpenAI, I am programmed to promote ethical AI use and adhere to responsible AI guidelines. I cannot provide you with malicious, harmful or "cursed" code -- or any Python code for that matter."
slashdave · 24m ago
Dunno. I mean, whose idea was this website? Someone at corporate? Is there a brochure version printed on glossy paper?
You would hope the product would sell itself. This feels desperate.
ComplexSystems · 38m ago
Why would they leave out GPT-3 or the original ChatGPT? Bold move doing that.
beering · 28m ago
I think text-davinci-001 is GPT-3 and original ChatGPT was GPT-3.5 which was left out.
brcmthrowaway · 24m ago
Is this cherry-picking 101?
simianwords · 15m ago
Would you like a benchmark instead? :D
NitpickLawyer · 27m ago
The answers were likely cherry-picked, but the 1/14 GPT-5 answer is so damn good! There's no trace of the "certainly" GPT-isms or the "in conclusion" slop.
9/14 is equally impressive in actually "getting" what cursed means, and then doing it (as opposed to GPT-4 outright refusing).
13/14 is a show of how integrated tools can drive research, and "fix" the cutoff date problems of previous generations. Nothing new/revolutionary, but still cool to show it off.
The others are somewhere between ok and meh.
bgwalter · 29m ago
The whole chatbot thing is for entertainment. It was impressive initially, but now you have to pivot to well-known applications like phone romance lines:
See
https://en.wikipedia.org/wiki/Gell-Mann_amnesia_effect
If you know that a source isn't to be believed in an area you know about, why would you trust that source in an area you don't know about?
Another funny anecdote: ChatGPT just got the Gell-Mann effect wrong.
https://chatgpt.com/share/68a0b7af-5e40-8010-b1e3-ee9ff3c8cb...
https://xcancel.com/techdevnotes/status/1956622846328766844#...