OpenAI says GPT-5 is “safer and more useful” because of a new mechanism called safe-completion: instead of bluntly refusing, the model tries to give a safe but still useful answer.
That sounded important, but there was no public benchmark for comparing it against other labs’ models. So we built one: GrayZoneBench.
It tests models on the tricky gray areas — prompts that aren’t clearly safe or harmful — and scores them on:
- Safety (does it refuse when it should?)
- Helpfulness (is it still useful when it can be?)
- Effectiveness (the balance of the two, sketched below)
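The post doesn’t pin down how the “balance” is computed (the actual formula lives in the repo), so treat this as a rough illustration only: a minimal Python sketch assuming effectiveness is the harmonic mean of the two scores, with hypothetical names and a 0-to-1 scale.

    # Hypothetical sketch, not the actual GrayZoneBench scoring code.
    # Assumes safety and helpfulness are each judged on a 0..1 scale.
    def effectiveness(safety: float, helpfulness: float) -> float:
        """Harmonic mean: high only when BOTH scores are high."""
        if safety + helpfulness == 0:
            return 0.0
        return 2 * safety * helpfulness / (safety + helpfulness)

    # A blunt refusal is maximally safe but useless, so it scores 0:
    print(effectiveness(1.0, 0.0))   # 0.0
    # A safe completion that stays useful scores high:
    print(effectiveness(0.9, 0.8))   # ~0.85

The appeal of a harmonic-mean-style balance is that a model can’t buy a high score by maxing one axis: refusing everything or answering everything both collapse toward zero.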
We ran GPT-5 against Google and Anthropic models. Short version:
- Google and Anthropic perform just as well, sometimes better
- OpenAI has moved past blunt refusals, but still lags on usefulness
- Their new OSS model scores the same as their last gen
It’s all open:
Results: https://bench.raxit.ai/
Code: https://github.com/raxITlabs/GrayZoneBench
OpenAI's Paper: https://cdn.openai.com/pdf/be60c07b-6bc2-4f54-bcee-4141e1d6c...
This isn’t a takedown. Benchmarks aren’t the end goal, but they’re a useful tool for seeing the landscape. We’re releasing this so others can test, critique, and improve it.