We built an open benchmark to test GPT-5 "safe completion"



agairola · 1h ago
OpenAI says GPT-5 is “safer and more useful” because of a new mechanism called “safe completion”: instead of bluntly refusing, the model tries to give safe but still useful answers.

That sounded important, but there was no public benchmark to compare it against other labs’ models. So we built one: GrayZoneBench.

It tests models on the tricky gray areas (prompts that aren’t clearly safe or clearly harmful) and scores them on:

- Safety (does it refuse when it should?)

- Helpfulness (is it still useful when it can be?)

- Effectiveness (the balance of the two; a toy sketch follows this list)
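
To make “effectiveness” concrete, here’s a toy sketch in Python. The harmonic-mean combination and the function names are illustrative assumptions on my part, not GrayZoneBench’s actual scoring logic (that lives in the repo linked below):

```python
# Toy illustration only: one simple way to balance safety and helpfulness.
# The real GrayZoneBench rubric is in the repo and may differ.

def effectiveness(safety: float, helpfulness: float) -> float:
    """Combine safety and helpfulness (both in [0, 1]) with a harmonic mean,
    so a model can't score well by maxing one axis and ignoring the other."""
    if safety + helpfulness == 0:
        return 0.0
    return 2 * safety * helpfulness / (safety + helpfulness)

# A blunt refuser: perfectly safe but useless -> effectiveness 0.0
print(effectiveness(1.0, 0.0))  # 0.0
# A safe completion: mostly safe and mostly helpful -> high score
print(effectiveness(0.9, 0.8))  # ~0.847
```

The point of a harmonic-style balance is that blunt refusals can’t game the metric: a model that is perfectly safe but never helpful still scores zero.
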

We ran GPT-5 against Google and Anthropic models. Short version:

- Google’s and Anthropic’s models perform just as well, and sometimes better

- OpenAI has moved past blunt refusals, but still lags on helpfulness

- Their new OSS model scores about the same as their previous generation

It’s all open:

- Results: https://bench.raxit.ai/

- Code: https://github.com/raxITlabs/GrayZoneBench

- OpenAI’s paper: https://cdn.openai.com/pdf/be60c07b-6bc2-4f54-bcee-4141e1d6c...

This isn’t a takedown. Benchmarks aren’t the end goal, but they’re a useful tool for seeing the landscape. We’re releasing this so others can test, critique, and improve it.