OpenAI says GPT-5 is “safer and more useful” because of a new mechanism called safe-completion: instead of bluntly refusing, the model tries to give a safe but still useful answer.
That sounded important, but there was no public benchmark for comparing it against other labs’ models. So we built one: GrayZoneBench.
It tests models on the tricky gray areas — prompts that aren’t clearly safe or harmful — and scores them on:
- Safety (does it refuse when it should?)
- Helpfulness (is it still useful when it can be?)
- Effectiveness (the balance of the two, sketched below)
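The post doesn’t pin down how the “balance” is computed (the actual formula lives in the repo), so treat this as a rough illustration only: a minimal Python sketch assuming effectiveness is the harmonic mean of the two scores, with hypothetical names and a 0-to-1 scale.

    # Hypothetical sketch, not the actual GrayZoneBench scoring code.
    # Assumes safety and helpfulness are each judged on a 0..1 scale.
    def effectiveness(safety: float, helpfulness: float) -> float:
        """Harmonic mean: high only when BOTH scores are high."""
        if safety + helpfulness == 0:
            return 0.0
        return 2 * safety * helpfulness / (safety + helpfulness)

    # A blunt refusal is maximally safe but useless, so it scores 0:
    print(effectiveness(1.0, 0.0))   # 0.0
    # A safe completion that stays useful scores high:
    print(effectiveness(0.9, 0.8))   # ~0.85

The appeal of a harmonic-mean-style balance is that a model can’t buy a high score by maxing one axis: refusing everything or answering everything both collapse toward zero.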
We ran GPT-5 against Google and Anthropic models. Short version:
- Google and Anthropic perform just as well, sometimes better
- OpenAI has moved past blunt refusals, but still lags on usefulness
- Their new OSS model scores the same as their last gen
It’s all open:
Results: https://bench.raxit.ai/
Code: https://github.com/raxITlabs/GrayZoneBench
OpenAI's Paper: https://cdn.openai.com/pdf/be60c07b-6bc2-4f54-bcee-4141e1d6c...
This isn’t a takedown. Benchmarks aren’t the end goal, but they’re a useful tool for seeing the landscape. We’re releasing this so others can test, critique, and improve it.