We Asked 100 AI Models to Write Code

10 points by todsacerdoti | 8/1/2025, 1:17:41 AM | veracode.com

Comments (1)

ValveFan6969 · 1h ago
So let me get this straight: 100 models, yet you couldn't name a single one. "We largely avoid classifying results according to the vendor or organization providing the model."

Very good. Were these small-time, itty-bitty language models that could easily be mistaken for GPT-2's little brother? Or were these the big dogs like Gemini 2.5? That's what I clicked on this... "article(?)" to find out.

And by the way, for those who couldn't be assed to hand all their personal info to a cybersecurity firm (ironic, I know) in exchange for a measly little report: you're not missing much. It's a whole lotta words that basically say "current AI not good," which, Jesus Lord, protector of all that is holy, if you needed a whole report to make that statement? Well, you'd fit right in as a Substack writer being advertised on Hacker News but not much of anywhere else.

This entire thing is a masterclass in how to say absolutely nothing of value while looking busy. "We largely avoid classifying results according to the vendor..." This isn't some minor methodological footnote. This is fatal. This is like publishing a report on Linux kernel performance across different architectures but "largely avoiding" mentioning whether you were running on x86, ARM, or a potato.

Let's look at your "highlights"...

"Across all models and all tasks, only 55% of generation tasks result in secure code." ACROSS WHAT MODELS? Are we talking about some 7B parameter toy that's been fine-tuned on an uncensored image board? Does this 55% average include a model that scores 99% and another that scores 11%? I have no idea. This number is statistically meaningless. It's noise.

"Larger models do not perform significantly better than smaller models" This is a huge claim. A genuinely interesting one, if it were backed by a shred of actual data. But you don't give us data. You give us anonymous, colored dots on a chart. You classify them into "Small," "Medium," and "Large" with arbitrary parameter counts. WHICH. MODELS. ARE. IN. EACH. BUCKET? Is a "small" model GPT-2? Is a "large" model GPT-4?

This isn't research. This is a deliberate obfuscation of data to create a scary narrative. And why? Oh, let's see, maybe the giant "CONCLUSION" page has a clue...

"Looking to protect yourself from the risks of AI-generated code? Click here to learn more about adaptive application security for the AI era."

Ah. There it is. The punchline. The whole 17-page charade is just a lead magnet for your SAST tool. You're not trying to inform the community or advance the state of security research. You're trying to scare managers into buying your product by presenting frightening-looking charts that are backed by absolutely nothing verifiable. Get this marketing garbage out of my sight.