We Asked 100 AI Models to Write Code

10 points by todsacerdoti | 8/1/2025, 1:17:41 AM | veracode.com

Comments (1)

ValveFan6969 · 1h ago
So let me get this straight: 100 models, yet you couldn't name a single one. "We largely avoid classifying results according to the vendor or organization providing the model."

Very good. Were these small-time, itty-bitty language models that could easily be mistaken for GPT-2's little brother? Or were these the big dogs like Gemini 2.5? That's what I clicked on this... "article(?)" to find out.

And by the way, for those who couldn't be assed to hand all their personal info to a cybersecurity firm (ironic, I know) in exchange for a measly little report: you're not missing much. It's a whole lotta words that basically say "current AI not good," which, Jesus Lord, protector of all that is holy, if you needed a whole report to make that statement? Well, you'd fit right in as a Substack writer being advertised on Hacker News but not much of anywhere else.

This entire thing is a masterclass in how to say absolutely nothing of value while looking busy. "We largely avoid classifying results according to the vendor..." This isn't some minor methodological footnote. This is fatal. This is like publishing a report on Linux kernel performance across different architectures but "largely avoiding" mentioning whether you were running on x86, ARM, or a potato.

Let's look at your "highlights"...

"Across all models and all tasks, only 55% of generation tasks result in secure code." ACROSS WHAT MODELS? Are we talking about some 7B parameter toy that's been fine-tuned on an uncensored image board? Does this 55% average include a model that scores 99% and another that scores 11%? I have no idea. This number is statistically meaningless. It's noise.

"Larger models do not perform significantly better than smaller models" This is a huge claim. A genuinely interesting one, if it were backed by a shred of actual data. But you don't give us data. You give us anonymous, colored dots on a chart. You classify them into "Small," "Medium," and "Large" with arbitrary parameter counts. WHICH. MODELS. ARE. IN. EACH. BUCKET? Is a "small" model GPT-2? Is a "large" model GPT-4?

This isn't research. This is a deliberate obfuscation of data to create a scary narrative. And why? Oh, let's see, maybe the giant "CONCLUSION" page has a clue...

"Looking to protect yourself from the risks of AI-generated code? Click here to learn more about adaptive application security for the AI era."

Ah. There it is. The punchline. The whole 17-page charade is just a lead magnet for your SAST tool. You're not trying to inform the community or advance the state of security research. You're trying to scare managers into buying your product by presenting frightening-looking charts that are backed by absolutely nothing verifiable. Get this marketing garbage out of my sight.