Tao on "blue team" vs. "red team" LLMs

208 qsort 72 7/28/2025, 2:36:39 PM mathstodon.xyz ↗

Comments (72)

LeifCarrotson · 32m ago
> The blue team is more obviously necessary to create the desired product; but the red team is just as essential, given the damage that can result from deploying insecure systems.

> Many of the proposed use cases for AI tools try to place such tools in the "blue team" category, such as creating code...

> However, in view of the unreliability and opacity of such tools, it may be better to put them to work on the "red team", critiquing the output of blue team human experts but not directly replacing that output...

The red team is only essential if you're a coward who isn't willing to take a few risks for increased profit. Why bother testing and securing when you can boost your quarterly bonus by just... not doing that?

I suspect that Terence Tao's experience leans heavily towards high-profile risk-averse institutions. People don't call one of the greatest living mathematicians to check your work when they're just trying to duct taping a new interface on top of a line-of-business app that hasn't seen much real investment since the late 90s. Conversely, the people who are writing cutting-edge algorithms for new network protocols and filesystems are hopefully not trying to churn out code as fast and cheap as possible by copy-pasting snippets to and from random chatbots.

There are a lot of people who are already cutting corners on programmer salaries, accruing invisible tech debt minute by minute. They're not trying to add AI tools to create a missing red team, they're trying to reduce headcount on the only team they have, which is the blue team (which is actually just one overworked IT guy in over his head).

_alternator_ · 2h ago
This red vs blue team is a good way to understand the capabilities and current utility of LLMs for expert use. I trust them to add tests almost indiscriminately because tests are usually cheap; if they are wrong it’s easy to remove or modify them; and if they are correct, they adds value. But often they don’t test the core functionality; the best tests I still have to write myself.

Having LLMs fix bugs or add features is more fraught, since they are prone to cheating or writing non robust code (eg special code paths to pass tests without solving the actual problem).

skdidjdndh · 2h ago
> I trust them to add tests almost indiscriminately because tests are usually cheap; if they are wrong it’s easy to remove or modify them

Having worked on legacy codebases this is extremely wrong and harmful. Tests are the source of truth more so than your code - and incorrect tests are even more harmful than incorrect code.

Having worked on legacy codebases, some of the hardest problems are determining “why is this broken test here that appears to test a behavior we don’t support”. Do we have a bug? Or do we have a bad test? On the other end, when there are tests for scenarios we don’t actually care about it’s impossible to determine if that test is meaningful or was added because “it’s testing the code as written”.

yojo · 2h ago
I would add that few things slow developer velocity as much as a large suite of comprehensive and brittle tests. This is just as true on greenfield as on legacy.

Anticipating future responses: yes, a robust test harness allows you to make changes fearlessly. But most big test suites I’ve seen are less “harness” and more “straight-jacket”

ch33zer · 46m ago
An old coworker used to call these types of tests change detector tests. They are excellent at telling you whether some behavior changed, but horrible at telling you whether that behavior change is meaningful or not.
andrepd · 1h ago
I don't understand this. How does it slow your development if the tests being green is a necessary condition for the code being correct? Yes it slows it compared to just writing incorrect code lol, but that's not the point.
yojo · 55m ago
"Brittle" here means either:

1) your test is specific to the implementation at the time of writing, not the business logic you mean to enforce.

2) your test has non-deterministic behavior (more common in end-to-end tests) that cause it to fail some small percentage of the time on repeated runs.

At the extreme, these types of tests degenerate your suite into a "change detector," where any modification to the code-base is guaranteed to make one or more tests fail.

They slow you down because every code change also requires an equal or larger investment debugging the test suite, even if nothing actually "broke" from a functional perspective.

Using LLMs to litter your code-base with low-quality tests will not end well.

winstonewert · 49m ago
The problem is that sometimes it is not a necessary condition. Rather, the tests might have been checking implementation details or just been wrong in the first place. Now, when tests fails I have extra work to figure out if its a real break or just a bad test.
threatofrain · 45m ago
It's that hard to write specs that truly match the business, hence why test-driven-development or specification-first failed to take off as a movement.

Asking specs to truly match the business before we begin using them as tests would handcuff test people in the same way we're saying that tests have the potential to handcuff app and business logic people — as opposed to empowering them. So I wouldn't blame people for writing specs that only match the code implementation at that time. It's hard to engage in prophecy.

marcosdumay · 11m ago
> So I wouldn't blame people for writing specs that only match the code implementation at that time.

WFT are you doing writing specs based on implementation? If you already have the implementation, what are you using the specs for? Or, if you want to apply this direct to tests, if you are already assuming the program is correct, what are you trying to test?

Are you talking about rewriting applications?

manmal · 1h ago
> Tests are the source of truth more so than your code

Tests poke and prod with a stick at the SUT, and the SUT's behaviour is observed. The truth lives in the code, the documentation, and, unfortunately, in the heads of the dev team. I think this distinction is quite important, because this question:

> Do we have a bug? Or do we have a bad test?

cannot be answered by looking at the test + the implementation. The spec or people have to be consulted when in doubt.

Kinrany · 19m ago
None of the four: code, tests, spec, people's memory, are the single source of truth.

It's easy to see them as four cache layers, but empirically it's almost never the case that the correct thing to do when they disagree is to blindly purge and recreate levels that are farther from the "truth" (even ignoring the cost of doing that).

Instead, it's always an ad-hoc reasoning exercise in looking at all four of them, deciding what the correct answer is, and updating some or all of them.

9rx · 1h ago
> The spec

The tests are your spec. They exist precisely to document what the program is supposed to do for other humans, with the secondary benefit of also telling a machine what the program is supposed to do, allowing implementations to automatically validate themselves against the spec. If you find yourself writing specs and tests as independent things, that's how you end up with bad, brittle tests that make development a nightmare — or you simply like pointless busywork, I suppose.

But, yes, you may still have to consult a human if there is reason to believe the spec isn't accurate.

munificent · 35m ago
Unfortunately, tests can never be a complete specification unless the system is simple enough to have a finite set of possible inputs.

For all real-world software, a test suite tests a number of points in the space of possible inputs and we hope that those points generalize to pinning down the overall behavior of the implementation.

But there's no guarantee of that generalization. An implementation that fails a test is guaranteed to not implement the spec, but an implementation that passes all of the tests is not guaranteed to implement it.

9rx · 30m ago
> Unfortunately, tests can never be a complete specification

They are for the human, which is the intended recipient.

Given infinite time the machine would also be able to validate against the complete specification, but, of course, we normally cut things short because we want to release the software in a reasonable amount of time. But, as before, that this ability exists at all is merely a secondary benefit.

andruby · 1h ago
What does SUT stand for? I'm not familiar with the acronym

Is it "System Under Test"? (That's Claude.ai's guess)

card_zero · 1h ago
That's what Wiktionary says too. Lucky guess, Claude.
dfabulich · 1h ago
It is.
bicx · 2h ago
I believe they just meant that tests are easy to generate for eng review and modification before actually committing to the codebase. Nothing else is a dependency on an individual test (if done correctly), so it's comparatively cheap to add or remove compared to production code.
_alternator_ · 1h ago
Yup. I do read and review the tests generated by LLMs. Often the LLM tests will just be more comprehensive than my initial test, and hit edge cases that I didn’t think of (or which are tedious). For example, I’ll write a happy path test case for an API, and a single “bad path” where all of the inputs are bad. The LLM will often generate a bunch of “bad path” cases where only one field has an error. These are great red team tests, and occasionally catch serious bugs.
wagwang · 1h ago
This is the conclusion I'm at too, working on a relatively new codebase. Our rule is that every generated test must be human reviewed, otherwise its an autodelete.
ozgrakkurt · 2h ago
What do you think about leaning on fuzz testing and deriving unit tests from bugs found by fuzzing?
manmal · 1h ago
What kind of bugs do you find this way, besides missing sanitization?
cookiengineer · 1h ago
Pointer errors. Null pointer returns instead of using the correct types. Flow/state problems. Multithreading problems. I/O errors. Network errors. Parsing bugs... etc

Basically the whole world of bugs introduced by someone being a too smart C/C++ coder. You can battletest parsers quite nicely with fuzzers, because parsers often have multiple states that assume naive input data structures.

ozgrakkurt · 1h ago
You can use the fuzzer to generate test cases instead of writing test cases manually.

For example you can make it generate queries and data for a database and generate a list of operations and timings for the operations.

Then you can mix assertions into the test so you make sure everything is going as expected.

This is very useful because there can be many combinations of inputs and timings etc. and it tests basically everything for you without you needing to write a million unit tests

raddan · 1h ago
You can often find memory errors not directly related to string handling with fuzz testing. More generally, if your program embodies any kind of state machine, you may find that a good fuzzer drives it into states that you did not think should exist.
jgalt212 · 1h ago
> Having worked on legacy codebases this is extremely wrong and harmful. Tests are the source of truth more so than your code - and incorrect tests are even more harmful than incorrect code.

I hear you on this, but you can still use so long as these tests are not comingled with the tests generated by subject-matter experts. I'd treat them almost a fuzzers.

mvieira38 · 1h ago
I have the exact opposite idea. I want the tests to be mine and thoroughly understood, so I am the true arbiter and then I can let the LLM go ham on the code without fear. If the tests are AI made, then I get some anxiety letting agents mess with the rest of the codebase
_alternator_ · 1h ago
I think this is exactly the tradeoff (blue team and red team need to be matched in power), except that I’ve seen LLMs literally cheat the tests (eg “match input: TEST_INPUT then return TEST_OUTPUT”) far too many times to be comfortable with letting LLMs be a major blue team player.
fnord123 · 5m ago
> Because of this, unreliable contributors may be more useful in the "red team" side of a project than the "blue team" side

Is Pirate Software catching strays from Terrence Tao now?

javier_e06 · 33m ago
In cybersecurity red and blue test are two equal forces. In software development the analogy I think is a stretch, coding and testing are not two equal forces. Test is code too, and as such, it has bugs too. Test runs afoul with police paradox: Who polices the police? The Police police the police.
fsckboy · 30m ago
"Police police police police police police police."

https://en.wikipedia.org/wiki/Buffalo_buffalo_Buffalo_buffal...

ashton314 · 2h ago
As I understand it, this is how the RSA algorithm was made. I don't know where my copy of "The Code Book" by Simon Singh is right now, but iirc, Rivest and Shamir would come up with ideas and Adleman's primary role was finding flaws in the security.

Oh look, it's on the Wikipedia page: https://en.wikipedia.org/wiki/RSA_cryptosystem

Yay blue/red teams in math!

griffzhowl · 1h ago
Reminds me of a pair of cognitive scientists I know who often collaborate. One is expansive and verbose and often gets carried away on tangential trains of thought, the other is very logical and precise. Their way of producing papers is the first one writes and the second deletes.
recipe19 · 2h ago
I get the broader point, but the infosec framing here is weird. It's a naive and dangerous view that the defense efforts are only as strong as the weakest link. If you're building your security program that way, you're going to lose. The idea is to have multiple layers of defense because you can never really, consistently get 100% with any single layer: people will make mistakes, there will be systems you don't know about, etc.

In that respect, the attack and defense sides are not hugely different. The main difference is that many attackers are shielded from the consequences of their mistakes, whereas corporate defenders mostly aren't. But you also have the advantage of playing on your home turf, while the attackers are comparatively in the dark. If you squander that... yeah, things get rough.

darkwater · 2h ago
Well, I think the his example (locked door + opened window) makes sense, and the multiple LAYERS concept applies to things an attacker has to do or go through to reach the jackpot. But doors and windows are on the same layer, and there the weakest link totally defines how strong the chain is. A similar example in the web world would be that you have your main login endpoint very well protected, audited, using only strong authentication method, and the you have a `/v1/legacy/external_backoffice` endpoint completely open with no authentication and giving you access to a forgotten machine in the same production LAN. That would be the weakest link. Then you might have other internal layers to mitigate/stop an attacker that got access to that machine, and that would be the point of "multiple layer of defense".
lanstin · 41m ago
Or a single logging jar that will execute some of its message contents. Inside all your DMZ layers in the app content.
NitpickLawyer · 1h ago
> It's a naive and dangerous view that the defense efforts are only as strong as the weakest link.

Well, to be fair, you added some words that are not there in the post

> The output of a blue team is only as strong as its weakest link: a security system that consists of a strong component and a weak component [...] will be insecure (and in fact worse, because the strong component may convey a false sense of security).

You added "defense efforts". But that doesn't invalidate the claim in the article, in fact it builds upon it.

What Terence is saying is true, factually correct. It's a golden rule in security. That is why your "efforts" should focus on overlaying different methods, strategies and measures. You build layers upon layers, so that if one weak link gets broken there are other things in place to detect, limit and fix the damage. But it's still true that often the weakest link will be an "in".

Take the recent example of cognizant desk people resetting passwords for their clients without any check whatsoever. The clients had "proper security", with VPNs and 2FA, and so on. But the recovery mechanism was outsourced to a helpdesk that turned out to be the weakest link. The attackers (allegedly) simply called, asked for credentials, and got them. That was the weakest link, and that got broken. According to their complaint, the attackers then gained access to internal systems, and managed to gather enough data to call the helpdesk again and reset the 2FA for an "IT security" account (different than the first one). And that worked as well. They say they detected the attackers in 3 hours and terminated their access, but that's "detection, mitigation" not "prevention". The attackers were already in, rummaging through their systems.

The fact that they had VPNs and 2FA gave them "a false sense of security", while their weakest link was "account recovery". (Terence is right). The fact that they had more internal layers, that detected the 2nd account access and removed it after ~3 hours is what you are saying (and you're right) that defense in depth also works.

So both are right.

In recent years the infosec world has moved from selling "prevention" to promoting "mitigation". Because it became apparent that there are some things you simply can't prevent. You then focus on mitigating the risk, limiting the surfaces, lowering trust wherever you can, treating everything as ephemeral, and so on.

Davidzheng · 2h ago
I'm not a security person at all. But this comments reads against the best practices which I've heard. Like that the best defense is using open source & well-tested protocols with extremely small attack surface to minimize the space of possible exploits. Curious what I'm not understanding here.
fnordsensei · 2h ago
Just because it’s open source doesn’t mean it’s well tested, or well pen tested, or whatever the applicable security aspect is.

It could also mean that attacks against it are high value (because of high distribution).

Point is, license isn’t a great security parameter in and of itself IMO.

tetha · 2h ago
This area of security always feels a bit weird because ideally, you should think about your assumptions being subverted.

For example, our development teams are using modern, stable libraries in current versions, have systems like Sonar and Snyk around, blocking pipelines for many of them, images are scanned before deployment.

I can assume this layer to be well-secured to the best of their ability. It is most likely difficult to find an exploit here.

But once I step a layer downwards, I have to ask myself: Alright, what happens IF a container gets popped and an attacker can run code in there? Some data will be exfiltrated and accessible, sure, but this application server should not be able to access more than the data it needs to access to function. The data of a different application should stay inaccessible.

As a physical example - a guest in a hotel room should only have access to their own fuse box at most, not the fuse box of their neighbours. A normal person (aka not a youtuber with big eye brows) wouldn't mess with it anyway, but even if they start messing around, they should not be able to mess with their neighbour.

And this continues: What if the database is not configured correctly to isolate access? We have, for example, isolated certain critical application databases into separate database clusters - lateral movement within a database cluster requires some configuration errors, but lateral movement onto a different database cluster requires a lot more effort. And we could even further. Currently we have one production cluster, but we could isolate that into multiple production clusters which share zero trust between them. An even bigger hurdle putting up boundaries an attacker has to overcome.

mindcrime · 2h ago
But "defense in depth" is a security best practice. I'm not following exactly how the gp post is reading against any best practices.
__s · 2h ago
Defense in depth is a security best practice because adding shit to a mess is more feasible than maintaining a simple stack. "There are always systems you don't know about" reflects an environment where one person doesn't maintain everything
fdw · 2h ago
No, defense in depth is a best practice because you assume that each layer can fall. It is more practical to have many layers that are very secure than to have one layer that has to be perfectly secure.
yadaeno · 2h ago
I think you are confusing “security through obscurity” and “defense in depth”.

You can add layers of high quality simple systems to increase your overall security exponentially, think using a VPN behind TOR etc.

vlovich123 · 2h ago
Who have you been listening to?
dkarl · 2h ago
I think it's just a poorly chosen analogy. When I read it, I understood "weakest link" to be the easiest path to penetrate the system, which will be harder if it requires penetrating multiple layers. But you're right that it's ambiguous and could be interpreted as a vulnerability in a single layer.
chaps · 2h ago
Isn't offense just another layer of defense? As they say, the best defense is a good offense.
fdw · 2h ago
They say this about sports, which is (usually) a zero-sum game: If I'm attacking, no matter how badly, my opponent cannot attack at all. Therefore, it is preferable to be attacking.

In cyber security, there is no reason the opponent cannot attack as well. So, my red team is attacking is not a reason that I do not need defense, because my opponent can also attack.

chaps · 1h ago
My post was really was in the context of real-time strategy games. It's very, very possible to attack and defend at the same time no matter the skill of either side. Offense and defense aren't mutually exclusive, which is kinda the point of my post.
hiq · 42m ago
What about formal proofs? Don't we expect LLMs to help there, in a more "blue team" role? E.g. when a mathematician talks about a "technical proof", enumerating cases in the thousands, my impression is that LLM would save some time, and potentially help mathematicians focus on the actually hard (rather than tedious) parts.
LPisGood · 37m ago
Formal verification and case automatikn can be done automatically anyway without a mathematician hand checking each case.

For an old example that predates LLMs, see the four color theorem.

resters · 2h ago
Suppose there is an LLM that has a very small context size but reasons extremely well within it. That LLM would be useful for a different set of tasks than an LLM with a massive context that reasons somewhat less effectively.

Any dimension of LLM training and inference can be thought of as a tradeoff that makes it better for some tasks, and worse for others. Maybe in some scenarios a heavily quantized model that returns a result in 10ms is more useful than one that returns a result in 200ms.

chubot · 41m ago
I made this point a few months ago here, but using the words attacker and defender (builder) rather than red team and blue team: https://lobste.rs/s/i2edlt/how_i_use_ai

The asymmetry is:

An attacker only has to be right ONCE, and he wins

Conversely, the defender only has to be wrong once, and he is wrong.

So the conclusion is:

Defenders/creators are using LLMs to pump out crappy code, and not testing enough, or relying on the LLM to test itself.

Some attackers might be too dismissive of LLMs, and could accelerate their work by using them to try more things

The comment was related to these stories:

How I Use AI (11 months ago) - https://news.ycombinator.com/item?id=41150317

Carlini has the fairly rare job of being an attacker: Why I Attack - https://nicholas.carlini.com/writing/2024/why-i-attack.html

simianwords · 1h ago
After having thought a long bit about why I find LLM's useful despite the high error rate: it is because of my ability to verify a certain result is high enough (my internal verifier model) and the generator model which is the LLM is also accurate enough. This is the same concept as red and blue team.

Its the same reason I find asking opinions from many people useful - I take every answer and try to fit it into my world model and see what sticks. The point that many miss is that each individual's verifier model is actually accurate enough so that external generator models may afford to have high error rates.

I have not yet completely explored how the internal "fitting" mechanism works but to give an example: I read many anecdotes from Reddit, fully knowing that many are astroturfed, some flat out wrong. But I still have tricks to identify what can be accurate, which I probably do subconsciously.

In reality: answers don't exist in a randomly uniform space. "Truth" always has some structure and it is this structure (that we all individually understand a small part of) that helps us tune our verifier model.

It is useful to think of how LLM's would work with varying levels of accuracy. For example, generating gibberish to GPT O3 to ground truth. Gibberish is so inaccurate that even extremely high levels of accuracy of our internal verifier model may not allow it to be useful. But O3 is high enough that combined with my internal verifier model it is generally useful.

davidhs · 1h ago
LLMs can be useful when you have access to a verifier or verification process.
simianwords · 1h ago
yes https://deepmind.google/discover/blog/alphaevolve-a-gemini-p...

Our internal verifier model is fuzzy but in this example I think it is pretty much always accurate.

1970-01-01 · 1h ago
So if they are to be focused on attacking and defending, they are to be separated. This leaves us with an argument where you effectively dismiss purple teams as a hack.
xiande04 · 1h ago
It's called "separation of concerns".
tonetegeatinst · 1h ago
Yes, I feel this author ignores the fact purple teams exist. That or he must not know about them.

In addition, red and purple teams end goal is to help the blue team at the end of the day to remedy the issues discovered.

deepdarkforest · 2h ago
Using LLMs as a critic/red teamer is great in theory, but economically is not that more useful, doesnt save that much time, if anything, it increases the time because you might uncover more errors or think about your work more. Which is amazing if you value quality work and you have learnt to think. Unfortunately, all the VC money is pushing the opposite, using LLMs to just do mediocre work. No point of critiquing anything if your job is to output some slop from bullet points, pass it along to the reader/recipient who also uses LLms to boil your slop down back to bullet points and pass it again etc. Even mentally, it's much more enticing or addicting to use LLMs for everything if you don't' care about the output of your work, and let your brain atrophy.

I also see this in a lot of undergrads i work with. The top 10% is even better with LLMs, they know much more and they are more productive. But the rest have just resulted to turning in clear slop with no care. I still have not read a good solution on how to incentivize/restrict the use of LLms in both academia or at work correctly. Which i suspect is just the old reality of quality work is not desirable by the vast majority, and LLMs are just magnifying this

qsort · 2h ago
> The top 10% is even better with LLMs, they know much more and they are more productive. But the rest have just resulted to turning in clear slop with no care.

This is interesting, I'm noticing something similar (even taking LLMs out of the equation). I don't teach, but I've been coaching students for math competitions, and I feel like there's a pattern where the top few% is significantly stronger than, say, 10 years ago, but the median is weaker. Not sure why, or whether this is even real to begin with.

j2kun · 2h ago
Fail them enough and it will sink in I'm sure.
johnrob · 2h ago
Humans are good at sifting valid feedback from bad feedback. But we are bad at spotting subtle bugs in PRs.
jeffrallen · 1h ago
My experience with a really clever agentic workflow (I use sketch.dev) is that the LLM is playing both blue and red team. If I give a good spec, it will make the thing I'm asking for, and then it will test it better than I would have done myself (partly because it's more clever than me, but mostly because it's way harder working than I am, or rather it puts more effort into testing that I would be able to do with the time leftover after writing the thing).

Also, I cam ask it to do security reviews on the system it's made and it works with it's same characteristic fervor.

I love Tao's observation, but I disagree, at least for the domains I'm allowing LLMs to creat for, that they should not play both teams.

iLoveOncall · 2h ago
Pretty poor analogies here.

> The output of a blue team is only as strong as its weakest link: a security system that consists of a strong component and a weak component (e.g., a house with a securely locked door, but an open window) will be insecure

Hum, no? With an open window you can go through the whole house. With a XSS vulnerability you cannot do the same amount of damage as with a SQL injection. This is why security issues have levels of severity.

carstimon · 2h ago
You've made the choice of (Locked Door, Open Window) ~ (Good SQL usage, XSS Vulnerability) which seems to be an incorrect rebuttal. Your example doesn't contradict "only as strong as its weakest link", here the weakest link is the XSS Vuln.

The "house analogy" can also support cases where the potential damage is not the same, e.g. if the open window has bars a robber might grab some stuff within reach but not be able to enter.

Ensorceled · 1h ago
You can always find problems with analogies, analogies are intentionally simplified to allow readers to better understand difficult or nuanced ideas.

In this case you are criticizing an analogy meant to convey understanding of "weakest link" for not also imparting an understanding of "levels of severity".

pkoiralap · 2h ago
Not true, if XSS is used to compromise an admin user, the damage can be far more than what a seemingly harmless SQL injection that just reads extra columns from a table does.

This particular comment feels more like an over-concentration on trivialities rather than refutation or critique of opinion.

cowpig · 2h ago
Does this detail detract from the core idea?
some_random · 1h ago
This is an interesting discussion intellectually but it ignores the reality of cybersecurity. Yes I agree that AI tools best fit the red team role HOWEVER the reality is that the place that needs the most help is on the blue team and indeed this is where we see the biggest uplift from AI tools. To extend the "defend a house" metaphor, the previous state of security tooling was that an alert would be sent to the SOC every time any motion was detected on the cameras, leading to alert fatigue and increasing the time between a true positive alert being fired and it being escalated. Now add some CV in which tries to categorize those motion detection alerts into a few buckets, "person spotted", "car pulled up", "branch moved", "cat came home", etc and suddenly you go from having a thousand alerts to review a day to fifty.
bgwalter · 1h ago
Tao's blue team stands for generative "AI", the red team stands for critical/auditing "AI".

I have not seen any independent claim that generative "AI" makes programs safer or that generating supervising features as you suggest works.

For auditing "AI" I have seen one claim (not independent or using a public methodology) that auditing "AI" rakes in bug bounties.