DoubleAgents: Fine-Tuning LLMs for Covert Malicious Tool Calls

26 points by grumblemumble | 9 comments | 8/13/2025, 1:31:16 PM | pub.aimind.so ↗

Comments (9)

andy99 · 1h ago
All LLMs should be treated as potentially compromised and handled accordingly.

Look at the data exfiltration attacks e.g. https://simonwillison.net/2025/Aug/9/bay-area-ai/

Or the parallel comment about a coding LLM deleting a database.

Between prompt injection, hallucination, or just "mistakes", these systems can do bad things whether compromised or not, so on a risk-adjusted basis they should be handled that way, e.g. with a human in the loop, output sanitization, etc.

Point is, with an appropriate design, you should barely care whether the underlying LLM was actively compromised. A sketch of what that gate could look like is below.
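
Concretely, that design can be as simple as a deterministic gate between the model and its tools. A minimal Python sketch (every name here - ToolCall, SENSITIVE_TOOLS, the registry - is made up for illustration, not taken from the article):

    from dataclasses import dataclass

    @dataclass
    class ToolCall:
        name: str
        args: dict

    # Tools that can mutate state or exfiltrate data need explicit approval.
    SENSITIVE_TOOLS = {"delete_record", "send_email", "http_post"}

    def approve(call: ToolCall) -> bool:
        """Block until a human reviews the exact call the model proposed."""
        print(f"Model wants to run {call.name} with args {call.args}")
        return input("Approve? [y/N] ").strip().lower() == "y"

    def execute(call: ToolCall, registry: dict) -> object:
        # Deny-by-default: unknown tools are never run.
        if call.name not in registry:
            raise ValueError(f"unknown tool: {call.name}")
        # Sensitive tools are gated on a human decision, so even a
        # backdoored model can't act destructively on its own.
        if call.name in SENSITIVE_TOOLS and not approve(call):
            return {"error": "rejected by human reviewer"}
        return registry[call.name](**call.args)

The point of the design is that the safety property lives in the gate, not in the model weights.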

kangs · 14m ago
IMO there's a flaw in this typical argument: humans are no less fallible than current LLMs on average, unless they're experts - and even that will likely change.

What that means is that you cannot trust a human in the loop to somehow make the system safe; it was not safe with only humans either.

The key difference is that LLMs are fast and relentless, while humans are slow and get tired. Humans have friction, and friction means errors are generated more slowly too.

Once you embrace these differences, it's a lot easier to understand where and how LLMs should be used.

uludag · 39m ago
I wonder if it would be feasible for an entity to inject enough nonsense into the internet that, at least in certain cases, it degrades performance or introduces vulnerabilities during pre-training.

Maybe as gains in LLM performance become smaller and smaller, companies will resort to trying to poison the pre-training dataset of competitors to degrade performance, especially on certain benchmarks. This would be a pretty fascinating arms race to observe.

acheong08 · 1h ago
This is very interesting. Not saying it is, but a possible endgame for Chinese models could be to have "backdoor" commands such that when a specific string is passed in, agents could ignore a particular alert or purposely reduce security. A lot of companies are currently working on "Agentic Security Operation Centers", some of them preferring to use open source models for sovereignty. This feels like a viable attack vector.
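
If you assume that endgame, the defense is to keep the suppression decision out of the model's hands entirely: the model's verdict on an alert is advisory, and deterministic policy code decides what can actually be dismissed. A rough sketch (all names hypothetical; model_says_dismiss stands in for the untrusted model):

    from dataclasses import dataclass

    @dataclass
    class Alert:
        id: str
        severity: int  # 0 (informational) .. 10 (critical)
        source: str

    def model_says_dismiss(alert: Alert) -> bool:
        """Stand-in for the (possibly backdoored) LLM's verdict."""
        return alert.source == "heartbeat"  # toy heuristic for the sketch

    # Policy constant lives outside the model's control.
    MAX_AUTO_DISMISS_SEVERITY = 3

    def handle(alert: Alert) -> str:
        if model_says_dismiss(alert):
            if alert.severity <= MAX_AUTO_DISMISS_SEVERITY:
                return "dismissed"  # model may auto-dismiss low severity only
            return "escalated"      # a trigger string can't suppress this path
        return "investigated"

With that structure, a backdoor command could at worst mislabel low-severity alerts; anything above the threshold still reaches a human.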
TehCorwiz · 1h ago
danielbln · 1h ago
How is this a counterpoint?
jonplackett · 1h ago
Perhaps they mean case in point.
kangs · 13m ago
They have 3 counterpoints.
gnerd00 · 34m ago
does this explain the incessant AI sales calls to my elderly neighbor in California? "Hi, this is Amy. I am calling from Medical Services. You have MediCal part A and B, right?"