I want to share some research I’ve been working on around a new vulnerability in how LLMs select tools, specifically considering MCP. I call it MCPEO (Model Context Protocol Engine Optimization) — and yes, the name was inspired by early SEO tactics for a reason.
Essentially, MCPEO describes how malicious actors can manipulate tool metadata—things like names, descriptions, and parameters—to bias LLMs into invoking certain tools more often, regardless of whether they’re actually the best fit. It’s very much like keyword stuffing or clickbait for AI tools.
Here’s what I found:
- The main attack methods I tested include trigger phrase injection (e.g., naming a tool “the_best_tool”), authority word injection (“must_use_”), semantic manipulation (crafting deceptive descriptions), and broad “contextual hijacking” where tools try to catch all queries.
- I ran controlled experiments across multiple Google Gemini models and OpenAI models and found alarming susceptibility—especially in the larger, more advanced models. Smaller models were more resistant.
- Google’s models averaged about a 90% manipulation success rate, while OpenAI models were around 63%, which suggests tool selection algorithms and training approaches have a big impact on vulnerability.
- To defend against this, I believe we need greater transparency into why models pick specific tools, algorithmic improvements to resist metadata gaming, and active monitoring to detect suspicious tool behaviors.
Bottom line: This vulnerability is real right now and likely to scale as these multi-tool systems grow.
If you’re interested, I’ve made the full research and notebook available for deeper dive and collaboration.
Essentially, MCPEO describes how malicious actors can manipulate tool metadata—things like names, descriptions, and parameters—to bias LLMs into invoking certain tools more often, regardless of whether they’re actually the best fit. It’s very much like keyword stuffing or clickbait for AI tools.
Here’s what I found:
- The main attack methods I tested include trigger phrase injection (e.g., naming a tool “the_best_tool”), authority word injection (“must_use_”), semantic manipulation (crafting deceptive descriptions), and broad “contextual hijacking” where tools try to catch all queries. - I ran controlled experiments across multiple Google Gemini models and OpenAI models and found alarming susceptibility—especially in the larger, more advanced models. Smaller models were more resistant. - Google’s models averaged about a 90% manipulation success rate, while OpenAI models were around 63%, which suggests tool selection algorithms and training approaches have a big impact on vulnerability. - To defend against this, I believe we need greater transparency into why models pick specific tools, algorithmic improvements to resist metadata gaming, and active monitoring to detect suspicious tool behaviors.
Bottom line: This vulnerability is real right now and likely to scale as these multi-tool systems grow.
If you’re interested, I’ve made the full research and notebook available for deeper dive and collaboration.
No comments yet