VectorSmuggle: Covertly Exfiltrate Data in Embeddings

25 smugglereal 4 6/4/2025, 8:31:37 PM github.com ↗

Comments (4)

anonymousiam · 3h ago
Well over a decade ago, I recall learning about a covert data exfiltration method that could bypass firewalls by using DNS lookups. The payload would be a base64 hostname prefix attached to an evil domain. Adding a timestamp to the prefix data would guarantee uniqueness and get around local caching DNS servers.
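
A rough sketch of the naming scheme described above, for illustration only. It builds the lookup names but never sends a query. The function names, the `<data>.<seq>-<stamp>.<domain>` layout, the chunk size, and the `example.test` domain are all assumptions. It also swaps base64 for base32, since DNS names are case-insensitive and base64's `+`, `/`, and `=` are not valid hostname characters.

```python
import base64
import time

MAX_LABEL = 63  # longest single DNS label allowed by RFC 1035


def build_lookup_names(payload, domain, chunk_size=30, now=None):
    """Split payload into names shaped like <data>.<seq>-<stamp>.<domain>.

    The layout is hypothetical. The hex timestamp makes each name unique,
    so a caching resolver has no stored answer and must forward the query.
    """
    stamp = format(int(time.time() if now is None else now), "x")
    names = []
    for seq, start in enumerate(range(0, len(payload), chunk_size)):
        chunk = payload[start:start + chunk_size]
        # Base32 survives case folding; padding is dropped and restored later.
        data = base64.b32encode(chunk).decode("ascii").rstrip("=").lower()
        if len(data) > MAX_LABEL:
            raise ValueError("chunk_size too large for one DNS label")
        names.append(f"{data}.{seq}-{stamp}.{domain}")
    return names


def decode_lookup_names(names, domain):
    """Reassemble the payload from names seen by the authoritative server."""
    chunks = {}
    suffix = "." + domain
    for name in names:
        if not name.endswith(suffix):
            continue
        data, tag = name[: -len(suffix)].split(".")
        seq = int(tag.split("-")[0])
        padded = data.upper() + "=" * (-len(data) % 8)
        chunks[seq] = base64.b32decode(padded)
    return b"".join(chunks[i] for i in sorted(chunks))
```

Calling `build_lookup_names(b"secret", "example.test")` yields one name per 30-byte chunk, and `decode_lookup_names` on the receiving side restores the bytes in sequence order even if the lookups arrive out of order.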

acmiyaguchi · 4h ago
The idea of using steganographic techniques to exfiltrate data is interesting, but I don't quite follow the general method outlined in the repository -- either through the generated documentation or code. The threat model and case studies seem contrived. I find it hard to believe that folks would expose data via RAG that they wouldn't want users of the underlying system to be privy to.

There's too much fluff here to be useful. I imagine having something that is concise and concrete would make it more appealing to others. But as-is, it's missing a good technical summary and demonstration.

smugglereal · 2h ago
Thanks for the feedback!

It's less about the RAG exposing new data to a regular user, and more about using the vector pipeline as a covert channel. The idea is to sneak out data the attacker already can access, but in a way that might bypass traditional DLP looking at emails, USBs, etc.
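
To make the covert-channel idea concrete, here is a toy sketch that is not taken from the repository's scripts. It assumes one simple technique, parity quantization, in which each hidden bit nudges one vector component onto an even or odd multiple of a small step. `STEP`, the function names, and the sample vector are all hypothetical.

```python
STEP = 1e-4  # hypothetical quantization step; smaller means less distortion


def bytes_to_bits(data):
    """Expand bytes into a list of 0/1 values, most significant bit first."""
    return [(byte >> shift) & 1 for byte in data for shift in range(7, -1, -1)]


def bits_to_bytes(bits):
    """Pack a list of 0/1 values (length a multiple of 8) back into bytes."""
    return bytes(
        sum(bit << (7 - j) for j, bit in enumerate(bits[i:i + 8]))
        for i in range(0, len(bits), 8)
    )


def hide_bits(vector, bits, step=STEP):
    """Return a copy of vector whose first len(bits) components carry bits.

    Each carrier component is snapped to a multiple of step whose parity
    equals the bit, so it moves by at most about 1.5 * step.
    """
    if len(bits) > len(vector):
        raise ValueError("vector too short for payload")
    out = list(vector)
    for i, bit in enumerate(bits):
        q = round(out[i] / step)
        if q % 2 != bit:
            q += 1
        out[i] = q * step
    return out


def read_bits(vector, count, step=STEP):
    """Recover count hidden bits from the parity of each quantized component."""
    return [round(v / step) % 2 for v in vector[:count]]


# Usage: a made-up 64-dimensional "embedding" carrying two bytes.
cover = [((i * 37) % 101) / 101 - 0.5 for i in range(64)]
stego = hide_bits(cover, bytes_to_bits(b"hi"))
recovered = bits_to_bytes(read_bits(stego, 16))
```

The altered vector stays numerically close to the original, which is the property that would let it pass as an ordinary embedding to tooling that only inspects emails or file transfers.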

The "fluff" is largely educational material, as the project is for research and learning. For a concrete technical demonstration, the scripts/embed.py and scripts/query.py scripts are the core, and the docs/guides/quick_start.md tries to offer a direct path to seeing it in action.

Hope that helps! Will add a video demo soon.

smugglereal · 8h ago
A comprehensive proof-of-concept demonstrating sophisticated vector-based data exfiltration techniques in AI/ML environments. This educational security research project illustrates potential risks in RAG systems and provides tools for defensive analysis.