A valid HTML zip bomb

137 points by Bogdanp | 36 comments | 7/24/2025, 1:16:29 PM | ache.one ↗

Comments (36)

bhaney · 1d ago
Neat approach. I make my anti-crawler HTML zip bombs like this:

    (echo '<html><head></head><body>' && yes "<div>") | dd bs=1M count=10240 iflag=fullblock | gzip > bomb.html.gz
So they're just billions of nested div tags. Compresses just as well as repeated-single-character bombs in my experience.
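If you want to sanity-check the ratio yourself (numbers here are ballpark, not measured):

    # ~10 GiB of repeated "<div>\n" typically gzips down to roughly 10 MB,
    # i.e. around 1000:1, which is close to gzip's ceiling on repetitive input
    ls -lh bomb.html.gz
    zcat bomb.html.gz | wc -c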
pyman · 1d ago
This is a great idea.

LLM crawlers are ignoring robots.txt, breaching site terms of service, and ingesting copyrighted data for training without a licence.

We need more ideas like this!

bhaney · 1d ago
This is the same idea as in the article, just an alternative flavor of generating the zip bomb.

And I actually only serve this to exploit scanners, not LLM crawlers.

I've run a lot of websites for a long time, and I've never seen a legitimate LLM crawler ignore robots.txt. I've seen reports of that, but any time I've had a chance to look into it, it's been one of:

- The site's robots.txt didn't actually say what the author thought they had made it say

- The crawler had nothing to do with the crawler it was claiming to be; it just hijacked a user agent to deflect blame

It would be pretty weird, after all, for a company running a crawler to ignore robots.txt with hostile intent while also choosing to accurately ID itself to its victim.

shakna · 23h ago
Perplexity certainly was ignoring robots.txt [0]

Anthropic... their robots.txt handling requires a crawl delay to be defined, even though it's an optional extension. But whatever.

[0] https://www.wired.com/story/perplexity-is-a-bullshit-machine...

pyman · 22h ago
There's plenty of evidence to the contrary:

https://mjtsai.com/blog/2024/06/24/ai-companies-ignoring-rob...

_ache_ · 1d ago
Nice command line.
PeterStuer · 1d ago
For every one robots.txt that is genuinely configured, there are nine that make absolutely no sense at all.

Worse: GETting the robots.txt automatically flags you as a 'bot'!

So as a crawler that wants to respect the spirit of the robots.txt, not the inane letter that the cheapest junior webadmin for hire copy/pasted there from some Reddit comment, we now have to jump through hoops such as fetching the robots.txt from a separate VPN, etc.

Grimblewald · 21h ago
Well, robots.txt being an opaque, opt-out system was broken from the start. I've just started having hidden links and pages only mentioned in robots.txt, and any IP that tries those is immediately blocked for 24 hours. There is no reason to continue entertaining these companies.
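A rough sketch of that setup, assuming nginx's default combined access log and an nftables set created with `flags timeout` (plus a drop rule referencing it); the path and set names are made up:

    # robots.txt advertises a path no human and no well-behaved bot should ever fetch
    printf 'User-agent: *\nDisallow: /honeypot-9f3a/\n' >> /var/www/html/robots.txt

    # ban any IP that requests the trap path for 24 hours
    tail -F /var/log/nginx/access.log \
      | awk '$7 ~ /^\/honeypot-9f3a\// { print $1; fflush() }' \
      | while read -r ip; do
            nft add element inet filter banned "{ $ip timeout 24h }"
        done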
andrew_eu · 1d ago
I can imagine the large-scale web scrapers just avoid processing comments entirely, so while they may unzip the bomb, they could just discard the chunks that are inside a comment. The same trick could be applied to other elements in the HTML, though: semicolons in the style tag, some gigantic constant in inline JS, etc. If the HTML itself contained a gigantic tree of links to other zip bombs, that could also have an amplifying effect on the bad scraper.
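For instance, a variant where the filler lives in an inline `<style>` block rather than a comment could be generated the same way (untested sketch; sizes and file names are arbitrary):

    # same trick, but the bulk is junk CSS inside <style> instead of an HTML comment
    { echo '<html><head><style>'
      yes 'a{color:red}' | head -c 10G
      echo '</style></head><body>hello</body></html>'
    } | gzip > style-bomb.html.gz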
_ache_ · 1d ago
There are definitely improvements that can be made. The comment part is more about aesthetics; it isn't actually needed, you could have just put the zip chunk in a `div`, I guess.
chatmasta · 1d ago
Note: the submission link is not the zip bomb. It’s safe to click.
abirch · 1d ago
Sounds like something a person linking to a zip bomb would say :-D
slig · 1d ago
If you try to do that on a site with Cloudflare, what happens? Do they read the zip file and try to cache the uncompressed content to serve it with the best compression algorithm for a given client, or do they cache the compressed file and serve it "as is"?
bhaney · 1d ago
If you're doing this through cloudflare, you'll want to add the response header

    cache-control: no-transform
so you don't bomb cloudflare when they naturally try to decompress your document, parse it, and recompress it with whatever methods the client prefers.

That being said, you can bomb cloudflare without significant issue. It's probably a huge waste of resources for them, but they correctly handle it. I've never seen cloudflare give up before the end-client does.
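One way to check what actually reaches gzip-capable clients once the file is behind Cloudflare (URL is a placeholder):

    # you want to see content-encoding: gzip and cache-control: no-transform come back
    curl -sI -H 'Accept-Encoding: gzip' https://example.com/bomb.html \
      | grep -iE '^(content-encoding|cache-control|content-length)'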

uxjw · 23h ago
Cloudflare has free AI Labyrinths if your goal is to target AI. The bots follow hidden links to a maze of unrelated content, and Cloudflare uses this to identify bots. https://blog.cloudflare.com/ai-labyrinth/
cyanydeez · 21h ago
Do you think Meta AI's Llama 4 failed so badly because they ended up crawling a bunch of labyrinths?
Alifatisk · 10h ago
I dislike that the website's sidebar all of a sudden collapses during scrolling; it shifts all the content to the left in the middle of reading.
fdomingues · 10h ago
That content shift on page scroll is horrendous. Please don't do that; there is no need to auto-hide a sidebar.
Telemakhos · 1d ago
Safari 18.5 (macOS) throws an error WebKitErrorDomain: 300.
can16358p · 1d ago
Crashes Safari on iOS (not technically crashing the whole app, but the tab displays an internal WebKit error).
cooprh · 1d ago
Crashed 1password on safari haha
xd1936 · 1d ago
Risky click
ranger_danger · 1d ago
Did not crash Firefox nor Chrome for me on Linux.
AndrewThrowaway · 7h ago
Crashed Chrome tab on Windows instantly but Firefox is fine. It shows loading but pressing Ctrl + U even shows the very start of that fake HTML.
_ache_ · 1d ago
Perhaps you have very generous limits on RAM allocation per thread. I have 32 GB, 128 with swap, and it still crashes (silently on Firefox and with a dedicated error screen on Chrome).
throwaway127482 · 1d ago
Out of curiosity, how do you set these limits? I'm not the person you're replying to, but I'm just using the default limits that ship with Ubuntu 22.04
_ache_ · 1d ago
Usually in /etc/security/limits.conf. The field `as`, for address space, would be my guess, but I'm not sure; maybe `data`. The man page `man limits.conf` isn't very descriptive.
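For a quick experiment without editing limits.conf, the same cap can be set per shell with `ulimit` (value in KiB):

    # cap the address space (RLIMIT_AS) of this shell and its children at ~8 GiB;
    # anything launched from this shell, browser included, inherits the limit
    ulimit -v $((8 * 1024 * 1024))
    # the roughly equivalent persistent line in limits.conf would be:
    #   <user>  hard  as  8388608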
inetknght · 1d ago
> The man page `man limits.conf` isn't very descriptive.

Looks to me like it's quite descriptive. What information do you think is missing?

https://www.man7.org/linux/man-pages/man5/limits.conf.5.html

_ache_ · 16h ago
What is `data`? "Maximum data size (KB)". Is `address space limit (KB)` virtual or physical?

What is maximum file size in the context of a process?! I mean, what happens if a file is bigger? Maybe it can't write a file bigger than that, maybe it can't execute a file bigger than that.

I have a bunch of questions.

palmfacehn · 1d ago
Try creating one with deeply nested tags. Recursively adding more nodes via scripting is another memory waster. From there you might consider additional changes to the CSS that cause the document to repaint.
meinersbur · 1d ago
It will also compress worse, making it less like a zip bomb and more like a huge document. Nothing against that, but the article's trick is just to make a parser bail out early.
palmfacehn · 1d ago
For my usage, the compressed size difference with deeply nested divs was negligible.
esperent · 1d ago
It crashed the tab in Brave on Android for me.
johnisgood · 1d ago
It crashed the tab on Vivaldi (Linux).
Tepix · 1d ago
Imagine you're a crawler operator. Do you really have a problem with documents like this? I don't think so.
ChrisArchitect · 1d ago
Related:

Fun with gzip bombs and email clients

https://news.ycombinator.com/item?id=44651536