Open Source 1.7TB Dataset of What AI Crawlers Are Doing

10 points by catsanddogsart | 1 comment | 7/3/2025, 12:33:07 AM | huggingface.co ↗

Comments (1)

jauntywundrkind · 4h ago
This is potentially so awesome!

In the submission on Cloudflare adding AI blocking, one of my asks was for better tools to do rate limiting (rather than adding client pain with Anubis). The AI crawlers are alleged to be pretty merciless about changing their identity (IP address, user agent) if rate limited, but with data sets like this I feel like we stand a chance of analyzing that behavior and building rate limiter systems that can still function against these adversarial forces (without penalizing regular users); a rough sketch of what I mean follows. https://news.ycombinator.com/item?id=44443480
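Rough sketch of the kind of limiter I have in mind: key the rate limit on a behavioral fingerprint instead of IP or user agent, so rotating identity alone doesn't reset the bucket. The fingerprint features here (URL-path shape plus header ordering) are purely illustrative assumptions, not signals derived from the dataset:

```python
import time
from collections import defaultdict, deque

# Hypothetical sketch: rate-limit by a coarse behavioral fingerprint
# rather than IP/user agent, so identity rotation alone doesn't reset
# the limit. The features below (path shape, header ordering) are
# illustrative assumptions, not a known-good signal set.

WINDOW_SECONDS = 60
MAX_REQUESTS_PER_WINDOW = 120

buckets: dict[str, deque[float]] = defaultdict(deque)

def fingerprint(path: str, header_names: list[str]) -> str:
    """Collapse a request into a crude behavioral signature."""
    # Shape of the path (e.g. "/wiki/<x>" rather than the exact URL).
    path_shape = "/".join("<x>" if seg else "" for seg in path.split("/"))
    # Header ordering, which many crawler stacks keep constant.
    header_sig = ",".join(h.lower() for h in header_names)
    return f"{path_shape}|{header_sig}"

def allow(path: str, header_names: list[str]) -> bool:
    """Sliding-window limiter keyed on the fingerprint."""
    now = time.monotonic()
    window = buckets[fingerprint(path, header_names)]
    # Drop timestamps that have aged out of the window.
    while window and now - window[0] > WINDOW_SECONDS:
        window.popleft()
    if len(window) >= MAX_REQUESTS_PER_WINDOW:
        return False
    window.append(now)
    return True
```

The point of a dataset like this one would be figuring out which features actually cluster crawler traffic together while leaving ordinary users in their own, rarely-full buckets.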

It'd be awesome if we had an HTTP spec akin to GitHub's rate limit headers, so that we could just tell crawlers what rate we'll grant them. Sure, many crawlers would ignore it or try to bypass it, but there should in principle be some means for cooperation, some way to say what you will allow! We should be trying to coax good behavior, but there's no protocol to set bounds on what good is. GitHub has done real good here, imo, and something like this should be enshrined, to hopefully help get server loads back to reasonable levels and let some calm return; a sketch of what emitting such headers could look like is below.
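For reference, GitHub's REST API advertises its quota via x-ratelimit-limit / x-ratelimit-remaining / x-ratelimit-reset response headers (reset as Unix epoch seconds), and there's an IETF draft for standardized RateLimit fields heading the same direction. A minimal sketch of a server speaking that dialect, with made-up quota numbers and a single global bucket for simplicity:

```python
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

# Sketch of a server advertising its limits the way GitHub's REST API
# does (x-ratelimit-limit / -remaining / -reset). The quota numbers and
# the single global bucket are illustrative assumptions.

LIMIT = 60       # requests allowed per window (made-up number)
WINDOW = 3600    # window length in seconds
state = {"used": 0, "reset_at": time.time() + WINDOW}

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        now = time.time()
        if now >= state["reset_at"]:
            # Start a fresh window.
            state["used"], state["reset_at"] = 0, now + WINDOW

        remaining = max(0, LIMIT - state["used"])
        if remaining == 0:
            self.send_response(429)
            # Standard header telling well-behaved clients when to retry.
            self.send_header("Retry-After", str(int(state["reset_at"] - now)))
        else:
            state["used"] += 1
            remaining -= 1
            self.send_response(200)

        self.send_header("X-RateLimit-Limit", str(LIMIT))
        self.send_header("X-RateLimit-Remaining", str(remaining))
        self.send_header("X-RateLimit-Reset", str(int(state["reset_at"])))
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("localhost", 8080), Handler).serve_forever()
```

A cooperative crawler could watch X-RateLimit-Remaining and honor Retry-After, backing off before the server ever has to block it; that's the cooperation channel that's missing today.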