Open Source 1.7TB Dataset of What AI Crawlers Are Doing

10 points by catsanddogsart | 1 comment | 7/3/2025, 12:33:07 AM | huggingface.co ↗

Comments (1)

jauntywundrkind · 4h ago
This is potentially so awesome!

In the submission on Cloudflare adding AI blocking, one of my asks was for better tools to do rate limiting (rather than adding client pain with Anubis). The AI crawlers are alleged to be pretty merciless about changing their identity (IP address, user agent) if rate limited, but with data sets like this I feel like we stand a chance of analyzing that behavior and building rate limiter systems that can still function against these adversarial forces (without penalizing regular users); a rough sketch of what I mean follows. https://news.ycombinator.com/item?id=44443480
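Rough sketch of the kind of limiter I have in mind: key the rate limit on a behavioral fingerprint instead of IP or user agent, so rotating identity alone doesn't reset the bucket. The fingerprint features here (URL-path shape plus header ordering) are purely illustrative assumptions, not signals derived from the dataset:

```python
import time
from collections import defaultdict, deque

# Hypothetical sketch: rate-limit by a coarse behavioral fingerprint
# rather than IP/user agent, so identity rotation alone doesn't reset
# the limit. The features below (path shape, header ordering) are
# illustrative assumptions, not a known-good signal set.

WINDOW_SECONDS = 60
MAX_REQUESTS_PER_WINDOW = 120

buckets: dict[str, deque[float]] = defaultdict(deque)

def fingerprint(path: str, header_names: list[str]) -> str:
    """Collapse a request into a crude behavioral signature."""
    # Shape of the path (e.g. "/wiki/<x>" rather than the exact URL).
    path_shape = "/".join("<x>" if seg else "" for seg in path.split("/"))
    # Header ordering, which many crawler stacks keep constant.
    header_sig = ",".join(h.lower() for h in header_names)
    return f"{path_shape}|{header_sig}"

def allow(path: str, header_names: list[str]) -> bool:
    """Sliding-window limiter keyed on the fingerprint."""
    now = time.monotonic()
    window = buckets[fingerprint(path, header_names)]
    # Drop timestamps that have aged out of the window.
    while window and now - window[0] > WINDOW_SECONDS:
        window.popleft()
    if len(window) >= MAX_REQUESTS_PER_WINDOW:
        return False
    window.append(now)
    return True
```

The point of a dataset like this one would be figuring out which features actually cluster crawler traffic together while leaving ordinary users in their own, rarely-full buckets.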

It'd be awesome if we had an HTTP spec akin to GitHub's rate limit headers, so that we could just tell crawlers what rate we'll grant them. Sure, many crawlers would ignore it or try to bypass it, but there should in principle be some means for cooperation, some way to say what you will allow! We should be trying to coax good behavior, but there's no protocol to set bounds on what good is. GitHub has done real good here, imo, and something like this should be enshrined, to hopefully help get server loads back to reasonable levels and let some calm return; a sketch of what emitting such headers could look like is below.
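For reference, GitHub's REST API advertises its quota via x-ratelimit-limit / x-ratelimit-remaining / x-ratelimit-reset response headers (reset as Unix epoch seconds), and there's an IETF draft for standardized RateLimit fields heading the same direction. A minimal sketch of a server speaking that dialect, with made-up quota numbers and a single global bucket for simplicity:

```python
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

# Sketch of a server advertising its limits the way GitHub's REST API
# does (x-ratelimit-limit / -remaining / -reset). The quota numbers and
# the single global bucket are illustrative assumptions.

LIMIT = 60       # requests allowed per window (made-up number)
WINDOW = 3600    # window length in seconds
state = {"used": 0, "reset_at": time.time() + WINDOW}

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        now = time.time()
        if now >= state["reset_at"]:
            # Start a fresh window.
            state["used"], state["reset_at"] = 0, now + WINDOW

        remaining = max(0, LIMIT - state["used"])
        if remaining == 0:
            self.send_response(429)
            # Standard header telling well-behaved clients when to retry.
            self.send_header("Retry-After", str(int(state["reset_at"] - now)))
        else:
            state["used"] += 1
            remaining -= 1
            self.send_response(200)

        self.send_header("X-RateLimit-Limit", str(LIMIT))
        self.send_header("X-RateLimit-Remaining", str(remaining))
        self.send_header("X-RateLimit-Reset", str(int(state["reset_at"])))
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("localhost", 8080), Handler).serve_forever()
```

A cooperative crawler could watch X-RateLimit-Remaining and honor Retry-After, backing off before the server ever has to block it; that's the cooperation channel that's missing today.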