Whole-genome ancestry of an Old Kingdom Egyptian (nature.com)

A nice attempt and another layer for the swiss cheese of technology it will take to try and ease the burden AI companies are putting on people trying to run websites.

I'd be cautious about relying on just the good will of Cloudflare.

It's unfortunate that we need honeypots and tarpits to trap AI scrapers just so that our hosting bills don't get hosed. It's taking a good chunk of value out of running a site on the Internet.

OutOfHere · 9h ago

Feel free to waste your expensive outgoing bandwidth running malware. It is a genius idea really from the cloud companies to enrich their balances.

Definitely don't rewrite your web server more efficiently in Rust instead. /s

Retric · 8h ago

Serving poisoned text can be so cheap it’s effectively free as long as you don’t give them a lot of links.

Mars008 · 2h ago

Yeh, and say goodby to google search. You didn't want to be there anyway, right?

techjamie · 6h ago

Many of these tarpits deliberately serve the data at an excruciatingly low speed to ease the burden on the server resources. It's cheaper than quickly serving the same crawlers your entire website at max speed constantly.

OutOfHere · 2h ago

If we are going for cheaper, how is it cheaper than an HTTP 429 error? It's not.

TekMol · 9h ago

Currently, what I do is that when an IP requests insane amounts of URLs on my server (especially when its all broken urls causing 404s) I look up the IP and then block the whole organization.

For example today some bot from the range 14.224.0.0-14.255.255.255 got crazy and caused a storm of 404s. Dozens per second for hours on end. So I blocked the range like this:

iptables -A INPUT -m iprange --src-range 14.224.0.0-14.255.255.255 -j DROP

That's probably not the best way and might block significant parts of whole countries. But at least it keeps my service alive for now.

What do others here do to protect their servers?

PaulDavisThe1st · 8h ago

At git.ardour.org, we block any attempt to retrieve a specific commit. Trying to do so triggers fail2ban putting the IP into blocked status for 24hrs. They also get a 404 response.

We wouldn't mind if bots simply cloned the repo every week or something. But instead they crawl through the entire reflog. Fucking stupid behavior, and one that has cost us an extra $50/month even with just the 404.

GGO · 8h ago

I like rate-limiting. I know none of my users will need more than 10qps. I set that for all routes, and all bots get throttled. I can also have much higher rate-limit for authenticated users. Have not had bots slamming me - they just get 429s

azangru · 9h ago

> Cloudflare, along with a majority of the world's leading publishers and AI companies, is changing the default to block AI crawlers

How is this done, technically? User agent checking? IP range blocking?

yablak · 9h ago

Details here: https://blog.cloudflare.com/introducing-pay-per-crawl/

grg0 · 9h ago

This requires good faith on behalf of the crawler? So it's DOA; why even bother implementing this?

Also, what a piece of zero-trust shit the web is becoming thanks to a couple of shit heads who really need to extract monetary value out of everything. Even if this non-solution were to work, the prospect of putting every website behind Cloudsnare is not a good one anyway.

What the web needs right now, to be honest, is machetes. In ample quantity. Tell me who's running that crawler that is bothering you and I will put them to the sword. They won't even need to present a JWK in the header.

xg15 · 8h ago

Maybe I didn't understand the proposal completely yet, but wouldn't the crawler only have to cooperate (send the right headers, implement that auth framework, etc) if they want to pay?

The standard response to a crawler is a 402 Payment Required response, probably as a result of an aggressive bot detection.

So essentially, it's turning a site's entire content into an API: Either sign up for an API key or get blocked.

The question remains though how well they will be able to distinguish bot traffic from humans - also, will they make an exception for search engines?

grg0 · 6h ago

That is not what I understood, and it sounds terrible. What if you're not a crawler but random Joe surfing the internet? Clearly Joe should see content without payment? So they need some way to tell the crawler and Joe apart, and presumably they require the crawler to set certain request headers. The headers aren't just to issue the payment, it's to identify the crawler in the first place?

AkshatM · 4h ago

Joe will be fine. Cloudflare is pretty good at differentiating humans from bot traffic - see how we do it here: https://developers.cloudflare.com/turnstile/

The idea behind the headers is to allow bots to bypass automatic bot filtering, not blockade all regular traffic. In other words:

- we block bots (the website owner can configure how aggressively we block) - unless they say they're from an AI crawler we've vetted, as attested by the signature headers - in which case we let them pay - and then they get to access the content

(Disclosure: I wrote the web bot auth implementation Cloudflare uses for pay per crawl)

xg15 · 5h ago

The writeup doesn't talk about actively misbehaving crawlers a lot, but this bit implies for me that the headers are for the "happy path", I.e. crawlers wanting to pay:

> Each time an AI crawler requests content, they either present payment intent via request headers for successful access (HTTP response code 200), or receive a 402 Payment Required response with pricing.

I don't see how it would make sense otherwise, as the requirements for crawlers include applying for a registration with Cloudflare.

Who in their right mind would jump through registration hoops only so they can not access a site? This wouldn't even keep away the crawlers that are operating today.

I agree there has to be some way to distinguish crawlers from regular users, but the only way I can see how this could be done is with bot detection algorithms.

...which are imperfect and will likely flag some legitimate human users as bots. So yes, this will probably leading to web browsing becoming even more unpleasant.

cryptonector · 8h ago

It's Cloudflare. That means they are good at DoS and DDoS protection. AI crawlers are basically DoS agents. I think CF can start with an honor system that also has attached to it the implied threat to block crawlers from all CF hosted content, and that is a pretty big hammer to hit the abusers with.

So I'm cautiously optimistic. Well, I suppose pessimistic too: if this works what this will mean is that all contents will end up moving into big player hosting like CF.

rorylaitila · 9h ago

It's unfortunate but I think the ship has sailed. Good on them for trying but I don't see it working.

I am advising all my clients away from informational content which is easily remixed by LLMs. And I'm not bothering anymore with targeting informational search queries on my own sites.

I'm doubling down on community and interaction. Finding ways to interact with original content with smaller audiences, rather than produce information for a global search audience.

mhuffman · 8h ago

So are they going to try and IP gate them or trust that AI companies that literally stole the info they used to make the base models will now respect robots.txt entries?

trhway · 8h ago

Every one likes net neutrality when the one is benefitting from it, yet the one immediately jumps at the opportunity to break net neutrality on their services if it allows to increase profit by price discrimination (which may take a shape of extracting a rent from some subset of consumers like seems to be in this case) .

mzs · 9h ago

Is there a cut that cloudfare gets or is that behind an NDA?

yladiz · 8h ago

Would this be preferable to something like Anubis?

jmole · 9h ago

> Imagine an AI engine like a block of swiss cheese. New, original content that fills one of the holes in the AI engine’s block of cheese is more valuable than repetitive, low-value content that unfortunately dominates much of the web today.

Great statement in theory - but in practice, the whole people-as-a-service industry for AI data generation is IMO more damaging to the knowledge ecosystem than open data. e.g. companies like pareto.ai

"Proprietary data for pennies on the dollar" is the late-stage capitalism equivalent of the postdoctoral research trap.

ChrisArchitect · 8h ago

Discussion:

Cloudflare to introduce pay-per-crawl for AI bots

https://news.ycombinator.com/item?id=44432385

tiahura · 8h ago

I thought the web was supposed to be free and open?

Havoc · 8h ago

That ship has sailed

ramesh31 · 8h ago

This will play out precisely like the "do not track" header; bad actors will create an arms race that makes anyone respecting it into a chump.

teeray · 8h ago

If the only way to escape a proof of work tarpit is to pay the toll, you’re either going to pay in money or time & compute.

accountforih · 10h ago

I don’t understand, companies that want to crawl can pay for services like brightdata or crawlbase, the barriers don’t apply to them

This ends up hurting individuals and small companies that are harmless and cannot afford to pay

Whole-genome ancestry of an Old Kingdom Egyptian (nature.com)

Trans-Taiga Road:The farthest you can get from a town on a road in North America (jamesbayroad.com)

Exploiting the IKKO Activebuds “AI powered” earbuds (2024) (blog.mgdproductions.com)

More Efficient Thermoelectric Cooling (jhuapl.edu)

Don’t use “click here” as link text (2001) (w3.org)

ASCIIMoon: The moon's phase live in ASCII art (asciimoon.com)

That XOR Trick (2020) (florian.github.io)

Couchers is officially out of beta (couchers.org)

Show HN: CSS generator for a high-def glass effect (glass3d.dev)

What to build instead of AI agents (decodingml.substack.com)

Vitamin C Boosts Epidermal Growth via DNA Demethylation (jidonline.org)

Physicists Start to Pin Down How Stars Forge Heavy Atoms (quantamagazine.org)

AI note takers are flooding Zoom calls as workers opt to skip meetings (washingtonpost.com)

The Zen of Quakerism (2016) (friendsjournal.org)

Gene therapy restored hearing in deaf patients (news.ki.se)

The Evolution of Caching Libraries in Go (maypok86.github.io)

Sony's Mark Cerny Has Worked on "Big Chunks of RDNA 5" with AMD (overclock3d.net)

Features of D That I Love (bradley.chatha.dev)

A Higgs-Bugson in the Linux Kernel (blog.janestreet.com)

Conversations with a Hit Man (magazine.atavist.com)

Websites hosting major US climate reports taken down (apnews.com)

A list is a monad (alexyorke.github.io)

Evidence of a 12,800-year-old shallow airburst depression in Louisiana (scienceopen.com)

MindsDB (YC W20) is hiring an AI solutions engineer (job-boards.greenhouse.io)

Escher's art and computer science (github.com)

Private sector lost 33k jobs, badly missing expectations of 100k increase (cnbc.com)

Efficient set-membership filters and dictionaries based on SAT (github.com)

WebAssembly Troubles part 4: Microwasm (2019) (troubles.md)

More assorted notes on Liquid Glass (morrick.me)

Cloudflare Introduces Default Blocking of A.I. Data Scrapers (nytimes.com)

The Unseen Fury of Solar Storms (noemamag.com)

The "personal computer" model scales better than the "terminal" model (utcc.utoronto.ca)

Reuleaux Kinematic Mechanisms Collection (digital.library.cornell.edu)

NYT to start searching deleted ChatGPT logs after beating OpenAI in court (arstechnica.com)

Nightmares Linked to Faster Ageing and Premature Mortality (emjreviews.com)

BCPL (2022) (cl.cam.ac.uk)

TikTok is being flooded with racist AI videos generated by Google's Veo 3 (arstechnica.com)

A proof-of-concept neural brain implant providing speech (arstechnica.com)

Meet Bionode (microship.com)

What I learned gathering nootropic ratings (2022) (troof.blog)

Hexagon fuzz: Full-system emulated fuzzing of Qualcomm basebands (srlabs.de)

Huawei releases an open weight model trained on Huawei Ascend GPUs (arxiv.org)

Azure API vulnerability and roles misconfiguration compromise corporate networks (token.security)

Super Simple "Hallucination Traps" to detect interview cheaters

Spider-Robot for Surgical Interventions (2024) (neosciencehub.com)

I'm dialing back my LLM usage (zed.dev)

Sparsity Is Cool (tilderesearch.com)

Chatbot Flow Editor – Visual tool for designing conversation flows (github.com)

Flint, Michigan replaces most lead pipes 10 years after Michigan water crisis (nbcnews.com)

How large are large language models? (gist.github.com)

Content Independence Day: no AI crawl without compensation

Comments (29)