Especially for image data libraries, why not provide the images as a dump instead? There's no need to crawl 3 million images if the download button is right there. Put the file on a CDN or Google and you're golden
HumanOstrich · 1d ago
There are two immediate issues I see with that. First, you'll end up with bots downloading the dump over and over again. Second, for non-trivial amounts of data, you'll end up paying the CDN for bandwidth anyway.
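To put a rough number on the second point: 3 million images at, say, 1 MB apiece is about 3 TB per full pull, and at a typical ~$0.08/GB egress price that's on the order of $250 every time a single bot decides to re-download the entire dump.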
throwawayscrapd · 1d ago
I work on the kind of big online scientific database that this article is about.
100% of our data is available from a clearly marked "Download" page.
We still have scraper bots running through the whole site constantly.
We are not "golden".
atonse · 1d ago
How was this not a problem before with search engine crawlers?
Is this more of an issue with having 500 crawlers rather than any single one behaving badly?
Ndymium · 1d ago
Search engine crawlers generally respected robots.txt and limited themselves to a trickle of requests, likely scaled to the relative popularity of the website. These bots do neither: they will crawl anything they can access and send enough requests per second to drown your server, especially if you're a self-hoster running your own little site on a dinky server.
Search engines never took my site down, these bots did.
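For reference, a polite crawler honors something like this (the paths and delay value here are just examples); the new scrapers blow right past it:

    # robots.txt -- example values, adjust for your own site
    User-agent: *
    Crawl-delay: 10      # non-standard, but the search crawlers that support it back off
    Disallow: /search
    Disallow: /archive/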
atonse · 1d ago
Thanks for specifying the actual issue. We host a bunch of sites, and we're also seeing a spike in traffic, but we don't track user agents.
OutOfHere · 16h ago
Maybe stop using an inefficient PHP/JavaScript/TypeScript server and start using a more efficient Go/Rust/Nim/Zig server.
Ndymium · 9h ago
Personally, I'm specifically talking about Forgejo, which is written in Go but shells out to git for some operations. And the effect that was worse than pegging all the CPUs at 100% was the disk filling up with generated zip archives of every commit of every public repository.
Sure, we can say that Forgejo should have had better defaults for this (the default was to clear archives after 24 hours). And that your site should be fast, run on an efficient server, and not have any even slightly expensive public endpoints. But in the end that is all victim blaming.
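(For anyone hitting the same thing: as far as I can tell the relevant knobs live in app.ini. These names are from the Gitea/Forgejo config cheat sheet as I remember it, so double-check them against your version.)

    ; assumed settings -- verify against your Forgejo version
    [repository]
    ; stop generating source archives for download entirely, if your version supports it
    DISABLE_DOWNLOAD_SOURCE_ARCHIVES = true

    [cron.archive_cleanup]
    ; clean up generated archives much more aggressively than the 24h default
    ENABLED = true
    SCHEDULE = @every 1h
    OLDER_THAN = 1h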
One of the nice parts of the web for me is that as long as I have a public IP address, I can use any dinky cheapo server I have and run my own infra on it. I don't need to rely on big players to do this for me. Sure, sometimes there's griefers/trolls out there, but generally they don't bother you. No one was ever interested in my little server, and search engines played fair (and to my knowledge still do) while still allowing my site to be discoverable.
Dealing with these bots is the first time my server has been consistently attacked. I can handle them for now, but it is one more thing to deal with, and suddenly this idea of easy self-hosting on low-powered hardware is no longer so feasible. That makes me sad. I know what I should do about it, but I wish I didn't have to.
OutOfHere · 3h ago
That's why I require authorization for expensive endpoints. Everything else can often be just an inexpensive cache hit.
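A minimal sketch of that shape in Go, with a shared token and an in-memory page cache standing in for whatever you actually use (both are placeholder assumptions, not anyone's real setup):

    package main

    import (
        "log"
        "net/http"
        "os"
        "sync"
    )

    var (
        apiToken = os.Getenv("API_TOKEN") // assumed: a shared token handed to trusted users

        mu    sync.RWMutex
        cache = map[string][]byte{} // naive unbounded page cache; fine for a small site
    )

    // requireToken guards endpoints that are expensive to compute (exports, archives, searches).
    func requireToken(next http.HandlerFunc) http.HandlerFunc {
        return func(w http.ResponseWriter, r *http.Request) {
            if r.Header.Get("Authorization") != "Bearer "+apiToken {
                http.Error(w, "authorization required", http.StatusUnauthorized)
                return
            }
            next(w, r)
        }
    }

    // cached serves a stored copy when it has one and only renders on a miss.
    func cached(render func(path string) []byte) http.HandlerFunc {
        return func(w http.ResponseWriter, r *http.Request) {
            mu.RLock()
            body, ok := cache[r.URL.Path]
            mu.RUnlock()
            if !ok {
                body = render(r.URL.Path)
                mu.Lock()
                cache[r.URL.Path] = body
                mu.Unlock()
            }
            w.Write(body)
        }
    }

    func main() {
        renderPage := func(path string) []byte { return []byte("rendered " + path) } // placeholder renderer

        http.HandleFunc("/export", requireToken(func(w http.ResponseWriter, r *http.Request) {
            w.Write([]byte("expensive export goes here"))
        }))
        http.HandleFunc("/", cached(renderPage))
        log.Fatal(http.ListenAndServe(":8080", nil))
    }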
OutOfHere · 1d ago
Requiring PoW (proof of work) could take over for simple requests: reject each request until it includes a sufficient nonce. Unfortunately, this collective PoW would burden power grids even more, wasting energy, money, and computation on every transmission. Such is life. It would be a lot better to just upgrade the servers, but that's never going to be sufficient.
Yes, although the concept is simple enough in principle that a homegrown solution also works.
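A homegrown version really is small. Here is a sketch of the core check in Go; the challenge handling, difficulty, and nonce format are illustrative assumptions, not how Anubis or any particular tool does it:

    package main

    import (
        "crypto/sha256"
        "encoding/binary"
        "fmt"
        "math/bits"
    )

    // validNonce reports whether sha256(challenge || nonce) starts with `difficulty` zero bits.
    // The server runs this once per request; the cost falls almost entirely on the client.
    func validNonce(challenge string, nonce uint64, difficulty int) bool {
        buf := make([]byte, 8)
        binary.BigEndian.PutUint64(buf, nonce)
        sum := sha256.Sum256(append([]byte(challenge), buf...))
        zeros := 0
        for _, b := range sum {
            if b == 0 {
                zeros += 8
                continue
            }
            zeros += bits.LeadingZeros8(b)
            break
        }
        return zeros >= difficulty
    }

    // solve is what the client-side script would do: brute-force a nonce for the challenge.
    func solve(challenge string, difficulty int) uint64 {
        for n := uint64(0); ; n++ {
            if validNonce(challenge, n, difficulty) {
                return n
            }
        }
    }

    func main() {
        const challenge = "per-session-random-string" // the server would issue this, e.g. in a cookie
        const difficulty = 16                         // ~65k hashes on average: trivial for one visitor,
                                                      // expensive across millions of scraped pages
        n := solve(challenge, difficulty)
        fmt.Println("nonce:", n, "valid:", validNonce(challenge, n, difficulty))
    }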
Zardoz84 · 1d ago
We are wasting power on feeding statistics parrots, and now we need to waste additional power to avoid being DoSed by that feeding.
We would be better off without that useless waste of power.
treyd · 1d ago
What do you suppose we, as website owners, should do to keep our websites from being DoSed in the meantime? And how do you suppose we convince (or beg) the corporations running AI scraping bots to be better users of the web?
OutOfHere · 16h ago
How did you manage search engine crawlers for the past few decades? And why are AI crawlers functionally different? They aren't.
jaoane · 1d ago
Write proper websites that do not choke that easily.
HumanOstrich · 1d ago
So I just need a solution with infinite compute, storage, and bandwidth. Got it.
jaoane · 1d ago
That is not what I said and that is not what is necessary.
First of all, web developers should use Google and learn what a cache is. That way you don't need compute at all.
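For example, a few lines of micro-caching in front of the app absorb almost all anonymous traffic. A sketch assuming nginx proxying an app on port 3000 (paths and timings are made up):

    # inside the http {} block -- example values only
    proxy_cache_path /var/cache/nginx levels=1:2 keys_zone=pages:50m
                     max_size=1g inactive=30m use_temp_path=off;

    server {
        listen 80;

        location / {
            proxy_cache pages;
            proxy_cache_valid 200 301 5m;      # reuse the same rendered page for 5 minutes
            proxy_cache_use_stale error timeout updating;
            proxy_cache_lock on;               # only one request per URL hits the app on a miss
            proxy_pass http://127.0.0.1:3000;  # assumed app address
        }
    }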
throwawayscrapd · 1d ago
And maybe you could use Bing and learn what "cache eviction" is and why it happens when a crawler systematically hits every page on your site.
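A toy illustration of what such a crawl does to a plain LRU cache (the sizes are made up; the point is the sequential scan, not the numbers):

    package main

    import (
        "container/list"
        "fmt"
    )

    // A tiny LRU cache, just enough to show the problem.
    type lru struct {
        cap   int
        order *list.List               // front = most recently used
        items map[string]*list.Element // key -> element holding that key
    }

    func newLRU(capacity int) *lru {
        return &lru{cap: capacity, order: list.New(), items: map[string]*list.Element{}}
    }

    // get returns true on a hit; a miss inserts the key, evicting the oldest entry if full.
    func (c *lru) get(key string) bool {
        if el, ok := c.items[key]; ok {
            c.order.MoveToFront(el)
            return true
        }
        if c.order.Len() >= c.cap {
            oldest := c.order.Back()
            c.order.Remove(oldest)
            delete(c.items, oldest.Value.(string))
        }
        c.items[key] = c.order.PushFront(key)
        return false
    }

    func main() {
        c := newLRU(100)

        // The 20 pages real visitors actually read, now warm in the cache.
        hot := make([]string, 20)
        for i := range hot {
            hot[i] = fmt.Sprintf("/popular/%d", i)
            c.get(hot[i])
        }

        // A crawler walks every page on the site once, in order.
        for i := 0; i < 10000; i++ {
            c.get(fmt.Sprintf("/page/%d", i))
        }

        // The hot set has been evicted: every human request now misses and hits the backend.
        hits := 0
        for _, p := range hot {
            if c.get(p) {
                hits++
            }
        }
        fmt.Printf("hot-page hits after the crawl: %d/20\n", hits) // prints 0/20
    }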
OutOfHere · 16h ago
Maybe because it's an overly simplistic LRU cache, in which case a different eviction algorithm would be better.
It's funny really since Google and other search engines have been crawling sites for decades, but now that search engines have competition, sites are complaining.
OutOfHere · 1d ago
This should be an easy question for an engineer. It depends on whether the constraint is CPU, memory, the database, or the network.
zihotki · 1d ago
Technology can't solve a human problem; the constraints are budgets and available time.
OutOfHere · 14h ago
As of this year, AI has given people superpowers, doubling what they can achieve without it. Is this gain not enough? One can use it to run a more efficient web server.
OutOfHere · 16h ago
What human problem? Do tell: how have sites handled search engine crawlers for the past few decades? Why are AI crawlers functionally different? It makes no sense, because they aren't functionally different.
https://anubis.techaro.lol/