Crawlers impact the operations of the Wikimedia projects

115 edward 66 5/2/2025, 12:29:30 PM diff.wikimedia.org ↗

Comments (66)

ianso · 67d ago

The dumbest part of this is that all Wikimedia projects already export a dump for bulk downloading: https://dumps.wikimedia.org/

So it's not like you need to crawl the sites to get content for training your models...

StableAlkyne · 67d ago

I tried using one of these dumps a year ago (wanted to play around and see what visualizations I could come up with based on text and the links between pages) and it was an incredibly unintuitive process.

It's not clear which files you need, and the site itself is (or at least, was when I tried) "shipped" as some gigantic SQL scripts to rebuild the database with enough lines that the SQL servers I tried gave up reading them, requiring another script to split it up into chunks.

Then when you finally do have the database, you don't have a local copy of Wikipedia. You're missing several more files, for example category information is in a separate dump. Also you need wiki software to use the dump and host the site. After a weekend of fucking around with SQL, this is the point where I gave up and just curled the 200 or so pages I was interested in.

I'm pretty sure they want you to "just" download the database dump and go to town, but it's such a pain in the ass that I can see why someone else would just crawl it.

jsheard · 67d ago

> I tried using one of these dumps a year ago (wanted to play around and see what visualizations I could come up with based on text and the links between pages) and it was an incredibly unintuitive process.

More recently they starting putting the data up on Kaggle in a format which is supposed to be easier to ingest.

https://enterprise.wikimedia.com/blog/kaggle-dataset/

PeterStuer · 67d ago

More recently is very recently, not enough time yet for data collectors to evaluate changing processes.

StableAlkyne · 67d ago

Good timing to learn about this, given that it's Friday. Thanks! I'll check it out

GuinansEyebrows · 67d ago

I think there are engineers working for crawler companies who are paid well enough to figure out how to do this without kneecapping the most well-known noncommercial projects still surviving on the capitalized internet.

neets · 67d ago

Have you tried any of the ZIM file exports?

https://dumps.wikimedia.org/kiwix/zim/wikipedia/

Philpax · 67d ago

Yeah, it's a bit confusing at first to navigate. Luckily, they offer XML dumps that aren't too bad to work with:

1. Go to https://dumps.wikimedia.org/enwiki/latest/ (or a date of your choice in /enwiki)

2. Download https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-page... and https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-page.... The first file is a bz2-multistream-compressed dump of a XML containing all of English Wikipedia's text, while the second file is an index to make it easier to find specific articles.

3. You can either:

  a. unpack the first file
  b. use the second file to locate specific articles within the first file; it maps page title -> file offset for the relevant bz2 stream
  c. use a streaming decoder to process the entire Wiki without ever decompressing it wholly

4. Once you have the XML, getting at the actual text isn't too difficult; you should use a streaming XML decoder to avoid as much allocation as possible when processing this much data.

The XML contains pages like this:

    <page>
      <title>AccessibleComputing</title>
      <ns>0</ns>
      <id>10</id>
      <redirect title="Computer accessibility" />
      <revision>
        <id>1219062925</id>
        <parentid>1219062840</parentid>
        <timestamp>2024-04-15T14:38:04Z</timestamp>
        <contributor>
          <username>Asparagusus</username>
          <id>43603280</id>
        </contributor>
        <comment>Restored revision 1002250816 by [[Special:Contributions/Elli|Elli]] ([[User talk:Elli|talk]]): Unexplained redirect breaking</comment>
        <origin>1219062925</origin>
        <model>wikitext</model>
        <format>text/x-wiki</format>
        <text bytes="111" sha1="kmysdltgexdwkv2xsml3j44jb56dxvn" xml:space="preserve">#REDIRECT [[Computer accessibility]]

    {{rcat shell|
    {{R from move}}
    {{R from CamelCase}}
    {{R unprintworthy}}
    }}</text>
        <sha1>kmysdltgexdwkv2xsml3j44jb56dxvn</sha1>
      </revision>
    </page>

so all you need to do is get at the `text`.

ks2048 · 67d ago

The bigger problem is this is wikitext markup. It would be helpful if they also provide HTML and/or plain text.

I know there are now a couple pretty-good wikitext parsers, but for years, it was a bigger problem. The only "official" one was the huge php app itself.

Philpax · 67d ago

Oh, it's godawful; the format is a crime against all things structured. I use `parse-wiki-text-2` [0], which is a fork of `parse-wiki-text`, a Rust library by an author who has now disappeared into the wind. (Every day that I parse Wikipedia, I thank him for his contributions, wherever he may be.)

I wrote another Rust library [1] that wraps around `parse-wiki-text-2` that offers a simplified AST that takes care of matching tags for you. It's designed to be bound to WASM [2], which is how I'm pretty reliably parsing Wikitext for my web application. (The existing JS libraries aren't fantastic, if I'm being honest.)

[0]: https://github.com/soerenmeier/parse-wiki-text-2

[1]: https://github.com/philpax/wikitext_simplified

[2]: https://github.com/genresinspace/genresinspace.github.io/blo...

mjevans · 67d ago

What they need to do is have 'major edits' push out an updated static render physical file like old school processes would. Then either host those somewhere as is, or also in a compressed format. (E.G. compressed weekly snapshot retained for a year?)

Also make a cname from bots.wikipedia.org to that site.

hombre_fatal · 67d ago

This probably is about on-demand search, not about gathering training data.

Crawling is more general + you get to consume it in its reconstituted form instead of deriving it yourself.

Hooking up a data dump for special-cased websites is much more complicated than letting LLM bots do a generalized on-demand web search.

Just think of how that logic would work. LLM wants to do a web search to answer your question. Some Wikimedia site is the top candidate. Instead of just going to the site, it uses this special code path that knows how to use https://{site}/{path} to figure out where {path} is in {site}'s data dump.

black_puppydog · 67d ago

Yeah. Much easier to tragedy-of-the-commons the hell out of what is arguably one of the only consistently great achievements on the web...

DarkWiiPlayer · 67d ago

> This increase is not coming from human readers, but largely from automated programs that scrape the Wikimedia Commons image catalog of openly licensed images to feed images to AI models.

Sounds like the problem is not the crawling itself but downloading multimedia files.

The article also explains that these requests are much more likely to request resources that aren't cached, so they generate more expensive traffic.

mtmail · 67d ago

I need to work with the dump to extract geographic information. Most mirrors are not functioning, take weeks to catch up or block, or only mirror english wikipedia. Every other month I find a work-around. It's not easy to work with the full dumps, but I guess/hope easier than crawling wikipedia website itself.

Ekaros · 67d ago

Why use screwdriver when you have sledge hammer and everything is a nail?

bombcar · 67d ago

nAIl™ - the network AI library. For sledgehammering all your screwdriver needs.

qudat · 67d ago

I thought that as well but maybe this is more for indexing search engines? In which case you want more realtime updates?

cubefox · 67d ago

I don't see an obvious option to download all images from Wikipedia Commons. As the post clearly indicates, the text is not the issue here, its the images.

mistrial9 · 67d ago

it seems like Wikimedia Foundation has always been protective of the image downloads since the 90s. So many drunken midnight scripters or new urban undergrad CEOs discovers that they can download cool images fairly quickly. AFAIK there has always been some kind of text corpus available in bulk because it is part of the mission of Wikipedia. But the image gallery is big on disk, big bandwidth compared to TEXT, and low hanging target for the uninformed, greedy and etc.

indrora · 67d ago

The Wikimedia nonfree image limitations have been a pain in my ass for years.

For those unfamiliar: The images that are marked NonFree must be smaller than 1Megapixel. 1155 X 866. In practice, 1024x768 is around the maximum size.

bombcar · 67d ago

This is what torrents are built for.

A torrent of all images updated once a year would probably do quite well.

edoceo · 67d ago

Provided you have enough seed nodes - not free.

bombcar · 67d ago

I have excess bandwidth in various places - would be happy to seed.

OtherShrezzing · 67d ago

This phenomenon is the wilful destruction of valuable global commons at the hands of a very small number of companies. The number of individually accountable decision-makers driving this destruction is probably in the dozens or low hundreds.

blablabla123 · 67d ago

Everybody and their dog are writing AI scrapers with Captcha passing functionality these days. None of that is new but the scale is unseen.

The thing is also that corporate scrapers are in comparison even the good guys, they respect robots.txt, have properly set user agents etc. Others might do neither and from residential IPs.

The issue isn't even new but any proper solution attempts have been postponed because CDNs seem such an easy solution.

whoopdedo · 67d ago

This has become a concern for the Arch Linux wiki which now makes you pass a proof-of-work challenge to read it. Which my anti-fingerprinting browser fails at every time. Putting a burden on human readers that will be only a minor temporary annoyance for the bots. Think about it, the T in CAPTCHA stands for "Turing". What is the design goal of AI? To create machines that can pass a Turing test.

I fear the end state of this game is the death of the anonymous internet.

PaulDavisThe1st · 67d ago

The point of (correctly done) proof-of-work is not to require Turing-level impersonation. It is to create a cost to a trawler that is going to hit thousands or more of your pages, and almost no cost to a human user.

Problem is, as you've discovered, it can have the cost that anti-fingerprinting browsers can't do the required work.

whoopdedo · 67d ago

These are AI bots. Computational capacity is not a limiting factor. I'd argue that my desktop consumer PC is less capable of efficiently solving a PoW than a multi-GPU cluster in a data center.

Even if, as you say, crawlers will hit the PoW thousands of times more, the only way to make it a barrier is if the cost is higher than the profit to be gained. Otherwise it's merely an expense to be passed on to the customer.

PaulDavisThe1st · 67d ago

Anubis accomplishes this, AFAIU.

tfederman · 67d ago

A while back I wrote up a way to turn the big Wikipedia XML dump into a database. Not a generic table with articles but thousands of tables, one for each article "type". I'm not sure if this is still the best way to go about it.

https://feder001.com/exploring-wikipedia-as-a-database-part-...

cubefox · 67d ago

Contrary to what most commenters assume, the high bandwidth usage is not coming from scraping text, but images. They are pretty clear about it:

> Since January 2024, we have seen the bandwidth used for downloading multimedia content grow by 50%. This increase is not coming from human readers, but largely from automated programs that scrape the Wikimedia Commons image catalog of openly licensed images to feed images to AI models.

mschuster91 · 67d ago

There's two distinct problems caused by AI scrapers:

1. Bandwidth consumption - that's on scrapers downloading multimedia files

2. CPU resource exhaustion - AI scrapers don't take contextual clues into account. They just blindly follow each and every link they can find, which means that they hit a lot of pages that aren't cached but re-generated for each call. That's stuff like the article history but especially the version delta pages. These are very expensive to generate and are so rarely called that it doesn't make sense to cache them.

alt227 · 67d ago

I havent read the article, but why dont they just put it behind a free login with bandwidth restrictions per day or something?

cubefox · 67d ago

You want images only be available to users with a Wikipedia login? This would mean by far most people would no longer see images in Wikipedia articles.

alt227 · 67d ago

No, I am saying what a lot of other people are. Force bots into API access, which can then be authenticated and restricted by bandwidth or calls per day. Then block bot access to html pages. Nobody looses their images, and bots are limited in stealing bandwidth.

joepie91_ · 67d ago

Have you actually tried blocking these scraper bots? The whole problem is that if you do, they start impersonating normal browsers from residential IPs instead. They actively evade countermeasures.

alt227 · 62d ago

Isnt everything measures and countermeasures though?

As far as I am aware there is no such thing as a silver bullet anywhere when it comes to security.

Its like moving your SSH port from port 22 to some other random one. Will it stop advanced scripts from scanning your server and finding it? No, but it sure as hell will cut down the noise of unsophisticated connections which means you can focus on the more tough ones.

dmitrygr · 67d ago

Finally, a use for CFAA?

SoftTalker · 67d ago

OK, but what's the downside?

fpgaminer · 67d ago

Maybe this is an insane idea, but ... how about a spider P2P network?

At least for local AIs it might not be a terrible idea. Basically a distributed cache of the most common sources our bots might pull from. That would mean only a few fetches from each website per day, and then the rest of the bandwidth load can be shared amongst the bots.

Probably lots of privacy issues to work around with such an implementation though.

gnabgib · 67d ago

Previous discussions:

(91 points, 30 days ago, 101 comments) https://news.ycombinator.com/item?id=43555898

(49 points, 29 days ago, 45 comments) https://news.ycombinator.com/item?id=43562005

kordlessagain · 67d ago

Wikimedia's recent post completely misses the mark. What they're experiencing isn't merely bulk data collection – it's the unauthorized transformation of their content infrastructure into a free API service for commercial AI tools.

It's not crawling for training that is the issue...and it's an over simplification stating that AI companies are "training" on someone's data.

When systems like Claude and ChatGPT fetch Wikimedia content to answer user queries in real time, they're effectively using Wikimedia as an API – with zero compensation, zero attribution, and zero of the typical API management that would come with such usage. Each time a user asks these AI tools a question, they may trigger fresh calls to Wikimedia servers, creating a persistent, on-demand load rather than a one-time scraping event.

The distinction is crucial. Traditional search engines like Google crawl content, index it, and then send users back to the original site. These AI systems instead extract the value without routing any traffic back, breaking the implicit value exchange that has sustained the web ecosystem.

Wikimedia's focus on technical markers of "bot behavior" – like not interpreting JavaScript or accessing uncommon pages – shows they're still diagnosing this as a traditional crawler problem rather than recognizing the fundamental economic imbalance. They're essentially subsidizing commercial AI products with volunteer-created content and donor-funded infrastructure.

The solution has been available all along. HTTP 402 "Payment Required" was built into the web's foundation for exactly this scenario. Combined with the Lightning Network's micropayment capabilities and the L402 protocol implementation, Wikimedia could:

  - Keep content free for human users
  
  - Charge AI services per request (even fractions of pennies would add up)
  
  - Generate sustainable infrastructure funding from commercial usage
  
  - Maintain their open knowledge mission while ending the effective subsidy

Tools like Aperture make implementation straightforward – a reverse proxy that distinguishes between human and automated access, applying appropriate pricing models to each.

Instead of leading the way toward a sustainable model for knowledge infrastructure in the AI age, Wikimedia is writing blog posts about traffic patterns. If your content is being used as an API, the solution is to become an API – with all the management, pricing, and terms that entails. Otherwise, they'll continue watching their donor resources drain away to support commercial AI inference costs.

I suspect several factors contribute to this resistance:

Ideological attachment to "free" as binary rather than nuanced: Many organizations have built their identity around offering "free" content, creating a false dichotomy where any monetization feels like betrayal of core values. They miss that selective monetization (humans free, automated commercial use paid) could actually strengthen their core mission.

Technical amnesia: The web's architects built payment functionality into HTTP from the beginning, but without a native digital cash system, it remained dormant. Now that Bitcoin and Lightning provide the missing piece, there's institutional amnesia about this intended functionality.

Complexity aversion: Implementing new payment systems feels like adding complexity, when in reality it simplifies the entire ecosystem by aligning incentives naturally rather than through increasingly byzantine rate-limiting and bot-detection schemes.

The comfort of complaint: There's a certain organizational comfort in having identifiable "villains" (bots, crawlers, etc.) rather than embracing solutions that might require internal change. Blog posts lamenting crawler impacts are easier than implementing new systems.

False democratization concerns: Some worry that payment systems would limit access to those with means, missing that micropayments precisely enable democratization by allowing anyone to pay exactly for what they use without arbitrary gatekeeping.

pjc50 · 67d ago

Micropayments are never the solution, and trying to charge for something built by volunteers would indeed detonate the social contract.

But so does unrestricted AI use. I guess the nice things era is over.

kordlessagain · 66d ago

The irony is that Wikimedia already pays for the cost of serving pages — it's just invisible to users because donors cover it. Micropayments via Lightning aren't about "charging for knowledge," they're about sustainable access models in the face of high-frequency bot loads (especially from AI). If AI crawlers are consuming massive resources, it's not unreasonable to explore accountability — not for readers, but for automated extractors.

And, even better, those micropayments could be shared with the volunteers. How about a big party for them, or gifts on Amazon for good behavior? How about a simple birthday card? There's a lot that can be done with resources like this!

evertedsphere · 67d ago

why not just post the prompt you used and let readers feed it into the language model themselves

kordlessagain · 66d ago

Because I evolved the prompt from something that was off to something that made sense. I don't see using any resource as a problem as long as the content is bang on.

The "prompt" as you call it, isn't a sinlge prompt. It's a long discussion that includes the article (which I copy pasta'd) and other references I've worked on recently (I write crawlers and have for years)

greatgib · 67d ago

From what I understand, the problem is not really from scrapers that would "pound" the service by being thousands of requesting the same things multiple times, but that they are scraping the whole of Wikipedia including heavy content, like video that is not accessed often.

If that is the case, I would think that it is a little bit concerning that the model of Wikipedia is based on having most resources not accessed.

Otherwise, if my understanding is wrong, it would mean that AI company are constantly scraping the same content for change like a search engine would do, but it does little sense to me as I easily guess that models are only trained once every few months at most.

And also I don't understand how they were not already encountering this problem with the existing constant crawling of search engines...

smjburton · 67d ago

> When an article is requested multiple times, we memorize – or cache – its content in the datacenter closest to the user. If an article hasn’t been requested in a while, its content needs to be served from the core data center.

Maybe a similar system needs to be set up so that bot requests need to present their latest cache or hash ID of the requested content before a full request can be granted. This way, if the local cache is recent, it doesn't burden the server with requests for content they've already seen, and they can otherwise serve their users information based on the version they have stored locally.

OtherShrezzing · 67d ago

This already exists in the form of ‘if-modified-since’ http cache property. A requester can ask the server if the data they currently have is stale within the current spec.

In my experience, the problematic crawlers choose not to implement this feature.

cubefox · 67d ago

The comment below the post makes a lot of sense:

> I suggest Wikimedia to distribute Wikimedia Commons content using tape drive. The largest tape drive (IBM 3592) can store 50 TB content. The total size of Wikimedia Commons is 610.4 TB. So it needs less than 15 tapes to store the entire site. You can lend the tapes to any company want your content, if they promise to return a in period of time.

alt227 · 67d ago

Except that as soon as the tapes are written they are obselete and the data is stale, and so they realistically cannot be accepted as a valid copy of the data.

cubefox · 67d ago

Don't know about that, but the necessary tape drive (IBM TS1170 according to Wikipedia) seems to be very hard to come by, as I couldn't even find a price via Google. It might be a better option to put the data on ~24 Seagate HDDs with a capacity of 26 TB each.

ashoeafoot · 66d ago

Dont block them, feed them delurous feverdreamed llamassporn..

periodjet · 67d ago

Nothing speaks quite so clearly to the ideological lean of the Wikimedia Foundation as their choice of social media links: “Share on: Mastodon, Bluesky”

varjag · 67d ago

Oh no Wikipedia isn't using the social media owned by the guy who called for boycott and penalizing of Wikipedia.

ks2048 · 67d ago

Yes, the site dedicated to open data should prefer social media that has an open data model.

PeterStuer · 67d ago

Here's what I don't get. Wikimedia claims to be a nonprofit for spreading knowledge. They sit on nearly half a billion of assets.

Every customer would prefer a firehose content delta over having to scrape for diffs.

They obviously have the capital to provide this, and still grow their funds for eternity without ever needing a single dollar in external revenue.

Why don't they?

joshuaissac · 67d ago

They provide database dumps already, and those dumps have the diff information. Crawlers are ignoring the dumps and scraping the websites anyway.

PeterStuer · 67d ago

Have they ever asked the customers why they prefer scraping over the data deltas?

ks2048 · 67d ago

I would bet the answer is it is easier to write a script the simply downloads everything it can (foreach <a href=>: download and recurse), rather than looking into which sites provide data dumps and how to use them.

PeterStuer · 66d ago

So the solution would be an edgecached site exactly like the full site with just the deltas since a periodic timepoint?

The crawler still crawls but can confidently rest assured it still has all the info with the base+delta as if it had recrawled everyting?

Palomides · 67d ago

>Every customer would prefer a firehose content delta over having to scrape for diffs.

customers is a strong word, especially when you're saying they should be providing a new service useful, more or less exclusively, to AI startups and megacorps

PeterStuer · 66d ago

Why not? If your mission as a non-profit is to share the knowledge, are these not just welcome new value adding channels to achieve that goal?

add-sub-mul-div · 67d ago

Why should they create a whole new architecture to support when you can find changed articles between two dumps with a simple query? I'd rather load a big file into a database than maintain a firehose consumer.

PeterStuer · 66d ago

That is a question of latency.

Ask HN: What Problem Would You Solve with Unlimited Resources?

Pocket LLM Server Just Like a Pocket WiFi

Ask HN: How did Soham Parekh get so many jobs?

Ask HN: People who work different timezones than your company. How sched?

Ask HN: What are some cool or underrated tech companies based in Australia?

Ask HN: How are you making money on the side?

Ask HN: What are some cool or underrated tech companies based in Canada?

Ask HN: How is the tech scene in LA?

Ask HN: Has anyone else learned English just by reading tech posts (like HN)?

Ask HN: What's the verdict on GPT wrapper companies these days?

Ask HN: What Are You Working On? (June 2025)

Agentic terminology doesn't make any sense

Ask HN: Any resources for finding non-smart appliances?

Ask HN: Freelancer? Seeking freelancer? (July 2025)

N8n AI Workflows – 3,400 Workflows and an LLM Prototype

Ask HN: Who is hiring? (July 2025)

Ask HN: Worth leaving position over push to adopt vibe coding?

Ask HN: Do you use LLM for HTML translations?

Are there any noteworthy LinkedIn alternatives?

Ask HN: What inspires you to persevere through adversity?

Ask HN: Who wants to be hired? (July 2025)

Ask HN: Took a break after burnout – what now?

Ask HN: Advice for Starting a Hacker Space?

Slack is just the worst – and I've used a BBS and 14.4k modem

Ask HN: What's the 2025 stack for a self-hosted photo library with local AI?

Proposal: GUI-first, text-based mechanical CAD inspired by software engineering

Ask HN: What happened to W3C's PROV initiative to add provenance to the Web?

Ask HN: Brick and Mortar Dev Agency

Which email clients work well with keyboard shortcuts?

Ask HN: How do you deal with data backups in servers?

Ask HN: HN was much more interesting a year ago

Ask HN: What's the greatest piece of non-dogfooded software?

Ask HN: How many communities HN it devs in C language?

1KB JavaScript Demoscene Challenge Just Launched

Super Simple "Hallucination Traps" to detect interview cheaters

Ask HN: Is there a business for extracting US tech talent?

Why did not numpy copy the J rank concept?

Ask HN: What old or outdated software have you never found a replacement for?

Looking for Early Testers for a AI Assistant Inside Zotero

Ask HN: MiniNAS Experience

Ask HN: How do I buy a typewriter?

mTLS vs. HTTP Message Signatures: Tradeoffs in Securing HTTP Requests

Ask HN: How do you sell to B2B in current state of AI?

Ask HN: How to generate product docs E2E?

Tell HN: A fake, highly obfuscated Solidity VSCode plugin found on marketplace

Ask HN: Why there is no demand for my SaaS when competition is killing it?

ARZY-G: A token born from AI-validated usefulness (not mined, not bought)

Ask HN: What are the best resources to help with health insurance denials?

If Emacs is not a text editor, then what is it really?

CellularLab – A Modern Android iPerf3 App with TCP/UDP Testing and AI Analysis

Crawlers impact the operations of the Wikimedia projects

Comments (66)