China Will Win at AI Because of Elsevier
Elsevier doesn’t just sell access to human readers. It aggressively enforces license terms that prohibit text and data mining for machine learning. Even universities that pay for journal access often find their AI research groups barred from using that content to train models. The terms are explicit: you can read the paper, but your model can’t.
Meanwhile, China ignores these restrictions. Its researchers operate with centralized access to nearly every major Western journal, whether through institutional mirrors, semi-legal repositories, or outright scraping. Tools like Sci-Hub are quietly tolerated or folded into internal systems. Legal or not, the outcome is clear: China’s models are learning from the full scientific corpus.
In the West, researchers are stuck paying Elsevier for access and are still told they can’t use it for machine learning unless they strike special deals, which are expensive, limited, or flatly denied.
Everyone talks about compute, but the real long-term advantage lies in training data. If China is feeding its models every scientific paper ever published while Western models train on Reddit, Wikipedia, and scraped blogs, who’s really ahead?
We’ve put up massive walls around our most valuable content and then told our own researchers to innovate with scraps. Elsevier’s copyright model was designed for print-era publishing, but it now acts as a self-imposed tax on Western AI development.
If AI is the new electricity, Elsevier is the dam. And China built a bypass.
The irony is that English-speaking countries should have a massive advantage here: nearly all of this corpus is published in English.