The Ant Mill: How theoretical high-energy physics descended into groupthink (jespergrimstrup.substack.com)

One aspect of this ruling [1] that I find concerning: on pages 7 and 11-12, it concedes that the LLM does substantially "memorize" copyrighted works, but rules that this doesn't violate the author's copyright because Anthropic has server-side filtering to avoid reproducing memorized text. (Alsup compares this to Google Books, which has server-side searchable full-text copies of copyrighted books, but only allows users to access snippets in a non-infringing manner.)

Does this imply that distributing open-weights models such as Llama is copyright infringement, since users can trivially run the model without output filtering to extract the memorized text?

[1]: https://storage.courtlistener.com/recap/gov.uscourts.cand.43...

riskable · 1h ago

A judge already ruled that models themselves don't constitute copyright infringement in Kadrey v. Meta Platforms, Inc. (https://casetext.com/case/kadrey-v-meta-platforms-inc). The EFF has a good summary about it:

> the court dismissed “nonsensical” claims that Meta’s LLaMA models are themselves infringing derivative works.

See: https://www.eff.org/deeplinks/2025/02/copyright-and-ai-cases...

comex · 45m ago

Yes and no.

In this case, the plaintiffs alleged that Anthropic's LLMs had memorized the works so completely that "if each completed LLM had been asked to recite works it had trained upon, it could have done so", "almost verbatim". The judge assumed for the sake of argument that the allegation was true, and ruled that the conduct was fair use anyway due to the existence of an effective filter. Therefore there was no need to determine whether the allegation was actually true.

So - yes, in the sense that the ruling suggests that distributing an open-weight LLM that memorized copyrighted works to that extent would not be fair use.

But no, in the sense that it's not clear whether any LLMs, especially open-weight LLMs, actually memorize book-length works to that extent. Even the recent study about Llama memorizing a Harry Potter book [1] only said that Llama could reproduce 50-token snippets a decent percentage of the time when given the preceding 50 tokens. That's different from actually being able to recite any substantial portion of the book. If you asked Llama for that, the output would quickly diverge from the original text, and it likely wouldn't be able to get back on track without being re-prompted from the ground truth as the study did.

On the other hand, in the case where the New York Times is suing OpenAI, the NYT has alleged that ChatGPT was able to recite extensive portions of NYT articles verbatim. If true, this might be more dangerous, since news articles are not as long as books but they're equally eligible for copyright protection. So we'll see how that shakes out.

Also note:

- Nothing in the opinion sets formal precedent because it's a district court. But the opinion might still influence later judges.

- See also riskable's sibling comment for another case where a judge addressed the issue more head-on (but wasn't facing the same kind of detailed allegations, I don't think; haven't checked).

[1] https://arxiv.org/abs/2412.06370

ticulatedspline · 3h ago

Yep, broadly capable open models are on track for annihilation. The cost of legally obtaining all the training materials will require hefty backing.

Additionally that if you download a model file that contains enough of the source material to be considered infringing (even without using the LLM, assume you can extract the contents directly out of the weights) then it might as well be a .zip with a PDF in it, the model file itself becomes an infringing object whereas closed models can be held accountable by not what they store but what they produce.

bonoboTP · 33m ago

This technology is a really bad way of storing, reproducing and transmitting the books themselves. It's probabilistic and lossy. It may be possible to reproduce some paragraphs, but no reasonable person would expect to read The Da Vinci Code by prompting the LLM. Surely the marketed use cases and the observed real use by users has to make it clear that the intended and vastly overwhelming use of an LLM is transformative, "digestive" synthesis of many sources to construct a merged, abstracted, generalized system that can function in novel uses, answering never before seen prompts in a useful manner, overwhelmingly without reproducing existing written works. It surely matters what the purpose of the thing is both in intention and observed practice. It's not a viable competing alternative to reading the actual book.

lcnPylGDnU4H9OF · 1h ago

> broadly capable open models are on track for annihilation

I'm not so sure about this one. In particular, presuming that it is found that models which can produce infringing material are themselves infringing material, the ability to distill models from older models seems to suggest that the older models can actually produce the new, infringing model. It seems like that should mean that all output from the older model is infringing because any and all of it can be used to make infringing material (the new model, distilled from the old).

I don't think it's really tenable for courts to treat any model as though it is, in itself, copyright-infringing material without treating every generative model like that and, thus, killing the GPT/diffusion generation business (that could happen but it seems very unlikely). They will probably stick to being critical of what people generate with them and/or how they distribute what they generate.

dragonwriter · 47m ago

> a model file that contains enough of the source material to be considered infringing

The amount of the source material encoded does not, alone, determine if it is infringing, so this noun phrase doesn't actually mean anything. I know there are some popular myths that contradict this (the commonly-believed "30-second rule" for music, for instance), but they are just that, myths.

fallingknife · 4m ago

But there is the issue of whether there are damages. If my LLM can reproduce 10 random paragraphs of a Harry Potter book, it's obvious that nobody would have otherwise purchased the book if they couldn't read those 10 paragraphs. So there will not be any damages to the publisher and the lawsuit will be tossed. There is a threshold of how much of it needs to be reproduced, and how closely, but it's a subjective standard and not some hard line like if it's > 50%.

vinni2 · 1h ago

> extract the contents directly out of the weights

If you can successfully demonstrate that then yes it is a copyright infringement and successfully doing that would be worthy of NeurIPS or ACL paper.

CamperBob2 · 1h ago

Yep, broadly capable open models are on track for annihilation. The cost of legally obtaining all the training materials will require hefty backing.

This will have the effect of empowering countries (and other entities) that don't respect copyright law, of course.

The copyright cartel cannot be allowed to yank the handbrake on AI. If they insist on a fight, they must lose.

throwaway562if1 · 1h ago

For that matter, how dare the government fine me for dumping waste in the river, and stop me from employing minors? Don't they know it will ruin the economy?

clvx · 2h ago

Wouldn’t the issue be executing the models to third parties without filters? No idea if this is right but the same it would apply to Anthropic that they couldn’t run the model without the filter system having a chicken an egg problem. Can’t develop the filter without looking into the model.

No comments yet

deadbabe · 3h ago

You can use the copyrighted text for personal purposes.

dragonwriter · 44m ago

You can also, in the US, use it for any purposes which fall within the domain of "fair use", which while now also incorporated in the copyright statute, was first identified as an application of the first amendment and, as such, a constitutional limit on what Congress even had the power to prohibit with copyright law (the odd parameters of the statutory exception are largely because it attempted to codify the existing Constitutional case law.)

Purposes which are fair use are very often not at all personal.

(Also, "personal use" that involves copying, creating a derivative work, or using any of the other exclusive rights of a copyright holder without a license or falling into either fair use or another explicit copyright exception are not, generally, allowed, they are just hard to detect and unlikely to be worth the copyright holder's time to litigate even if they somehow were detected.)

layer8 · 3h ago

But you can’t distribute it, which in the scenario mentioned in the parent’s final paragraph arguably happens.

AnthonyMouse · 2h ago

You can't distribute the copyrighted works, but that isn't inherently the same thing as the model.

It's sort of like distributing a compendium of book reviews. Many of the reviews have quotes from the book. If there are thousands of reviews, you could potentially reconstruct the whole book, but that's not the point of the thing and so it makes sense for the infringing thing to be "using it to reconstruct the whole book" rather than "distributing the compendium".

And then Anthropic fended off the argument that their service was intended for doing the former because they were explicitly taking measures to prevent that.

layer8 · 1h ago

The premise was that the model is able to reproduce the memorized text, and that what saved Anthropic was them having server-side filtering to avoid reproducing that text. So the presumption is that without those filters, the model would be able to reproduce text substantial enough to constitute a copyright violation (otherwise they wouldn’t need the filter argument). Distributing a “machine” producing such output would constitute copyright infringement.

Maybe this is a misrepresentation of the actual Anthropic case, I have no idea, but it’s the scenario I was addressing.

AtlasBarfed · 3h ago

Hey can I have a fake llm "trained" on a set of copyrighted works to ask what those works are?

So it totally isn't a warez streaming media server but AI?

I'm guessing since my net worth isn't a billion plus, the answer is no

AnthonyMouse · 1h ago

People have been coming up with convoluted piracy loopholes since the invention of copyright.

If you xor some data with random numbers, both the result and the random numbers are indistinguishably random and there is no way to tell which one came out of a random number generator and which one is "derived" from a copyrighted work. But if you xor them together again the copyrighted work comes out. So if you have Alice distribute one of the random looking things and Bob distribute the other one and then Carol downloads them both and reconstructs the copyrighted work, have you created a scheme to copy whatever you want with no infringement occurring?

Of course not, at least Carol is reproducing an infringing work, and then there are going to be claims of contributory infringement etc. for the others if the scheme has no other purpose than to do this.

Meanwhile this problem is also boring because preventing anyone from being the source of infringing works isn't a thing anybody has been able to do since at least as long as the internet has allowed anyone to set up a server in another jurisdiction.

martin-t · 59m ago

Copyright was codified in an age where plagiarism was time consuming. Even replacing words with synonyms on a mass scale was technically infeasible.

The goal of copyright is to make sure people can get fair compensation for the amount of work they put in. LLMs automate plagiarism on a previously unfathomable scale.

If humans spend a trillion hours writing books, articles, blog posts and code, then somebody (a small group of people) comes and spends a million hours building a machine that ingests all the previous work and produces output based on it, who should get the reward for the work put in?

The original authors together spent a million times more effort (normalized for skill) and should therefore should get a million times bigger reward than those who build the machine.

In other words, if the small group sells access to the product of the combined effort, they only deserve a millionth of the income.

---

If "AI" is as transformative as they claim, they will have no trouble making so much money they they can fairly compensate the original authors while still earning a decent profit. But if it's not, then it's just an overpriced plagiarism automator and their reluctance to acknowledge they are making money on top of everyone else's work is indicative.

bonoboTP · 27m ago

> get fair compensation for the amount of work

This is a bit distorted. This is a better summary: The primary purpose of copyright is to induce and reward authors to create new works and to make those works available to the public to enjoy.

The ultimate purpose is to foster the creation of new works that the public can read and written culture can thrive. The means to achieve this is by ensuring that the authors of said works can get financial incentives for writing.

The two are not in opposition but it's good to be clear about it. The main beneficiary is intended to be the public, not the writers' guild.

Therefore when some new factor enters the picture such as LLMs, we have to step back and see how the intent to benefit the reading public can be pursued in the new situation. It certainly has to take into account who and how will produce new written works, but it is not the main target, but can be an instrumental subgoal.

blindriver · 6m ago

Humans read books. AI/LLMs do not read. I think there's an inherent difference here. If the LLM is making a copy of the entire book in it's memory, is that copyright infringement? I don't know the answer to that, but it feels like Alsup is considering this fair use argument in the context of a human, but it's nothing like a human and needs to be treated differently.

Fluorescence · 36m ago

I'm surprised we never discuss a previous case of how governments handled a valuable new technology that challenged creative's ability to monetise their work:

Cassette Tapes and Private Copying Levy.

https://en.wikipedia.org/wiki/Private_copying_levy

Governments didn't ban tapes but taxed them and fed the proceeds back into the royalty system. An equivalent for books might be an LLM tax funding a negative tax rate for sold books e.g. earn $5 and the gov tops it up. Can't imagine how to ensure it was fair though.

Alternatively, might be an interesting math problem to calculate royalties for the training data used in each user request!

bonoboTP · 20m ago

Surely this would require the observation that the public is actually using LLMs as a substitute for purchasing the book, ie they sit down and type "Generate me the first/second/third chapter of The Da Vinci Code" and then read if from there. Because it was easy to observe in the cassette tape era that people copied the store bought music and films and shared it among each other. I doubt that this is or will be a serious use case of LLMs.

dmix · 6m ago

That's a very different use case IMO. An LLM isn't generating a replica of a book for the users. At most we've seen people able to reproduce exact portions of stuff, but only with lots of prior knowledge of the material by the human in the loop and plenty of manual effort (aka not a direct commercial threat). And that was before more LLMs put effort into stopping that sort of hacking.

The last thing the world needs is more nonsensical copyright law and hand wavy regulation funded by entrenched interests.

3PS · 4h ago

Broadly summarizing.

This is OK and fair use: Training LLMs on copyrighted work, since it's transformative.

This is not OK and not fair use: pirating data, or creating a big repository of pirated data that isn't necessarily for AI training.

Overall seems like a pretty reasonable ruling?

derbOac · 4h ago

But those training the LLMs are still using the works, and not just to discuss them, which I think is the point of fair use doctrine. I guess I fail to see how it's any different from me using it in some other way? If I wanted to write a play very loosely inspired by Blood Meridian, it might be transformative, but that doesn't justify me pirating the book.

I tend to think copyright should be extremely limited compared to what it is now, but to me the logic of this ruling is illogical other than "it's ok for a corporation to use lots of works without permission but not for an individual to use a single work without permission." Maybe if they suddenly loosened copyright enforcement for everyone I might feel differently.

"Kill one man, and you are a murderer. Kill millions of men, and you are a conqueror." (An admittedly hyperbolic comparison, but similar idea.)

rcxdude · 4h ago

>If I wanted to write a play very loosely inspired by Blood Meridian, it might be transformative, but that doesn't justify me pirating the book.

I think that's the conclusion of the judge. If Anthropic were to buy the books and train on them, without extra permission from the authors, it would be fair use, much like if you were to be inspired by it (though in that case, it may not even count as a derivative work at all, if the relationship is sufficiently loose). But that doesn't mean they are free to pirate it either, so they are likely to be liable for that (exactly how that interpretation works with copyright law I'm not entirely sure: I know in some places that downloading stuff is less of a problem than distributing it to others because the latter is the main thing that copyright is concerned with. And AFAIK most companies doing large model training are maintaining that fair use also extends to them gathering the data in the first place).

(Fair use isn't just for discussion. It covers a broad range of potential use cases, and they're not enumerated precisely in copyright law AFAIK, there's a complicated range of case law that forms the guidelines for it)

tsumnia · 4h ago

I think the issue is that its actually quite difficult to "unlearn" something once you've seen it. I'm speaking more from human-learning rather than AI-learning, but since AI is inspired by our view on nature, it will have similar qualities. If I see something that inspires, regardless of if I paid for that, I may not even know what specifically inspired me. If I sit on a park bench and an idea comes to me, it could come from a number of things - the bench, park, weather, what movie I watched last night, stuff on the wall of a restaurant while I was eating there, etc.

While humans don't have encyclopedic memories, our brain connects a few dots to make a thought. If I say "Luke, I am your father", it doesn't matter that isn't even the line is wrong, anyone that's seen Star Wars knows what I'm quoting. I may not be profiting from using that line, but that doesn't stop Star Wars from inspiring other elements of my life.

I do agree that copyright law is complicated and AI is going to create even more complexity as we navigate this growth. I don't have a solution on that front, just a recognition that AI is doing what humans do, only more precisely.

altruios · 4h ago

which AFAIN IANAL, copyright and exhaustive rights are completely different. Under copyright, once a book is purchased: that's it. Reselling the same, or transformed (re: highlighted) worked 'used' is 100% legal, as is consuming it at your discretion (in your mind {a billion times}, a fire, or (yes even) what amounts to a fancy calculator).

(that's all to say copyright is dated and needs an overhaul)

But that's taking a viewpoint of 'training a personal AI in your home', which isn't something that actually happens... The issue has never been the training data itself. Training an AI and 'looking at data and optimizing a (human understanding/AI understanding) function over it' are categorically the same, even if mechanically/biologically they are very different.

comex · 37m ago

The judge actually agreed with your first paragraph:

> This order doubts that any accused infringer could ever meet its burden of explaining why downloading source copies from pirate sites that it could have purchased or otherwise accessed lawfully was itself reasonably necessary to any subsequent fair use. There is no decision holding or requiring that pirating a book that could have been bought at a bookstore was reasonably necessary to writing a book review, conducting research on facts in the book, or creating an LLM. Such piracy of otherwise available copies is inherently, irredeemably infringing even if the pirated copies are immediately used for the transformative use and immediately discarded.

(But the judge continued that "this order need not decide this case on that rule": instead he made a more targeted ruling that Anthropic's specific conduct with respect to pirated copies wasn't fair use.)

dragonwriter · 4h ago

> I tend to think copyright should be extremely limited compared to what it is now, but to me the logic of this ruling is illogical other than "it's ok for a corporation to use lots of works without permission but not for an individual to use a single work without permission."

That's not what the ruling says.

It says that training a generative AI system not designed primarily as a direct replacement for a work on one or more works is fair use, and that print-to-digital destructive scanning for storage and searchability is fair use.

These are both independent of whether one person or a giant company or something in between is doing it, and independent of the number of works involved (there's maybe a weak practical relationship to the number of works involved, since a gen AI tool that is trained on exactly one work is probably somewhat less likely to have a real use beyond a replacement for that work.)

tantalor · 3h ago

The analogy to training is not writing a play based on the work. It's more like reading (experiencing) the work and forming memories in your brain, which you can access later.

I'm allowed to hear a copyrighted tune, and even whistle it later for my own enjoyment, but I can't perform it for others without license.

AlienRobot · 2h ago

This is nonsense, in my opinion. You aren't "hearing" anything. You are literally creating a work, in this case, the model, derived from another work.

People need to stop anthropomorphizing neural networks. It's a software and a software is a tool and a tool is used by a human.

adinisom · 28m ago

Humans are also created/derived from other works, trained, and used as a tool by humans.

It's interesting how polarizing the comparison of human and machine learning can be.

tantalor · 2h ago

It is easy to dismiss, but the burden of proof would be on the plaintiff to prove that training a model is substantially different than the human mind. Good luck with that.

klabb3 · 4h ago

> But those training the LLMs are still using the works, and not just to discuss them, which I think is the point of fair use doctrine.

Worse, they’re using it for massive commercial gain, without paying a dime upstream to the supply chain that made it possible. If there is any purpose of copyright at all, it’s to prevent making money from someone’s else’s intellectual work. The entire thing is based on economic pragmatism, because just copying does obviously not deprive the creator of the work itself, so the only justification in the first place is to protect those who seek to sell immaterial goods, by allowing them to decide how it can be used.

Coming to the conclusion that you can ”fair use” yourself out of paying for the most critical part of your supply makes me upset for the victims of the biggest heist of the century. But in the long term it can have devastating chilling effects, where information silos will become the norm, and various forms of DRM will be even more draconian.

Plus, fair use bypasses any licensing, no? Meaning even if today you clearly specify in the license that your work cannot be used in training commercial AI, it isn’t legally enforceable?

growse · 3h ago

> Worse, they’re using it for massive commercial gain, without paying a dime upstream to the supply chain that made it possible. If there is any purpose of copyright at all, it’s to prevent making money from someone’s else’s intellectual work.

This makes no sense. If I buy and read a book on software engineering, and then use that knowledge to start a career, do I owe the author a percentage of my lifetime earnings?

Of course not. And yet I've made money with the help of someone else's intellectual work.

lurkshark · 3h ago

If you pirate a book on software engineering and then use that knowledge to start a career, do you owe the author the royalties they would be paid had you bought the book?

If the career you start isn't software engineering directly but instead re-teaching the information you learned from that book to millions of paying students, is the regular royalty payment for the book still fair?

ticulatedspline · 4h ago

Definitely seems reasonable to say "you can train on this data but you have to have a legal copy"

Personally I like to frame most AI problems by substituting a human (or humans) for the AI. Works pretty well most of the time.

In this case if you hired a bunch of artists/writers that somehow had never seen a Disney movie and to train them to make crappy Disney clones you made them watch all the movies it certainly would be legal to do so but only if they had legit copies in the training room. Pirating the movies would be illegal.

Though the downside is it does create a training moat. If you want to create the super-brain AI that's conversant on the corpus of copyrighted human literature you're going to need a training library worth millions

martin-t · 47m ago

> Personally I like to frame most AI problems by substituting a human (or humans) for the AI. Works pretty well most of the time.

Human time is inherently valuable, computer time is not.

The issue with LLMs is that they allow doing things at a massive scale which would previously be prohibitively time consuming. (You could argue but them how much electricity is worth one human life?)

If I "write" a book by taking another and replacing every word with a synonym, that's obviously plagiarism and obviously copyright infringement. How about also changing the word order? How about rewording individual paragraphs while keeping the general structure? It's all still derivative work but as you make it less detectable, the time and effort required is growing to become uneconomical. An LLM can do it cheaply. It can mix and match parts of many works but it's all still a derivative of those works combined. After all, if it wasn't, it would produce equally good output with a tiny fraction of the training data.

The outcome is that a small group of people (those making LLMs and selling access to their output) get to make huge amounts of money off of the work of a group that is several orders of magnitude larger (essentially everyone who has written something on the internet) without compensating the larger group.

That is fundamentally exploitative, whether the current laws accounted for that situation or not.

johnnyanmac · 4h ago

That's a part of the issue. I'm not sure if this has happened in visual arts, but there is in fact precedent against trying to hire a sound a like over the one you want to sound like. You can't be in talks with Scarlet Johannsen, reject her, and then hire a sound a like and say "talk like Scarlet". It's pretty clear at that point what you want but you didn't want to pay talent for it.

I see elements of that here. Buying copyrighted works not to be exposed and be inspired, nor to utilize the aithor's talents, but to fuel a commercialization of sound-a-likes.

lesuorac · 4h ago

> You can't be in talks with Scarlet Johannsen, reject her, and then hire a sound a like and say "talk like Scarlet"

Keep in mind, the Authors in the lawsuit are not claiming the _output_ is copyright infringement so Alsup isn't deciding that.

Dracophoenix · 2h ago

> but there is in fact precedent against trying to hire a sound a like over the one you want to sound like. You can't be in talks with Scarlet Johannsen, reject her, and then hire a sound a like and say "talk like Scarlet". It's pretty clear at that point what you want but you didn't want to pay talent for it.

You're referencing Midler v Ford Motor Co in the 9th circuit. This case largely applies to California, not the whole nation. Even then, it would take one Supreme Court case to overturn it.

alganet · 3h ago

What you are describing happened and they got sued:

https://en.wikipedia.org/wiki/Mickey_Mouse#Walt_Disney_Produ...

I'm on the Air Pirates side for the case linked, by the way.

However, AI is not a parody. It's not adding to the cultural expression like a parody would.

Let's forget all the law stuff and these silly hypotheticals. Let's think of humanity instead:

Is AI contributing to education and/or culture _right now_, or is it trying to make money? I think they're trying to make money.

tgv · 4h ago

> Definitely seems reasonable to say "you can train on this data but you have to have a legal copy"

How many copies? They're not serving a single client.

Libraries need to have multiple e-book licenses, after all.

ticulatedspline · 4h ago

In the human training case probably a Store DVD would still run afoul of that licensing issue. That's a broader topic of audience and I didn't want to muddy the analogy with that detail.

It changes the definition of what a "legal copy" is but the general idea that the copy must be legal still stands.

tgv · 3h ago

Fair enough.

simmerup · 4h ago

Depends whether you actually agree its transformative

lesuorac · 4h ago

For textual purposes it seems fairly transformative.

If you train a LLM on harry potter and ask it to generate a story that isn't harry potter then it's not a replacement.

However, if you train a model on stock imagery and use it to generate stock imagery then I think you'll run into an issue from the Warhol case.

sidewndr46 · 4h ago

Wasn't that just over an arrangement of someone else's photographs?

lesuorac · 4h ago

https://en.wikipedia.org/wiki/Andy_Warhol_Foundation_for_the...

I wouldn't call it that. Goldsmith took a photograph of Prince which Warhol used as a reference to generate an illustration. Vanity Fair then chose to buy a license Warhol's print instead of Goldsmith's photograph.

So, despite the artwork being visual transformative (silkscreen vs photograph) the actual use was not transformed.

johnnyanmac · 4h ago

The nature of how they store data makes it not okay in my books. You massage the data enough and you can generate something that seems infringement worthy.

ticulatedspline · 4h ago

For closed models the storage problem isn't really a problem, they can be judged by what they produce not how they store it as you don't have access to the actual data. That said, open weight LLMs are probably screwed, if enough of the work remains in the weights such that they can be extracted (even if it's without even talking to the LLM) then the weight file itself represents a copy of the work that's being distributed. So enjoy these competent run-at-home models while you can, they're on track for extinction.

ninetyninenine · 3h ago

Why doesn’t this apply to humans? If I memorize something such that it can be extracted did I violate the law? It’s only if I choose to allow such extraction to occur then I’m in violation of the law right?

So if I or an LLM simply doesn’t allow said extraction to occur, memorization and copying is not against the law.

ranger_danger · 3h ago

I think an important distinction here is distribution... did you tell someone else what you memorized? Is downloading a model akin to distributing that same information?

ninetyninenine · 3h ago

What if I don't download the model and I just communicate with it. Sort of like chatting with another human. That's not a copyright issue right? I mean that's how most LLMs are deployed today.

ranger_danger · 3h ago

My understanding is that it depends on a judge/jury's subjective opinion on how similar the output is to something copyrightable. Perhaps intent may play a role as well.

ranger_danger · 3h ago

I wonder if https://en.wikipedia.org/wiki/Illegal_number comes into play here.

thedevilslawyer · 4h ago

What's the steelman case that is transformative? Because prima-facie, it seems to only output original output - "intelligent" output.

almatabata · 4h ago

If a publisher adds a "no AI training" clause to their contracts, does this ruling render it invalid?

jxdxbx · 2h ago

You don't need a license for most of what people do with traditional, physical copyrighted copies of works: read them, play a DVD at home, etc. Those things are outside the scope of copyright. But you do need a license to make copies, and ebooks generally come with licensing agreements, again because to read an ebook, you must first make a brand new copy of it. Anyway as a result physical books just don't have "licenses" to begin with and if they tried they'd be unenforceable, since you don't need to "agree" to any "terms" to read a book.

heavyset_go · 4h ago

Fair use overrides licensing

AlanYx · 3h ago

Fair use "overrides" licensing in the sense that one doesn't need a copyright license if fair use applies. But fair use itself isn't a shield against breach of contract. If you sign a license contract saying you won't train on the thing you've licensed, the licensor still has remedies for breach of contract, just not remedies for copyright infringement (assuming the act is fair use).

almatabata · 4h ago

thanks for clarifying.

bananapub · 4h ago

what contract? with who?

Meta at least just downloaded ENGLISH_LANGUAGUE_BOOKS_ALL_MEGATORRENT.torrent and trained on that.

almatabata · 4h ago

I know, but the article mentions that a separate ruling will be made about that pirating.

quote: “We will have a trial on the pirated copies used to create Anthropic’s central library and the resulting damages,” Judge Alsup wrote in the decision. “That Anthropic later bought a copy of a book it earlier stole off the internet will not absolve it of liability for theft but it may affect the extent of statutory damages.”

This tells me Anthropic acquired these books legally afterwards. I was asking if during that purchase, the seller could add a no training close to the sales contract.

shagie · 1h ago

What contracts? And would it run afoul of first sale doctrine?

https://en.wikipedia.org/wiki/First-sale_doctrine

> The doctrine was first recognized by the Supreme Court of the United States in 1908 (see Bobbs-Merrill Co. v. Straus) and subsequently codified in the Copyright Act of 1909. In the Bobbs-Merrill case, the publisher, Bobbs-Merrill, had inserted a notice in its books that any retail sale at a price under $1.00 would constitute an infringement of its copyright. The defendants, who owned Macy's department store, disregarded the notice and sold the books at a lower price without Bobbs-Merrill's consent. The Supreme Court held that the exclusive statutory right to "vend" applied only to the first sale of the copyrighted work.

> Today, this rule of law is codified in 17 U.S.C. § 109(a), which provides:

> Notwithstanding the provisions of section 106 (3), the owner of a particular copy or phonorecord lawfully made under this title, or any person authorized by such owner, is entitled, without the authority of the copyright owner, to sell or otherwise dispose of the possession of that copy or phonorecord.

---

If I buy a copy of a book, you can't limit what I can do with the book beyond what copyright restricts me.

ninetyninenine · 3h ago

Agreed. If I memorize a book and I am deployed into the world to talk about what I memorized that is not a violation of copyright. Which is reasonable logically because essentially this is what an LLM is doing.

bonoboTP · 3m ago

You can talk about it, but you can't sell tickets to an event where you recite from memory all the poems written by someone else without their permission.

LLMs may sometimes reproduce exact copies of chunks of text, but I would say it also matters that this is an irrelevant use case that is not the main value proposition that drives LLM company revenues, it's not the use case that's marketed and it's not the use case that people in real life use it for.

layer8 · 3h ago

It might be different if you are a commercial product which couldn’t have been created without incorporating the contents of all those books.

Humans, animals, hardware and software are treated differently by law because they have different constraints and capabilities.

ninetyninenine · 3h ago

But a commercial product is reaching parity with human capability.

Let's be real, Humans have special treatment (more special than animals as we can eat and slaughter animals but not other humans) because WE created the law to serve humans.

So in terms of being fair across the board LLMs are no different. But there's no harm in giving ourselves special treatment.

layer8 · 2h ago

Generative AIs are very different from humans because they can be copied losslessly and scaled tremendously, and also have no individual liability, nor awareness of how similar their output is to something in their training material. They are very different in constraints and capabilities from humans in all sorts of ways. For one, a human will likely never reproduce a book they read without being aware that that’s what they are doing.

martin-t · 1h ago

Except you can't do it at a massive scale. LLMs both memorize at a scale bigger than thousands, probably millions of humans AND reproduce at an essentially unlimited scale.

And who gets the money? Not the original author.

doctorpangloss · 4h ago

It’s similar to the Google Books ruling, which Google lost. Anthropic also lost. TechCrunch and others are very aspirational here.

philipkglass · 3h ago

Do you mean Authors Guild, Inc. v. Google, Inc.? Google won that case:

https://en.wikipedia.org/wiki/Authors_Guild,_Inc._v._Google,....

Maybe there's another big Google Books lawsuit that Google ultimately lost, but I don't know which one you mean in that case.

doctorpangloss · 1h ago

see, but if you ask a copyright attorney: Google lost. This is what I mean by aspirational. They won something, in very similar circumstances to Anthropic, "fair use," but everything else that made what they were doing a practical reality instead of purely theoretical required negotiation with Authors Guild, and indeed, they are not doing what they wanted to do, right? Anthropic has to go to trial still, they had to pirate the books to train, and they will not win on their right to commercialize the results of training, because neither did Google, so what good is the Fair Use ruling, besides allowing OpenAI v. NYTimes to proceed a little longer?

SoKamil · 4h ago

What if I overfit my LLM so it spits out copyrighted work with special prompting? Where to draw the line in training?

bonoboTP · 1m ago

If you do something else, the result may be something else. The line is drawn by the application of subjective common sense by the judge, just as it is every time.

ninetyninenine · 3h ago

I mean the human brain can memorize things as well and it’s not illegal. It’s only illegal if said memorized thing is distributed.

martin-t · 32m ago

Humans don't scale. LLMs do.

Even if LLMs were actual human-level AI (they are not - by far), a small bunch of rich people could use them to make enormous amounts of money without putting in the enormous amounts of work humans would have to.

All the while "training" (= precomputing transformations which among other things make plagiarism detection difficult) on work which took enormous amounts of human labor without compensating those workers.

mrguyorama · 2h ago

Because humans have rights

AI models do not.

NoOn3 · 5m ago

Exactly. If someone wants to compare AI models with humans, maybe then they give AI Models the right to vote and other rights.

veggieroll · 4h ago

BRB, I'm going to download all the TV shows and movies to train my vision model. Just to be sure it's working properly, I have to watch some for debugging purposes.

ncruces · 4h ago

You need to buy one copy of each for the fair use to apply.

toomuchtodo · 3h ago

Let everyone donate their DVDs and other physical media. You don’t need to buy it, you just need to possess the media.

veggieroll · 3h ago

Indeed, I forsee a "training dataset consortium" arising out of this, whereby a bunch of companies team up to buy one copy of everything and then share it for training amongst themselves (ex. by reselling the entire library to each other for $1).

toomuchtodo · 3h ago

Like an Archive? Connected to the Internet?

veggieroll · 3h ago

Genius!

bradley13 · 2h ago

Good. Reading books is legal. If I own a book and feed it to a program I wrote (and I have done exactly that), it is also legal. There is zero reason this should be any different with an AI.

thinkingtoilet · 1h ago

If you charge me to use your program and it spits out unedited, copyrighted material then it should be illegal. I don't know the details of this case, but that's what's going on in the New York Times case. It's not always so cut and dry.

paxys · 3h ago

Will be interesting to see how this affects Anthropic's ongoing lawsuit with Reddit, or all the different media publishing ones flying around. Is it okay to train on books but not online posts and articles? Why the distinction between the two?

cyanmagenta · 12m ago

The distinction will be whether those online posts were obtained legally, analogous to whether the books in this case were pirated.

It’s not as simple as it sounds, since I’m sure scraping is against Reddit’s terms and conditions, but if those posts are made publicly available without the scraper actually agreeing to anything, is that a valid breach of contract?

Will be interesting to see how that plays out.

gbacon · 4h ago

The HN crowd dislikes brick-and-mortar landlords but often sides with charging rent for certain bits. Which side will prevail?

Interesting excerpt:

> “We will have a trial on the pirated copies used to create Anthropic’s central library and the resulting damages,” Judge Alsup wrote in the decision. “That Anthropic later bought a copy of a book it earlier stole off the internet will not absolve it of liability for theft but it may affect the extent of statutory damages.”

Language of “pirated” and “theft” are from the article. If they did realize a mistake and purchased copies after the fact, why should that be insufficient?

MyOutfitIsVague · 4h ago

> The HN crowd dislikes brick-and-mortar landlords but often sides with charging rent for certain bits. Which side will prevail?

I don't think that's exactly the case. A lot of the HN crowd is very much against the current iterations of copyright law, but is much more against rules that they see as being unfairly applied. For most of us, we want copyright reform, but short of that, we want it to at least pretend to be used for what it is usually claimed to be for: protecting small artists from large, predatory companies.

lesuorac · 4h ago

Anthropic won't submit a spreadsheet of all the books and whether they were purchases or not. So trivially, not every book stolen is shown to be later purchased.

As just a matter of society, I don't think you want people say stealing a car and then coming back a month later with the money.

thedevilslawyer · 4h ago

While no one wants anyone to steal a car, almost no one would mind freely cloning a car. The trouble truly is that 3d-printing hasn't gotten that good yet.

layer8 · 3h ago

The car would be unlikely to exist if its maker had to expect free clones without compensation. So yes, people would mind.

johnnyanmac · 4h ago

If 3d printing was that good, stealing a car would be moot because production costs would come way down and only need to cover cost/procurement of materials and paying back the black box.

Regardless, I don't think the car is an apt metaphor here. Cars are an important utility and gatekeeping cars arguably holds society back., art is creative expression, and no one is going hungry because they didn't have $10 for the newest book.

We also have libraries already for this reason, so why not expand on that instead of relinquishing sharing of knowledge to a private corporation?

MyOutfitIsVague · 3h ago

I dislike framing art as something unimportant. Art is a vital part of being a human and part of a culture. We've grown accustomed to our culture being commoditized and rented back to us, but that doesn't mean the culture is unimportant, or such a state of affairs is acceptable.

johnnyanmac · 4h ago

>If they did realize a mistake and purchased copies after the fact, why should that be insufficient?

1. You're assuming this was some good faith "they didn't know they were stealing" factor. They use someone else's product's for commercial use. I'm not so charitable in my interpretation.

2. I'm not absolved of theft just because I go back and put money on the register. I still sttole, intentionally or not

impossiblefork · 3h ago

I think the reason it's okay to charge rent for certain bits is that the space of bitstrings is so large.

Choosing someone's bitstrings is like choosing to harvest someone's fields in a world where there's infinite space of fertile fields. You picked his, instead of finding a space in the infinite expanse to farm on your own.

If you start writing something you'll never generate a copyrighted work at random. When the work isn't available nothing is taken away from you even if you were strictly forbidden from reproducing the work.

Choosing someone's particular bitstring is only done because there's someone who has expended effort in preparing it.

PunchTornado · 4h ago

why would it erase the mistake? you pirated first.

gbacon · 3h ago

Who is the victim, and how was that person not made whole?

bgwalter · 4h ago

I have the feeling that with Alsup always the larger and more recent company wins. Google won vs. Oracle, now this.

So what is he going to do about the initial copyright infringement? Will the perpetrators get the Aaron Schwartz treatment?

UltraSane · 1h ago

If the US makes it illegal to train LLMs on copyrighted data that isn't going to stop China from doing it and give them an ENORMOUS advantage.

rsstack · 41m ago

https://news.ycombinator.com/item?id=44369227

If the US makes it illegal to train LLMs on copyrighted data, the US will find a solution and not just give up and wait half a decade to see what China does in the meantime.

UltraSane · 37m ago

What solution is there?

rsstack · 24m ago

Zillow have the MLSs network that provide them lists, a similar solution could apply if courts agree that library copies apply for this - Anthropic could sign agreements with large libraries and "check out"/"freeze" copies for a minimally-agreed-upon duration and query across all to see which has a copy of each book they need. Spotify and Apple Music sign deals en masse with labels, the same could apply here with book publishers, labels for lyrics, museums for art, etc. Or whatever other creative solution that people who will need to find, will find. Right now they took the laziest path, because it worked. They will find the next-laziest path that works.

And the easiest option: Legislation change. If it's completely decided that the current law blocks LLMs from working in the US, the industry will lobby to amend the copyright law (which is not immutable) to add a carveout for it.

You're assuming that people will just give up. People never gave up, why would they now?

josefritzishere · 3h ago

The US legal systel is bending over backwards to help AI development. The arguments border on nonsense.

shadowgovt · 51m ago

Can you offer some examples from this ruling? It seems pretty reasonable on a first read.

kmeisthax · 1h ago

I'm not sure why this alone is considered a separate issue from training the AI with books. Buying a copy of a copyrighted work doesn't inherently convey 'fair use rights' to the purchaser. If I buy a work, read it, sell it, and then publish a review or parody of it, I don't infringe copyright. Why does mere possession of an unauthorized copy create a separate triable matter before the court?

Keep in mind, you can legally engineer EULAs in such a way that merely purchasing the work surrenders all of your fair use rights. So this could wind up being effectively: "AI training is fair use for works purchased before June 24th, 2025, everything after is forbidden, here's your brand new moat OpenAI"

comex · 11m ago

The ruling suggests that "pirating a book that could have been bought at a bookstore" for the sake of "writing a book review" "is inherently, irredeemably infringing".

Which suggests that, at least in the judge's opinion, 'fair use rights' do exist in a sense, but it's about when you read the book, not when you publish.

But that's not settled precedent. Meta is currently arguing the opposite in Kadrey v. Meta: they're claiming that they can get away with torrenting training material as long as they only leech (download) and don't seed (upload), because, although the act of downloading (copying) is generally infringement under a Ninth Circuit precedent, they were making a fair use.

As for EULAs, that might be true for e-books, but publishers can't really do anything about Anthropic's new strategy of scanning physical books, because physical books generally don't come with shrinkwrap license agreements. Perhaps publishers could start adding them, but I think that would sit poorly with the public and the courts.

(That's assuming the ruling isn't overturned on appeal, which it easily might be.)

deepsun · 3h ago

Ok, so I can create a website, say, the-ai-pirate-bay.com, where I stream AI-reproduced movies. They are not verbatim, so I don't infringe any copyrights.

layer8 · 3h ago

They will infringe copyright as soon as they are sufficiently similar to the original. You can’t shoot a non-verbatim but clearly recognizable beat-by-beat remake of Star Wars, call it Galaxy Conflict, and get away with monetizing it.

shadowgovt · 49m ago

Correct.

You have to call it "Starcrash" (https://www.imdb.com/title/tt0079946/?ref_=ls_t_8). Then it's legal.

layer8 · 31m ago

Interesting artifact, but the very first/top IMDB user review convincingly contradicts that this is a Star Wars remake. ;)

Show HN: Natively – AI mobile app builder (iOS and Android) (natively.dev)

OpenAI Is Ruthless [video] (youtube.com)

Millions of Babies Have Been Saved from Murder Since Abortion Bans Insituted (catholicvote.org)

Ask HN: Help Me Find the Product

The consequences of Starbucks on startup culture in neighborhoods (thetreeoflife.cc)

The Ant Mill: How theoretical high-energy physics descended into groupthink (jespergrimstrup.substack.com)

National Archives to restrict public access starting July 7 (archives.gov)

Mexico is now Chinas No. 1 car export market (mexiconewsdaily.com)

Python Tools Are Quickly Adopting the New pylock.toml Standard (socket.dev)

The Discovery Engine (automated system for scientific discovery) (zenodo.org)

Show HN: Vybetr – Hire AI app developers using tools like Lovable, Bolt and more (vybetr.com)

Using Lxcfs Together with Podman (die-welt.net)

Lessons from LangChain and Slack and MCP Integration (medium.com)

Use of ch unit considered inappropriate (in certain circumstances) (clagnut.com)

Brit Watchdog Cracks Down on Data Collection by Smart TVs, Speakers, Air Fryers (theguardian.com)

Thoughts on the AI 2027 Discourse (dynomight.substack.com)

Childhood and Education #10: Behaviors (thezvi.substack.com)

When Can I Stop Listening to My Enemy's Points? (substack.com)

Show HN: Letter Lockbox – A word game I built over the weekend with Claude Code (letterlockbox.com)

Programmers and Their Monospace Blogs (lambdaland.org)

Ask HN: What's your fastest conversion from cold outreach to prepaid client?

Namespaced Pundit Policies Without the Repetition Racket (alec-c4.com)

The Legacy of "The Gastronomical Me" (lithub.com)

Show HN: How Usage Works (usage.ai)

Why Your Car's Touchscreen Is More Dangerous Than Your Phone (carsandhorsepower.com)

Dr. Dobb's (drdobbs.com)

Joining CNCF as Executive Director: Let's Build What's Next (cncf.io)

Elisa: A Comprehensive Guide to Enzyme-Linked Immunosorbent Assay (clyte.tech)

Secure your Express application APIs in 5 minutes with Cedar (aws.amazon.com)

Why Paris's Centre Pompidou, not even 50 years old, must close for five years (lemonde.fr)

Curated realities: An AI film festival and the future of human expression (arstechnica.com)

Scientists can now target the cells at the center of ALS (alleninstitute.org)

Haflang: Hardware Acceleration of Functional Languages (haflang.github.io)

Waldo – Geoip Lookups (geoip.dpdns.org)

David Friedberg: it is important for America that Mamdani get elected (twitter.com)

Portable Network Graphics (PNG) Specification (Third Edition) (w3.org)

EU lawmakers vote to bar carry-on luggage fees on planes (france24.com)

I Designed UX for an AI Product Last Year. Are Those Lessons Still Valid? (uxdesign.cc)

The Sun is twisting Mercury's crust in unexpected ways (bgr.com)

How to (Almost) solve cybersecurity once and for all (adaptive.live)

I Love GitOps (newsletter.masterpoint.io)

What It's Like to Be 'Mind Blind' (time.com)

Embabel: Framework for Building AI Agents with Java (thenewstack.io)

Epic Games and Qualcomm Are Bringing Fortnite to Windows 11 on Arm (thurrott.com)

Marginalia mania: how 'annotating' books went from no-no to BookTok's next trend (theguardian.com)

The AI Revolution: Human like interfaces, not intelligence (jaimefh.com)

Snyk Acquires Invariant Labs (snyk.io)

The Secret Rules of the Terminal (wizardzines.com)

Scaling Pinterest ML Infrastructure with Ray: From Training to ML Pipelines (medium.com)

Show HN: I built an AI thumbnail generator for YouTubers who can't design (thumbo.io)

A federal judge sides with Anthropic in lawsuit over training AI on books

Comments (118)