To be very clear on this point - this is not related to model training.
It’s important in the fair use assessment to understand that the training itself is fair use; the pirating of the books is the issue at hand here, and is what Anthropic “whoopsied” into when acquiring the training data.
Buying used copies of books, scanning them, and training on them is fine.
Rainbows End was prescient in many ways.
Agreed. Great book for those looking for a read: https://www.goodreads.com/book/show/102439.Rainbows_End
The author, Vernor Vinge, is also responsible for popularizing the term 'singularity'.
Taylor_OD · 1h ago
RIP to the legend. He has a lot of really fun ideas spread across his books.
florbnit · 5m ago
> Buying used copies of books, scanning them, and training on them is fine.
Buying used copies of books, scanning them, and printing them and selling them: not fair use
Buying used copies of books, scanning them, and making merchandise and selling it: not fair use
The idea that training models is considered fair use just because you bought the work is naive. Fair use is not a law to leave open usage as long as it doesn’t fit a given description. It’s a law that specifically allows certain usages like criticism, comment, news reporting, teaching, scholarship, or research.
Training AI models for purposes other than purely academic fits into none of these.
bigmadshoe · 31s ago
Buying used copies of books, scanning them, training an employee with the scans: fair use.
Unless legislation changes, model training is pretty much analogous to that. Now of course if the employee in question - or the LLM - regurgitates a copyrighted piece verbatim, that is a violation and would be treated accordingly in either case.
ants_everywhere · 1h ago
I wonder what Aaron Swartz would think if he lived to see the era of libgen.
klntsky · 1h ago
He died (2013) after libgen was created (2008).
ants_everywhere · 40m ago
I had no idea libgen was that old, thanks!
arcanemachiner · 1h ago
Yeah but did he die before anybody actually knew about it?
jay_kyburz · 47m ago
Is libgen even still around? I can't find any functioning URLs.
I believe that there's a reddit sub that keeps people up to date with what URLs are, or are not, functioning at any given point in time:
https://www.reddit.com/r/libgen/comments/1n4vjud/megathread_...
mdp2021 · 36m ago
> Buying used copies of books
It remains deranged.
Everyone has more than a right to freely read everything that is stored in a library.
(Edit: in fact I initially wrote 'is supposed to' in place of 'has more than a right to', meaning: "the knowledge is there, we made it available: you are supposed to access it, with the fullest encouragement".)
mvdtnz · 25m ago
Huh?
riquito · 19m ago
I think he's implying that because one can hypothetically borrow any book for free from a library, one could legally use them for training purposes, so the requirement of owning your own copy should be moot.
jimmydoe · 1h ago
Google scanned many books quite a while ago, probably way more than LibGen. Are they good to use them for training?
johanyc · 1h ago
If they legally purchased them, I don't see why not. IIRC they did borrow from libraries, so probably not every book in Google Books.
ortusdux · 56m ago
They litigated this a while ago and my understanding was that they were able to claim fair use, but I'm no expert.
What I'm wondering is if they, or others, have trained models on pirated content that has flowed through their networks?
mips_avatar · 1h ago
I imagine the problem there is they primarily scanned library books so I doubt they have the same copyright protections here as if they bought them
xnx · 1h ago
All those books were loaned by a library or purchased.
shortformblog · 13m ago
Thanks for the reminder that what the Internet Archive did in its case would have been legal if it was in service of an LLM.
kennywinker · 12m ago
LLM’s are turning out to be a real get-out-of-legal-responsibilities card, hey?
therobots927 · 1h ago
It is related to scalable model training, however. Chopping the spine off books and putting the pages in an automated scanner is not scalable. And don't forget the cost of 1) finding, 2) purchasing, 3) processing, and 4) recycling that volume of books.
debugnik · 1h ago
I guess companies will pay for the cheapest copies for liability and then use the pirated dumps. Or just pretend that someone lent the books to them.
Onavo · 1h ago
> Chopping the spine off books and putting the pages in an automated scanner is not scalable.
That's how Google Books, the Internet Archive, and Amazon (their book preview feature) operated before ebooks were common. It's not scalable-in-a-garage but perfectly scalable for a commercial operation.
hamdingers · 1h ago
We hem and haw about metaphorical "book burning" so much we forget that books themselves are not actually precious.
The books that are destroyed in scanning are a small minority compared to the millions discarded by libraries every year for simply being too old or unpopular.
knome · 1h ago
I remember them having a 3D page unwarping tech they built as well so they could photograph rare and antique books without hacking them apart.
therobots927 · 1h ago
Oh I didn't know that. That's wild
zer00eyz · 52m ago
> It’s important in the fair use assessment to understand that the training itself is fair use,
I think that this is a distinction many people miss.
If you take all the works of Shakespeare and reduce them to tokens and vectors, is it Shakespeare or is it factual information about Shakespeare? It is the latter, and as much as organizations like the MLB might want to be able to copyright a fact, you simply cannot do that.
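To make that concrete, here's a toy sketch (plain Python, a made-up six-word example, nothing like a real LLM tokenizer) of what "reducing text to tokens" looks like - what gets stored is integers and statistics about the text, not the text:

    # Toy illustration only: turn text into token IDs, then into counts.
    from collections import Counter

    text = "to be or not to be"

    # A tiny vocabulary mapping each distinct word to an integer ID.
    vocab = {word: i for i, word in enumerate(sorted(set(text.split())))}

    # "Tokenize": the text becomes a list of integers.
    token_ids = [vocab[word] for word in text.split()]
    print(token_ids)  # [3, 0, 2, 1, 3, 0]

    # What a model learns from is statistics over those IDs, e.g. how
    # often one token follows another - facts about the text.
    print(Counter(zip(token_ids, token_ids[1:])))

Of course, as the next paragraph notes, if those statistics get rich enough to reproduce long verbatim passages, the output itself is back in infringement territory.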
Take this one step further. If you buy the work and vectorize it, that's fine. But if you feed in the vectors for Harry Potter so many times that the model can reproduce half of the book, it becomes a problem when it spits out that copy.
And what about all the other stuff that LLMs spit out? Who owns that? Well, at present, no one. If you train a monkey or an elephant to paint, you can't copyright that work because they aren't human, and neither is an LLM.
If you use an LLM to generate your code at work, can you leave with that code when you quit? Does GPL3 or something like the Elastic Search license even apply if there is no copyright?
I suspect we're going to be talking about court cases a lot for the next few years.
Imustaskforhelp · 38m ago
Yes. Someone on this post mentioned that Switzerland allows downloading copyrighted material but not distributing it.
So things get even darker, because what counts as distribution can have a really vague definition, and maybe the AI companies will only follow the law just barely, just for the sake of not getting hit with a lawsuit like this again. But I wonder if all this case did was compensate the authors this one time. I doubt we will see a meaningful change in AI companies' attitudes towards fair use / essentially exploiting authors.
I feel like they will try to use as much legalspeak as possible to extract as much from authors (legally) without compensating them, which I feel is unethical, but sadly the law doesn't work on ethics.
arcticfox · 42m ago
> And what about all the other stuff that LLMs spit out? Who owns that? Well, at present, no one. If you train a monkey or an elephant to paint, you can't copyright that work because they aren't human, and neither is an LLM.
This seems too cute by half, courts are generally far more common sense than that in applying the law.
This is like saying using `rails generate model Example` results in a bunch of code that isn't yours, because the tool generated it according to your specifications.
kennywinker · 3m ago
The example is a real legal case afaik, or perhaps paraphrased from one (the 'monkey selfie' case, I believe - a macaque whose photos were ruled uncopyrightable because the author wasn't human).
I’d guess the legal scenario for `rails generate` is that you have a license to the template code (by way of how the tool is licensed) and the template code was written by a human so licensable by them and then minimally modified by the tool.
tomrod · 42m ago
I mean, sort of. The issue is that the compression is novel. So anything post tokenization could arguably be considered value add and not necessarily derivative work.
Onavo · 1h ago
Wdym Rainbows End was prescient?
ceejayoz · 1h ago
There's a scene early on where libraries are being destructively shredded, with the shreds scanned and reconstructed as digital versions.
wmf · 55m ago
Paying $3,000 for pirating a ~$30 book seems disproportionate.
vineyardmike · 42m ago
I feel like proportionality is related also to the scale. If a student pirates a textbook, I’d agree that 100x is excessive, but this is a corporation handsomely profiting off of mass piracy.
It’s crazy to imagine, but there was surely a document or Slack message thread discussing where to get thousands of books, and they just decided to pirate them and that was OK. This was entirely a decision based on ease or cost, not on the assumption it was legal. Piracy can result in jail time IIRC, so honestly it's lucky the employee who suggested this, or took the action, avoided direct legal liability.
Oh and I’m pretty sure other companies (meta) are in litigation over this issue, and the publishers knew that settlement below the full legal limit would limit future revenue.
imron · 20m ago
> handsomely profiting
Well actively generating revenue at least.
Profits are still hard to come by.
soks86 · 46m ago
Not if 100 companies did it and they all got away.
This is to teach a lesson because you cannot prosecute all thieves.
Yale Law Journal actually writes about this: the goal is to deter crime, because in most cases damages cannot be recovered or the criminal will never be caught in the first place.
vlovich123 · 42m ago
If in most cases damages cannot be recovered or the criminal will never be caught in the first place, then what is the lesson being taught? Doesn't that just create a moral hazard where you "randomly" choose who to penalize?
jdkee · 33m ago
It's about sending a message.
vlovich123 · 5m ago
The message being you’ll likely get away with it?
johnnyanmac · 23m ago
Fines should be disproportionate at this scale. So it discourages other businesses from doing the same thing.
robterrell · 46m ago
With the per-item limit for "willful infringement" being $150,000, it's a bargain.
gpm · 35m ago
And a low end of $750/item.
freejazz · 15m ago
Well it's willful infringement so a court would be entitled to add a punitive multiplier anyway. But this is something Anthropic agreed to, if that wasn't clear.
IncreasePosts · 35m ago
Realistically it will be $30 per book and $2,970 for the lawyers
gpm · 28m ago
That's not how class actions work. Ever.
In this specific case the settlement caps the lawyer fees at 25%, and even that is subject to the court's approval. In addition they will ask for $250k total ($50k per plaintiff) for the lead plaintiffs, also subject to the court's approval.
nicce · 1h ago
I guess they must delete all models since they acquired the source illegally and benefitted from it, right? Otherwise it just encourages others to keep going and pay the fines later.
greensoap · 1h ago
In a prior ruling, the court stated that Anthropic didn't train on the books subject to this settlement. The record is that Anthropic scanned physical books and used those for training. The pirated books were being held in a general purpose library and were not, according to the record, used in training.
reassess_blind · 27m ago
So how did they profit off the pirated books?
nicce · 53m ago
That is something which is extremely difficult to prove from either side.
It is 500,000 books in total, so did they really scan all those books instead of using the pirated versions? Even when they did not have much money in the early phases of the model race?
GodelNumbering · 1h ago
Settlement Terms (from the case pdf)
1. A Settlement Fund of at least $1.5 Billion: Anthropic has agreed to pay a minimum of $1.5 billion into a non-reversionary fund for the class members. With an estimated 500,000 copyrighted works in the class, this would amount to an approximate gross payment of $3,000 per work. If the final list of works exceeds 500,000, Anthropic will add $3,000 for each additional work.
2. Destruction of Datasets: Anthropic has committed to destroying the datasets it acquired from LibGen and PiLiMi, subject to any legal preservation requirements.
3. Limited Release of Claims: The settlement releases Anthropic only from past claims of infringement related to the works on the official "Works List" up to August 25, 2025. It does not cover any potential future infringements or any claims, past or future, related to infringing outputs generated by Anthropic's AI models.
privatelypublic · 57m ago
Don't forget: NO LEGAL PRECEDENT! Which means anybody suing has to start all over. You only settle at this point if you think you'll lose.
Edit: I'll get ratio'd for this, but it's the exact same thing Google did in its lawsuit with Epic. They delayed while the public and courts focused on Apple (oohh, EVIL Apple); Apple lost, and Google settled at a disadvantage before they had a legal judgment that couldn't be challenged later.
ignoramous · 48m ago
Or, if you think your competition, also caught up in the same quagmire, stands to lose more by battling for longer than you did?
privatelypublic · 37m ago
A valid touche! I still think Google went with delaying tactics as public and other pressures forced Apple's case forward at greater velocity. (Edit: implicit "and then caved when Apple lost"... because they're the same case.)
manbash · 53m ago
Thank you. I assumed it would be quicker to find the link to the case PDF here, but your summary is appreciated!
Indeed, it is not only the payout but also the destruction of the datasets. Although the article does quote:
> “Anthropic says it did not even use these pirated works,” he said. “If some other generative A.I. company took data from pirated source and used it to train on and commercialized it, the potential liability is enormous. It will shake the industry — no doubt in my mind.”
Even if true, I wonder how many cases we will see in the near future.
gooosle · 50m ago
So... it would be a lot cheaper to just buy all of the books?
gpm · 16m ago
Yes, much.
And they actually went and did that afterwards. They just pirated them first.
privatelypublic · 35m ago
The permissibility of buying and scanning them was already settled by Google Books in the '00s.
_alternator_ · 47m ago
They did, but only after they pirated the books to begin with.
privatelypublic · 43m ago
Few. This settlement potentially weakens all challenges to the use of copyrighted works in training LLMs. I'd be shocked if behind closed doors there wasn't some give and take on the matter between executives/investors.
A settlement means the claimants no longer have a claim, which means if they're also part of, say, the New York Times-affiliated lawsuit, they have to withdraw. A neat way of kneecapping a countrywide decision that LLM training on copyrighted material is subject to punitive measures, don't you think?
freejazz · 12m ago
That's not even remotely true. Page 4 of the settlement describes the released claims, which only relate to the pirating of books. Again, the amount of misinformation and misunderstanding I see in copyright-related threads here ASTOUNDS.
testing22321 · 14m ago
I’m an author, can I get in on this?
A_D_E_P_T · 6m ago
I had the same question.
It looks like you'll be able to search this site if the settlement is approved:
https://www.anthropiccopyrightsettlement.com/
If your work is there, you qualify for a slice of the settlement. If not, you're outta luck.
arjunchint · 1h ago
Wait so they raised all that money just to give it to publishers?
Can only imagine the pitch: yes, please give us billions of dollars. We are going to make a huge investment, like paying off our lawsuits.
Wowfunhappy · 1h ago
From the article:
> Although the payment is enormous, it is small compared with the amount of money that Anthropic has raised in recent years. This month, the start-up announced that it had agreed to a deal that brings an additional $13 billion into Anthropic’s coffers. The start-up has raised a total of more than $27 billion since its founding in 2021.
slg · 1h ago
Maybe small compared to the money raised, but it is in fact enormous compared to the money earned. Their revenue was under $1b last year and they projected themselves as likely to make $2b this year. This payout equals their average yearly revenue of the last two years.
masterjack · 51m ago
I thought they were projecting 10B and said a few months ago they have already grown from a 1B to 4B run rate
privatelypublic · 11m ago
It doesn't matter if they end up in chapter 11... If it kneecaps all the other copyright lawsuits. I won't pretend to know the exact legal details. But I am (unfortunately) old enough that this isn't my first "giant corporation benefits from legally and ethically dubious copyright adjacent activities, gets sued, settles/wins." (Cough, google books)
dkdcio · 1h ago
maybe I’m bad at math but paying >5% of your capital raised for a single fine doesn’t seem great from a business perspective
ryao · 13m ago
If it allowed them to move faster than their competition, I imagine management would consider it money well spent. They are expected to spend absurd amounts of money to get ahead. They were never expected to spend money efficiently if it meant taking additional months/years to get results.
carstenhag · 6m ago
Someone here commented saying they claimed they did not even use it for training, so apparently it was useless.
siliconpotato · 1h ago
It's VC money, I don't think anyone believes it's real money
Aachen · 24m ago
If it weren't, why are we taking it as legal tender? I certainly wouldn't mind being paid in VC money
bongodongobob · 1h ago
Yeah it does, cost of materials is way more than that if they were building something physical like a new widget or something. Same idea, they paid for their raw materials.
xnx · 30m ago
The money they don't pay out in settlements goes to Nvidia.
non_aligned · 1h ago
You're joking, but that's actually a good pitch. There was a significant legal issue hanging over their heads, with some risk of a potentially business-ending judgment down the line. This makes it go away, which makes the company a safer, more valuable investment. Both in absolute terms and compared to peers who didn't settle.
freejazz · 58m ago
It just resolves their liability with regards to books they purported they did not even train the models on, which is all that was left in this case after summary judgment. Sure the potential liability was company ending, but it's all a stupid business decision when it is ultimately for books they did not even train on.
It basically does nothing for them besides that. Given the split decisions so far, I'm not sure what value the Alsup decision is going to bring to the industry, moving forward, when it's in the context of books that Anthropic physically purchased. The other AI cases are generally not fact patterns where the LLM was trained with copyrighted materials that the AI company legally purchased copies of.
freejazz · 1h ago
They wanted to move fast and break things. No one made them.
KTaffer · 4m ago
This was a very tactical decision by Anthropic. They have just received Series F funding (https://www.anthropic.com/news/anthropic-raises-series-f-at-...), and they can now afford to settle this lawsuit.
OpenAI and Google will follow soon now that the precedent has been set, and will likely pay more.
It will be a net win for Anthropic.
r_lee · 55m ago
One thing that comes to mind is...
Is there a way to make your content on the web "licensed" in a way where it is only free for human consumption?
I.e. effectively making the use of AI crawlers pirating, thus subject to the same kind of penalties here?
gpm · 13m ago
Yes to the first part. Put your site behind a login wall that requires users to sign a contract to that effect before serving them the content... get a lawyer to write that contract. Don't rely on copyright.
I'm not sure to what extent you can specify damages like these in a contract, ask the lawyer who is writing it.
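As a minimal sketch of that approach (a hypothetical Flask app; the contract text and its enforceability are the lawyer's department, not the code's):

    # Click-wrap gate: serve content only to sessions that have
    # affirmatively accepted the terms. A sketch, not legal advice.
    from flask import Flask, abort, request, session

    app = Flask(__name__)
    app.secret_key = "change-me"  # placeholder secret for the session cookie

    TERMS = "I agree this content is licensed for human reading only."

    @app.post("/accept-terms")
    def accept_terms():
        # Require an explicit affirmative act (the click-wrap "signature").
        if request.form.get("agreement") != TERMS:
            abort(400)
        session["accepted_terms"] = True
        return "Terms accepted."

    @app.get("/content")
    def content():
        # No recorded acceptance in this session: no content served.
        if not session.get("accepted_terms"):
            abort(403)
        return "The protected article text goes here."

The point being that you'd then be relying on contract law, where you can define the permitted uses yourself, rather than on copyright, where training has been ruled transformative.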
7952 · 22m ago
Maybe some kind of captcha-like system could be devised that would count as a security measure under the DMCA and thus not be allowed to be circumvented. Make the same content available for a licence fee through an API.
Wowfunhappy · 48m ago
I'd argue you don't actually want this! You're suggesting companies should be able to make web scraping illegal.
That curl script you use to automate some task could become infringing.
Cheer2171 · 42m ago
No. Neither legally nor technically possible.
shadowgovt · 43m ago
I'm sure one can try, but copyright has all kinds of oddities and carve-outs that make this complicated. IANAL, but I'm fairly certain that, for example, if you tried putting in your content license "Free for all uses public and private, except academia, screw that ivory tower..." that's a sentiment you can express but universities are under no obligation legally to respect your wish to not have your work included in a course presentation on "wild things people put in licenses." Similarly, since the court has found that training an LLM on works is transformative, a license that says "You may use this for other things but not to train an LLM" couldn't be any more enforceable than a musician saying "You may listen to my work as a whole unit but God help you if I find out you sampled it into any of that awful 'rap music' I keep hearing about..."
The purpose of the copyright protections is to promote "sciences and useful arts," and the public utility of allowing academia to investigate all works(1) exceeds the benefits of letting authors declare their works unponderable to the academic community.
(1) And yet, textbooks are copyrighted and the copyright is honored; I'm not sure why the academic fair-use exception doesn't allow scholars to just copy around textbooks without paying their authors.
GMoromisato · 12m ago
If you are an author, here are a couple of relevant links:
You can search LibGen by author to see if your work is included. I believe this would make you a member of the class: https://www.theatlantic.com/technology/archive/2025/03/searc...
If you are a member of the class (or think you are), you can submit your contact information to the plaintiff's attorneys here: https://www.anthropiccopyrightsettlement.com/
(Sorry, meta question: how do we get the "Also: <link> <link>..." line to appear below the title and above the comment input in submissions? The text field on the "submit" page becomes a user's comment when the "url" field is also filled. I am missing something.)
petralithic · 1h ago
This is sad for open source AI: piracy for the purpose of model training should also be fair use, because otherwise only big companies like Anthropic, which can afford to pay off publishers, will be able to train. There is no way to buy billions of books just for model training; it simply can't happen.
bcrosby95 · 51m ago
Fair use isn't about how you access the material; it's about what you can do with it after you legally access it. If you don't legally access it, the question of fair use is moot.
sefrost · 1h ago
I wonder how much it would cost to buy every book that you'd want to train a model.
GMoromisato · 26m ago
500,000 x $20 = $10 million
Obviously there would be handling costs + scanning costs, so that’s the floor.
Maybe $20 million total? Plus, of course, the time it would take to execute.
dbalatero · 1h ago
This implies training models is some sort of right.
542458 · 1h ago
No, it implies that having the power to train AI models exclusively consolidated into a handful of extremely powerful companies is bad.
JoshTriplett · 39m ago
That's true. Those handful of companies shouldn't get to do it either.
johanyc · 1h ago
No. It means model training is transformative enough to be fair use. They should just be asked to pay the authors back plus reimbursement/punishment, say 10x the price of the pirated books.
golly_ned · 40m ago
> "The technology at issue was among the most transformative many of us will see in our lifetimes"
A judge making a ruling based on his opinion of how transformative a technology will be doesn't inspire confidence. There's an equivocation on the word "transformative" here: not just transformative in the fair use sense, but transformative as in world-changing, impactful, revolutionary. The latter shouldn't matter in a case like this.
> Companies and individuals who willfully infringe on copyright can face significantly higher damages — up to $150,000 per work
Settling for 2% is a steal.
> "In June, the District Court issued a landmark ruling on A.I. development and copyright law, finding that Anthropic's approach to training A.I. models constitutes fair use," Aparna Sridhar, Anthropic's deputy general counsel, said in a statement.
This is the highest-order bit, not the $1.5B in settlement. Anthropic's guilty of pirating.
Ekaros · 16m ago
The printing press, audio recording, movies, radio, and television were also transformative. They did not get rid of copyright; in fact, they brought it about.
I feel it is insane that authors do not receive some sort of standard compensation for each training use. Say a few hundred to a few thousand depending on the complexity of their work.
verdverm · 9m ago
Why would they earn more from models reading their works than I would pay to read it?
robterrell · 21m ago
As a published author who had works in the training data, can I take my settlement payout in the form of Claude Code API credits?
TBH I'm just going to plow all that money back into Anthropic... might as well cut out the middleman.
MaxikCZ · 1h ago
See kids? It's okay to steal if you steal more money than the fine costs.
ascorbic · 50m ago
They're paying $3000 per book. It would've been a lot cheaper to buy the books (which is what they actually did end up doing too).
ajross · 52m ago
That metaphor doesn't really work. It's a settlement, not a punishment, and this is payment, not a fine. Legally it's more like "The store wasn't open, so I took the items from the lot and paid them later".
It's not the way we expect people to do business under normal circumstances, but in new markets with new products? I guess I don't see much actually wrong with this. Authors still get paid a price they were willing to accept, and Anthropic didn't need to wait years to come to an agreement (again, publishers weren't actually selling what AI companies needed to buy!) before training their LLMs.
mhh__ · 1h ago
Maybe I would think differently if I was a book author but I can't help but think that this is ugly but actually quite good for humanity in some perverse sense. I will never, ever, read 99.9% of these books presumably but I will use claude.
on_meds · 2h ago
It will be interesting to see how this impacts the lawsuits against OpenAI, Meta, and Microsoft. Will they quickly try to settle for billions as well?
It’s not precedent setting but surely it’ll have an impact.
lewdwig · 1h ago
I'm sure this'll be misreported and wilfully misinterpreted given the current fractious state of the AI discourse, but the lawsuit was about piracy, not the copyright-compliance of LLMs, and in any case they settled out of court, thus presumably admitting no wrongdoing, so conveniently no legal precedent is established either way.
I would not be surprised if investors made their last round of funding contingent on settling this matter out of court precisely to ensure no precedents are set.
typs · 1h ago
Maybe, though this lawsuit is different in respect to the piracy issue. Anthropic is paying the settlement because they pirated the books, not because training on copyrighted books isn't fair use, which isn't necessarily how the other cases will play out.
Anthropic certainly seems to be hoping that their competitors will have to face some consequences too:
https://www.tomshardware.com/tech-industry/artificial-intell...
>During a deposition, a founder of Anthropic, Ben Mann, testified that he also downloaded the Library Genesis data set when he was working for OpenAI in 2019 and assumed this was “fair use” of the material.
Per the NYT article, Anthropic started buying physical books in bulk and scanning them for their training data, and they assert that no pirated materials were ever used in public models. I wonder if OpenAI can say the same.
SlowTao · 1h ago
That was my first thought. While not legal precedent, it does sort of open the floodgates for others.
nottorp · 1h ago
I thought $1.5B was the penalty for one torrent, not for a couple million torrents.
At least if you're a regular citizen.
taftster · 56m ago
Make sure to grab the mother-of-all-torrents, I guess, if you're going to go down that path. That way you get more bang for your $1.5B penalty.
ipaddr · 37m ago
A million torrents would cost $1,500 each.
ggm · 3m ago
... in one economy and for specific authors and publishers. But the offence is global in impact on authors worldwide, and the consequences for other IPR laws remain to be seen.
qqbooks · 8m ago
So if a startup wants to buy book PDFs legally to use for AI purposes, any suggestions on how to do that?
jarjoura · 11m ago
I'm excited for the moment when these models are able to use copyrighted work in a fair-use way that pays out to authors, the way Spotify does when you listen to a song. Why? Because authors receiving royalties for their works when they get used in some prompt would likely make them far more accepting of LLMs.
Also, passing the cost on to consumers of generated content, since companies would now need to pay royalties on the back end, should increase the cost of generating slop and hopefully push back against that trend.
This shouldn't just be books, but all written content, like scholarly journals and essays, news articles and blogs, etc.
I realize this is just wishful thinking, but there's got to be some nugget of aspirational desire to pay it forward.
novok · 1h ago
I wonder which country will be the first to make an exception to copyright law for model-training libraries to attract tax revenue, like Ireland did for tech companies in the EU. Japan is part of the way there, but you couldn't do a Common Crawl type thing. You could even make it a Library of Congress type of setup.
tonfa · 54m ago
As long as you're not distributing, it's legal in Switzerland to download copyrighted material. (Switzerland was on the naughty US/MPAA list for a while, might still be)
Imustaskforhelp · 44m ago
Is it distribution, though, if someone trains a model in Switzerland by downloading copyrighted material, training AI on it, and then distributing the model...
Or what if they're not even distributing the model, but rather distributing the outputs of the LLM (a closed-source LLM like Anthropic's)?
I am genuinely curious as to whether there is some gray area that might be exploited by AI companies, as I am pretty sure they don't want to pay $1.5B yet still want to exploit the works of authors (let's call a spade a spade).
HDThoreaun · 27m ago
Using copyrighted material to train AI is a legal grey zone. The NYT vs OpenAI case is litigating this. The Anthropic settlement here is about how the material was obtained. If OpenAI wins their case and Switzerland rules the same way, I don't think there would be a problem.
markasoftware · 1h ago
They also agreed to destroy the pirated books. I wonder how large of a portion of their training data comes from these shadow libraries, and if AI labs in countries that have made it clear they won't enforce anti-piracy laws against AI companies will get a substantial advantage by continuing to use shadow libraries.
somanyphotons · 1h ago
Perhaps they'll quickly rent the whole contents of a few physical libraries and then scan them all
lxe · 4m ago
A terrible precedent that guarantees China a win in the AI race
bhaktatejas922 · 33m ago
This weirdly seems like it's the best mechanism to buy this much data.
Imagine going to 500k publishers to buy it individually. $3k per book is way cheaper. The copyright system is turning into a data marketplace in front of our eyes.
daemonologist · 4m ago
I suspect you could acquire and scan every readily purchasable book for much less than $3k each. Scanhouse, for instance, charges $0.15 per page for regular unbound (disassembled) books, plus $0.25 per page for supervised OCR, plus another dollar per page if the formatting is especially complex; this comes out to maybe $200-300 for a typical book. Acquiring, shipping, and disposing of them all would of course cost more, but not thousands more.
The main cost of doing this would be the time - even if you bought up all the available scanning capacity it would probably take months. In the meantime your competition who just torrented everything would have more high-quality training data than you. There are probably also a fair number of books in libgen which are out of print and difficult to find used.
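A quick back-of-the-envelope, using the rates above plus the ~$20-per-copy purchase price assumed elsewhere in the thread (all of these inputs are assumptions, not vendor quotes):

    # Back-of-the-envelope: buying and scanning ~500k books.
    # All inputs are assumptions from this thread, not verified quotes.
    books = 500_000
    pages_per_book = 500        # assumed; consistent with the $200-300/book estimate
    scan_per_page = 0.15        # unbound scanning rate
    ocr_per_page = 0.25         # supervised OCR rate
    purchase_per_book = 20.0    # used-copy price assumed upthread

    per_book = pages_per_book * (scan_per_page + ocr_per_page) + purchase_per_book
    total = books * per_book
    print(f"~${per_book:,.0f} per book, ~${total / 1e6:,.0f}M total")
    # -> ~$220 per book, ~$110M total, vs. $3,000 per work in the settlement

So even under generous assumptions, buying and scanning would have been an order of magnitude cheaper than the settlement; the real cost, as noted, is time.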
mgraczyk · 27m ago
It's a tiny amount of data relatively speaking. Much more expensive per token than almost any data source imaginable
mooreds · 57m ago
Anyone have a link to the class action? I published a book and would love to know if I'm in the class.
hetspookjee · 50m ago
Deep Research on Claude, perhaps, for some irony if you will.
nextworddev · 39m ago
Wait, I’m a published author, where’s my check
gpm · 4m ago
The court has to give preliminary approval to the settlement first. After that there should be a notice period during which the lawyers will attempt to reach out and tell you what you need to do to receive your money. (Not a lawyer, not legal advice.)
You can follow the case here: https://www.courtlistener.com/docket/69058235/bartz-v-anthro...
You can see the motion for settlement (what the news article is about) here: https://storage.courtlistener.com/recap/gov.uscourts.cand.43...
For legal observers, Judge William Haskell Alsup’s razor-sharp distinction between usage and acquisition is a landmark precedent: it secures fair use for transformative generative AI while preserving compensation for copyright holders. In a just world, this balance would elevate him to the highest court of the land, but we are far from a just world.
$1.5B is nothing but a handslap for the big gold rush companies.
It's less than 1% of Anthropic's valuation -- a valuation utterly dependent on all the hoovering up of others' copyrighted works.
AFAICT, if this settlement signals that the typical AI foundation model company's massive-scale commercial theft doesn't result in judgments that wipe out the company (and its execs), then we have confirmation that it's a free-for-all for all the other AI gold rush companies.
Then making deals to license rights, in sell-it-to-us-or-we'll-just-take-it-anyway deals, becomes only a routine and optional corporate cost-reduction exercise, not anything the execs will lose sleep over if it's inconvenient.
ianks · 18m ago
There are alternatives to wiping out the company that could be fair. For example, a judgment awarding shares of the company, or revenue shares in the future, rather than a one-time payoff.
Writers were the true “foundational” piece of LLMs, anyway.
neilv · 11m ago
If this is an economist's idea of fair, where is the market?
If someone breaks into my house and steals my valuables, without my consent, then giving me stock in their burglary business isn't much of a deterrent to them and other burglars.
Deterrence/prevention is my real goal, not the possibility of getting a token settlement from whatever bastard rips me off.
We need the analogue of laws and police, or the analogue of the homeowner with a shotgun.
xnx · 28m ago
> It's less than 1% Anthropic's valuation
The settlement is real money though. Valuation is imaginary.
Now how about Meta and their questionable means of acquiring tons of content?
tomrod · 28m ago
Maybe it's time to get some Llama models copied before an overzealous court rules badly.
rvz · 2h ago
> A trial was scheduled to begin in December to determine how much Anthropic owed for the alleged piracy, with potential damages ranging into the hundreds of billions of dollars.
Anthropic knew that this trial could totally bankrupt them had they maintained their innocence and continued to fight the case.
But of course, there's too much money on the line, which means that even though Anthropic settled (conceding the piracy claims after profiting off of pirated books), they knew there was no way they could be sure of winning that case, and it was not worth taking the risk.
> The pivotal fair-use question is still being debated in other AI copyright cases. Another San Francisco judge hearing a similar ongoing lawsuit against Meta ruled shortly after Alsup's decision that using copyrighted work without permission to train AI would be unlawful in "many circumstances."
The first of many.
f33d5173 · 1h ago
If it was a sure thing, then the rights holders wouldn't have accepted a settlement deal for a measly couple billion. Both sides are happier to avoid risking losing the suit.
Ekaros · 11m ago
Also, knowing how pro-corporate the legal system is, piercing the veil and going after everyone holding the stock would have been unlikely. So getting $1.5 billion out of them could well have been the reasonable move. Otherwise they could have just burned all the money and flipped whatever was left over to someone else, at an uncertain price and horizon.
Robotbeat · 1h ago
Wait, DID they admit guilt? A lot of times companies settle without admitting guilt.
deafpolygon · 1h ago
Honestly, this is a steal for Anthropic.
unaut · 1h ago
I guess this settlement could be a landmark moment. $1.5 billion is a staggering figure and I hope it sends a clear signal that AI companies can't just treat creative work as free training data.
typs · 1h ago
I mean, the ruling does in fact find that training on this particular kind of creative work qualifies as fair use.
HDThoreaun · 24m ago
All the AI companies are still using books as training data. They're just finding the cheapest scanned copies they can get their hands on to cover their asses.
thinkingtoilet · 1h ago
Great. Which rich person is going to jail for breaking the law?
emtel · 39m ago
No one, rich or poor, goes to jail for downloading books.
mdp2021 · 6m ago
Are you sure? I think in some jurisdictions they would, according to the law.
missedthecue · 41m ago
This isn't a criminal case so zero people of any financial position would end up in prison.
non_aligned · 1h ago
I'm gonna say one thing. If you agree that something was unfairly taken from book authors, then the same thing was taken from people publishing on the web, and on a larger scale.
Book authors may see some settlement checks down the line. So might newspapers and other parties that can organize and throw enough $$$ at the problem. But I'll eat my hat if your average blogger ever sees a single cent.
varenc · 15m ago
Books aren't hosted publicly online free for anyone to access. The court seems to think buying a book and scanning it is fair use. Just using pirated books is forbidden. Blogs weren't accessed via pirating.
taejavu · 57m ago
The blogger’s content was freely available, this fine is for piracy.
non_aligned · 53m ago
This is not a fine, it's a settlement to recompense authors.
More broadly, I think that's a goofy argument. The books were "freely available" too. Just because something is out there, doesn't necessarily mean you can use it however you want, and that's the crux of the debate.
ascorbic · 42m ago
It's not the crux of this case. This is a settlement based on the judge's ruling that the books had been illegally downloaded. The same judge said that the training itself was not the problem; it was downloading the pirated books. It will be tough to argue that loading a public website is an illegal download.
emtel · 39m ago
But you can use copyrighted works for transformative works under the fair-use doctrine, and training was ruled to be fair use in the previous ruling.
ascorbic · 45m ago
The settlement was for downloading the pirated books, not training from them. Unless they're paywalled it would be hard to argue the same for a blog.