Meta's Llama 3.1 can recall 42 percent of the first Harry Potter book

105 points by aspenmayer | 137 comments | 6/15/2025, 11:41:53 AM | understandingai.org ↗

Comments (137)

paxys · 2h ago
As an experiment I searched Google for "harry potter and the sorcerer's stone text":

- the first result is a pdf of the full book

- the second result is a txt of the full book

- the third result is a pdf of the complete harry potter collection

- the fourth result is a txt of the full book (hosted on github funny enough)

Further down there are similar copies from the internet archive and dozens of other sites. All in the first 2-3 pages.

I get that copyright is a problem, but let's not pretend that an LLM that autocompletes a couple lines from harry potter with 50% accuracy is some massive new avenue to piracy. No one is using this as a substitute for buying the book.

TGower · 2m ago
People aren't buying Harry Potter action figures as a substitute for buying the book either, but copyright protects creators from other people swooping in and using their work in other mediums. There is obviously a huge market demand for high-quality data for training LLMs; Meta just spent 15 billion on a data labeling company. Companies training LLMs on copyrighted material without permission are doing that as a substitute for obtaining a license from the creator, in the same way that a pirate downloading a torrent is a substitute for getting an ebook license.
OtherShrezzing · 1h ago
I think the argument is less about piracy and more that the model(s output) is a derivative work of Harry Potter, and the rights holder should be paid accordingly when it’s reproduced.
psychoslave · 33m ago
From an economic point of view, the main issue is that copyright is not the framework we need for social justice, where everyone flourishes by enjoying the pre-existing treasures of human heritage and fairly contributing back.

There is no moral or justice ground to leverage when the system is designed to create a wealth bottleneck toward a few recipients.

Harry Potter is a great piece of artistic work, and it's nice that its author could make her way out of a precarious position. But a great society should strive to ensure that no one is in such a situation in the first place.

Rowling has already received more than all she needs to thrive, I guess. I'm confident that there are plenty of other talented authors out there who will never have such a broad avenue of attention grabbing, which is okay. But that they are stuck in terrible economic situations is not okay.

The copyright lottery, or the startup lottery, is not that different from the standard lottery; they just put so much pressure on the players that they get stuck in the narrative that merit for hard effort is the key component of the gained wealth.

paxys · 1h ago
That may be relevant in the NYT vs OpenAI case, since NYT was supposedly able to reproduce entire articles in ChatGPT. Here Llama is predicting one sentence at a time when fed the previous one, with 50% accuracy, for 42% of the book. That can easily be written off as fair use.
gpm · 1h ago
I'm pretty sure books.google.com does the exact same with much better reliability... and the US courts found that to be fair use. (Agreeing with parent comment)
pclmulqdq · 1h ago
If there is a circuit split between it and NYT vs OAI, the Google Books ruling (in the famously tech-friendly ninth circuit) may also find itself under review.
echelon · 1h ago
> Here Llama is predicting one sentence at a time when fed the previous one, with 50% accuracy, for 42% of the book. That can easily be written off as fair use.

Is that fair use, or is that compression of the verbatim source?

geysersam · 1h ago
If the assertion in the parent comment is correct "nobody is using this as a substitute to buying the book" why should the rights holders get paid?
riffraff · 57m ago
The argument is meta used the book so the LLM can be considered a derivative work in some sense.

Repeat for every copyrighted work and you end up with publishers reasonably arguing meta would not be able to produce their LLM without copyrighted work, which they did not pay for.

It's an argument for the courts, of course.

w0m · 55m ago
The argument is whether the LLM training on the copyrighted work is Fair Use or not. Should META pay for the copyright on works it ingests for training purposes?
abtinf · 1h ago
You really don't see the difference between Google indexing the content of third parties and directly hosting/distributing the content itself?
nashashmi · 32m ago
The way I see it is that an LLM took search results and outputted that info directly. Besides, I think that if an LLM was able to reproduce 42%, assuming that it is not continuous, I would say that is fair use.
imgabe · 1h ago
Hosting model weights is not hosting / distributing the content.
abtinf · 1h ago
Of course it is.

It's just a form of compression.

If I train an autoencoder on an image, and distribute the weights, that would obviously be the same as distributing the content. Just because the content is commingled with lots of other content doesn't make it disappear.

Besides, where did the sections of text from the input works that show up in the output text come from? Divine inspiration? God whispering to the machine?

imgabe · 1h ago
Have you ever repeated a line from your favorite movie or TV show? Memorized a poem? Guess the rights holders better sue you for stealing their content by encoding it in your wetware neural network.

Possibly copying the content to train the model could be infringing if it doesn't fall under fair use, but the weights themselves are not simply compressed content. For one thing, they are probabilistic, so you wouldn't get the same content back every time like you would with a compression algorithm.

vrighter · 57s ago
I have, but I never tried to make any money off of it either
bakugo · 31m ago
> Have you ever repeated a line from your favorite movie or TV show? Memorized a poem? Guess the rights holders better sue you for stealing their content by encoding it in your wetware neural network.

I see this absolute non-argument regurgitated ad infinitum in every single discussion on this topic, and at this point I can't help but wonder: doesn't it say more about the person who says it than anything else?

Do you really consider your own human speech no different than that of a computer algorithm doing a bunch of matrix operations and outputting numbers that then get turned into text? Do you truly believe ChatGPT deserves the same rights to freedom of speech as you do?

imgabe · 21m ago
Who said anything about freedom of speech? Nobody is claiming the LLM has free speech rights, which don't even apply to infringing copyright anyway. Freedom of speech doesn't give me the right to make copies of copyrighted works.

The question is whether the model weights constitute a copy of the work. I contend that they do not; or, if they do, then so do the analogous weights (reinforced neural pathways) in your brain, which is clearly absurd and is intended to demonstrate the absurdity of considering a probabilistic weighting that produces similar text to be a copy.

bakugo · 10m ago
> Freedom of speech doesn't give me the right to make copies of copyrighted works.

No, but it gives you the right to quote a line from a movie or TV show without being charged with copyright infringement. You argued that an LLM deserves that same right, even if you didn't realize it.

> than so do the analogous weights (reinforced neural pathways) in your brain

Did your brain consume millions of copyrighted books in order to develop into what it is today? Would your brain be unable to exist in its current form if it had not consumed those millions of books?

abtinf · 50m ago
Your first point is intentionally obtuse.

Your second point concedes the argument.

imgabe · 37m ago
No, the second point does not concede the argument. You were talking about the model output infringing the copyright, the second point is talking about the model input infringing the copyright, e.g. if they made unauthorized copies in the process of gathering data to train the model such as by pirating the content. That is unrelated to whether the model output is infringing.

You don't seem to be in a very good position to judge what is and is not obtuse.

aschobel · 20m ago
Indeed! It is a form of massive lossy compression.

> Llama 3 70B was trained on 15 trillion tokens

That's roughly a 200x "compression" ratio, compared to 3-7x for traditional lossless text compression like bzip and friends.

LLMs don't just compress, they generalize. If they could only recite Harry Potter perfectly but couldn't write code or explain math, they wouldn't be very useful.
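One way to arrive at the rough 200x figure is simply the ratio of training tokens to parameters (a back-of-the-envelope sketch, not a measured number):

```python
# Back-of-the-envelope ratio behind the "200x" figure above,
# assuming the quoted 15-trillion-token training set.
tokens = 15e12   # training tokens for Llama 3 70B, per the quote
params = 70e9    # model parameters
print(tokens / params)  # ≈ 214 tokens per parameter, i.e. roughly "200x"
```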

Zambyte · 1h ago
Where are they putting any blame on Google here?
abtinf · 1h ago
Where did I say they were?
vrighter · 2m ago
So? Am I allowed to also ignore certain laws if I can prove others have also ignored them?
BobbyTables2 · 1h ago
Indeed, but since when is a blatantly derived work using 50% of a copyrighted work without permission a paragon of copyright compliance?

Music artists get in trouble for using more than a sample without permission — imagine if they just used 45% of a whole song instead…

I’m amazed AI companies haven’t been sued to oblivion yet.

This utter stupidity only continues because we named a collection of matrices “Artificial Intelligence” and somehow treat it as if it were a sentient pet.

Amassing troves of copyrighted works illegally into a ZIP file wouldn’t be allowed. The fact that the meaning was compressed using “Math” makes everyone stop thinking because they don’t understand “Math”.

yorwba · 1h ago
Music artists get in trouble for using more than a sample from other music artists without permission because their work is in direct competition with the work they're borrowing from.

A ZIP file of a book is also in direct competition of the book, because you could open the ZIP file and read it instead of the book.

A model that can take 50 tokens and give you a greater than 50% probability for the 50 next tokens 42% of the time is not in direct competition with the book, since starting from the beginning you'll lose the plot fairly quickly unless you already have the full book, and unlike music sampling from other music, the model output isn't good enough to read it instead of the book.
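To illustrate why "you'll lose the plot fairly quickly" (my own back-of-the-envelope, not from the paper): if each 50-token continuation succeeds with probability 0.5, the chance of chaining many of them to rebuild a long passage decays geometrically.

```python
# Probability of chaining k independent 50-token continuations,
# each succeeding with probability p_step (illustrative numbers only).
def chain_success_prob(p_step: float, k: int) -> float:
    return p_step ** k

# Rebuilding ~1,000 tokens (20 hops of 50) from a 50-token seed:
print(chain_success_prob(0.5, 20))  # ≈ 9.5e-07
```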

Dylan16807 · 1h ago
> a blatantly derived work only using 50% of a copyrighted work without permission

What's the work here? If it's the output of the LLM, you have to feed in the entire book to make it output half a book so on an ethical level I'd say it's not an issue. If you start with a few sentences, you'll get back less than you put in.

If the work is the LLM itself, something you don't distribute is much less affected by copyright. Go ahead and play entire songs by other artists during your jam sessions.

colechristensen · 1h ago
>Amassing troves of copyrighted works illegally into a ZIP file wouldn’t be allowed. The fact that the meaning was compressed using “Math” makes everyone stop thinking because they don’t understand “Math”.

LLMs are in reality the artifacts of lossy compression of significant chunks of all of the text ever produced by humanity. The "lossy" quality makes them able to predict new text "accurately" as a result.

>compressed using “Math”

This is every compression algorithm.

choppaface · 1h ago
A key premise is that LLMs will probably replace search engines and re-imagine the online ad economy. So today is a key moment for content creators to re-shape their business model, and that can include copyright law (as much as or more than the DMCA did).

Another key point is that you might download a Llama model and implicitly get a ton of copyright-protected content. Versus with a search engine you’re just connected to the source making it available.

And would the LLM deter a full purchase? If the LLM gives you your fill for free, then maybe yes. Or, maybe it’s more like a 30-second preview of a hit single, which converts into a $20 purchase of the full album. Best to sue the LLM provider today and then you can get some color on the actual consumer impact through legal discovery or similar means.

aprilthird2021 · 1h ago
> let's not pretend that an LLM that autocompletes a couple lines from harry potter with 50% accuracy is some massive new avenue to piracy. No one is using this as a substitute for buying the book.

Well, luckily the article points out what people are actually alleging:

> There are actually three distinct theories of how training a model on copyrighted works could infringe copyright:

> Training on a copyrighted work is inherently infringing because the training process involves making a digital copy of the work.

> The training process copies information from the training data into the model, making the model a derivative work under copyright law.

> Infringement occurs when a model generates (portions of) a copyrighted work.

None of those claim that these models are a substitute for buying the books. That's not what the plaintiffs are alleging. Infringing on a copyright is not only a matter of piracy (piracy is one of many ways to infringe copyright).

theK · 1h ago
I think that last scenario seems to be the most problematic. Technically it is the same thing that piracy via torrent does: distributing a small piece of copyrighted material without the copyright holder's consent.
paxys · 1h ago
People aren't alleging this, the author of the article is.
timeon · 17m ago
Is this whataboutism?

Anyway, it is not the same. One points you to a pirated source on a specific request; the other uses the content to create other content, and not just on direct request, since it was part of the training data. Nihilists would then point out that 'people do the same', but they don't, as we do not have the same capabilities for processing content.

eviks · 1h ago
Let's also not pretend that "massive new" is the only relevant issue
rnkn · 1h ago
You were so close! The takeaway is not that LLMs represent a bottomless tar pit of piracy (they do) but that someone can immediately perform the task 58% better without the AI than with it. This is nothing more than "look what the clever computer can do."
zmmmmm · 2h ago
It's important to note the way it was measured:

> the paper estimates that Llama 3.1 70B has memorized 42 percent of the first Harry Potter book well enough to reproduce 50-token excerpts at least half the time

As I understand it, it means that if you prompt it with some actual context from a specific subset amounting to 42% of the book, it completes it with the next 50 tokens from the book, 50% of the time.

So 50 tokens is not really very much, it's basically a sentence or two. Such a small amount would probably generally fall under fair use on its own. To allege a true copyright violation you'd still need to show that you can chain those together or use some other method to build actual substantial portions of the book. And if it only gets it right 50% of the time, that seems like it would be very hard to do with high fidelity.
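The criterion described above can be sketched roughly like this (my own simplification; the paper slides in 10-character steps and scores actual token probabilities, here reduced to a pluggable `excerpt_prob` stand-in):

```python
# Simplified sketch of the memorization criterion: slide a window over
# the book, prompt with 50 tokens, and count a position as memorized if
# the model would reproduce the next 50 tokens with probability > 0.5.
# `excerpt_prob` is a stand-in for a real model query.
def memorized_fraction(book_tokens, excerpt_prob, window=50):
    positions = range(len(book_tokens) - 2 * window + 1)
    hits = sum(
        1
        for i in positions
        if excerpt_prob(book_tokens[i : i + window],
                        book_tokens[i + window : i + 2 * window]) > 0.5
    )
    return hits / len(positions)

# Toy check: a "model" that always scores 0.9 memorizes 100% of the text.
print(memorized_fraction(list(range(200)), lambda prompt, target: 0.9))  # 1.0
```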

Having said all that, what is really interesting is how different the latest Llama 70b is from previous versions. It does suggest that Meta maybe got a bit desperate and started over-training on certain materials that greatly increased its direct recall behaviour.

Aurornis · 2h ago
> So 50 tokens is not really very much, it's basically a sentence or two. Such a small amount would probably generally fall under fair use on its own.

That’s what I was thinking as I read the methodology.

If they dropped the same prompt fragment into Google (or any search engine) how often would they get the next 50 tokens worth of text returned in the search results summaries?

vintermann · 1h ago
All this study really says, is that models are really good at compressing the text of Harry Potter. You can't get Harry Potter out of it without prompting it with the missing bits - sure, impressively few bits, but is that surprising, considering how many references and fair use excerpts (like discussion of the story in public forums) it's seen?

There's also the question of how many bits of originality there actually are in Harry Potter. If trained strictly on text up to the publishing of the first book, how well would it compress it?

fiddlerwoaroof · 56m ago
The alternate here is that Harry Potter is written with sentences that match the typical patterns of English and so, when you prompt with a part of the text, the LLM can complete it with above-random accuracy
vintermann · 47m ago
Anything that can tell you what the typical patterns of English are is going to be a language model by definition.
fiddlerwoaroof · 38m ago
My point is that this might just prove that Harry Potter is the sort of prose “fancy autocomplete” would produce and not all that original.

EDIT Actually, on rereading, I see I replied to the wrong comment.

fiddlerwoaroof · 54m ago
Or else, LLMs show that copyright and IP are ridiculous concepts that should be abolished
bee_rider · 2h ago
Even if it is recalling it 50 tokens at a time, the half of the book is in some sense in there, right?
zmmmmm · 1h ago
yeah ... it's going to depend on how the issue is framed. However a "copy" of something where there is no way to practically extract the original from it probably has a pretty good argument that it's not really a "copy". For example, a regular dictionary probably has 99% of Harry Potter in it. Is it a copy?
vintermann · 1h ago
I'd say no. More than half of as-yet unwritten books will be in there too, because I bet it will compress the text of a freshly published book much better than 50% (and newer models could even compress new books to a fiftieth of their size, which is more like what 1 in 50 tokens suggests).
bee_rider · 34m ago
That seems like a reasonably easy test to run, right? All you need is a bit of prose that was known not to have been written beforehand. Actually, the experiment could be run using the paper itself!
arthurcolle · 1h ago
You could prove this much better by looking at something like this: https://cookbook.openai.com/examples/using_logprobs
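Following that suggestion, once you have per-token logprobs for a candidate excerpt (e.g. via the `logprobs` option in the linked cookbook), the verbatim-reproduction probability is just the exponential of their sum (a minimal sketch of the arithmetic, not tied to any particular API):

```python
import math

# Probability of a model emitting a whole excerpt verbatim, given the
# per-token logprobs of that excerpt: exp(sum of logprobs).
def excerpt_probability(token_logprobs):
    return math.exp(sum(token_logprobs))

# Even 50 tokens each at probability 0.99 only give ~0.6 overall:
print(excerpt_probability([math.log(0.99)] * 50))  # ≈ 0.605
```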
adrianN · 2h ago
Fair use is not a thing in every jurisdiction. In Germany for example there are cases where three words („wir sind Papst“) fall under copyright.
yorwba · 1h ago
Germany does not have something called "fair use," but it does have provisions for uses that are fair. For example your use of the three words to talk about their copyrighted status is perfectly legal in Germany. That somebody wasn't allowed to use them in a specific way in the past doesn't mean that nobody is allowed to use them in any way.
amanaplanacanal · 2h ago
Fair use is a four-part test, and the amount of copying is only one of the four parts.
xnx · 2h ago
This sounds almost like "Works every time (50% of the time)."
hsbauauvhabzb · 2h ago
Except the odds of it happening even 50% of the time are less likely than winning the lottery multiple times. All while illegally ingesting copyrighted material without the consent of (and presumably against the wishes of) the copyright holder.
raincole · 2h ago
(Disclaimer: haven't read the original paper)

It sounds like a ridiculous way to measure it. Producing 50-token excerpts absolutely doesn't translate to "recall X percent of Harry Potter" for me.

(Edit: I read this article. Nothing burger if its interpretation of the original paper is correct.)

tanaros · 2h ago
Their methodology seems reasonable to me.

To clarify, they look at the probability a model will produce a verbatim 50-token excerpt given the preceding 50 tokens. They evaluate this for all sequences in the book using a sliding window of 10 characters (NB: not tokens). Sequences from Harry Potter have substantially higher probabilities of being reproduced than sequences from less well-known books.

Whether this is "recall" is, of course, one of those tricky semantic arguments we have yet to settle when it comes to LLMs.
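A quick consequence of that criterion (my arithmetic, not the paper's): for a 50-token excerpt to be reproduced verbatim with probability above 0.5, the geometric-mean per-token probability must be extremely high.

```python
# Per-token probability implied by the 50%-over-50-tokens bar:
# p_token ** 50 > 0.5  =>  p_token > 0.5 ** (1/50)
p_token_floor = 0.5 ** (1 / 50)
print(p_token_floor)  # ≈ 0.9862
```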

fuzzbazz · 12h ago
From a quick web search I can find that there are book review sites that allow users to enter and rate verbatim "quotes" from books. This one [1] contains ~2000 [2] portions of a sentence, a paragraph or several paragraphs of Harry Potter and the Sorcerer's Stone.

Could it be plausible that an LLM ingested parts of the book by scraping web pages like this, rather than the full copyrighted book, and got results similar to those of the linked study?

[1] https://www.goodreads.com/work/quotes/4640799-harry-potter-a...

[2] ~30 portions x 68 pages

paxys · 2h ago
Meta has trained on LibGen so we don't really need to speculate.

https://www.wired.com/story/new-documents-unredacted-meta-co...

aprilthird2021 · 1h ago
This is in fact mentioned and addressed in the article. Also, there is pretty clear cut evidence Meta used pirated book data sets knowingly to train the earlier Llama models
aspenmayer · 5h ago
Sure, why not? lol

https://www.reddit.com/r/DataHoarder/comments/1entowq/i_made...

https://github.com/shloop/google-book-scraper

The fact that Meta torrented Books3 and other datasets seems to be by self-admission by Meta employees who performed the work and/or oversaw those who themselves did the work, so that is not really under dispute or ambiguous.

https://torrentfreak.com/meta-admits-use-of-pirated-book-dat...

redox99 · 2h ago
Books3 was used in Llama1. We don't know if they used it later on.
aspenmayer · 2h ago
My comparison was illustrative and analogous in nature. The copyright cartel is making a fruit of the poisonous tree type of argument. Whatever Meta are doing with LLMs is doing the heavy lifting that parity files used to do back in the Usenet days. I wouldn’t be surprised if BitTorrent or other similar caching and distribution mechanisms incorporate AI/LLMs to recognize an owl on the wire, draw the rest just in time in transit, and just send the diffs, or something like that.

The pictures are the same. All roads lead to Rome, so they say.

aprilthird2021 · 1h ago
All of the major AI models these days use "clean" datasets stripped of copyrighted material.

They also use data from the previous models, so I'm not sure how "clean" it really is

dragonwriter · 1h ago
> All of the major AI models these days use "clean" datasets stripped of copyrighted material.

Which of the major commercial models discloses its dataset? Or are you just trusting some unfalsifiable self-serving PR characterization?

pclmulqdq · 1h ago
All written text is copyrighted, with few exceptions like court transcripts. I own the copyright to this inane comment. I sincerely doubt that all copyrighted material is scrubbed.
Tepix · 1h ago
Your brief comment is hardly copyrightable. Which makes your point moot.
briffid · 26m ago
Quotation is fair use in all sensible copyright systems. An LLM will mostly be able to quote anything, and should be. Quotation is not derived work. LLMs are not stealing copyrighted work. They just show that Harry Potter is in English and a mostly logical story. If someone is stabbed, they will die in most stories; that's not copyrightable. If you have an engine that knows everything, it will be able to quote everything.
gpm · 2h ago
I think it's important to recognize here that fanfiction.net has 850 thousand distinct pieces of Harry Potter fanfiction on it. Fifty thousand of these are more than 40k words in length. Many of them (no easy way to measure) directly reproduce parts of the original books.

archiveofourown.org has 500 thousand, some, but probably not the majority, of which are duplicated from fanfiction.net. 37 thousand of these are over 40 thousand words.

I.e. Harry Potter and its derivatives presumably appear a million times in the training set, and it's hard to imagine a model that could discuss this cultural phenomenon well without knowing quite a bit about the source material.

aprilthird2021 · 1h ago
Did you read the article? This exact point is made and then analyzed.

> Or maybe Meta added third-party sources—such as online Harry Potter fan forums, consumer book reviews, or student book reports—that included quotes from Harry Potter and other popular books.

> “If it were citations and quotations, you'd expect it to concentrate around a few popular things that everyone quotes or talks about,” Lemley said. The fact that Llama 3 memorized almost half the book suggests that the entire text was well represented in the training data.

gpm · 1h ago
The article fails to mention or understand the volume of content here. Every, literally every, part of these books is quoted and "talked about" (in the sense of used in unlicensed derivative works).

And yes, I read the article before commenting. I don't appreciate the baseless insinuation to the contrary.

davidcbc · 1h ago
Even assuming you are correct, which I'm skeptical of, does this make it better?

It's essentially the same thing, they are copying from a source that is violating copyright, whether that's a pirated book directly or a pirated book via fanficton.

gpm · 1h ago
Generally I think it matters a great deal to get the facts right when discussing something with nuance.

Is this specific fact required to make my beliefs consistent... Yes I think it is, but if you disagree with me in other ways it might not be important to your beliefs.

Legally (note: not a lawyer) I'm generally of the opinion that

A) Torrenting these books was probably copyright infringement on Meta's part. They should have done so legally by scanning lawfully acquired copies like Google did with Google Books.

B) Everything else here that Meta did falls under the fair use and de minimis exceptions to copyrights prohibition on copying copyrighted works without a license.

And if it was copying significant amounts of a work that appeared only once in its training set into the model the de minimis argument would fall apart.

Morally I'm of the opinion that copyright law's prohibition on deeply interacting with our cultural artifacts by creating derivative works is incredibly unfair and bad for society. This extends to a belief that the communities that do this should not be excluded from technological developments because their entire existence is unjustly outlawed.

Incidentally I don't believe that browsing a site that complies with the DMCA and viewing what it lawfully serves you constitutes piracy, so I can't agree with your characterization of events either. The fanfiction was not pirated just because it was likely unlawful to produce in the US.

1123581321 · 1h ago
Agreed. It’s an obtuse quote by Lemley who can’t picture the enormous quantity of associations and crawled data, or at least wants to minimize the quantity. It’s hardly discussion-ending.

Accusations of not reading the article are fair when someone brings up a “related” anecdote that was in the article. It’s not fair when someone is just disagreeing.

choeger · 17m ago
LLMs are to a certain degree compressed databases of their training data. But 42% is a surprisingly large number.
BUFU · 26m ago
Would it be possible that other people posted content from the Harry Potter book online and the model developers scraped that information? Would the model developers be at fault in this scenario?
timeon · 8m ago
I think this is good question. At least for LLMs in general. However we know that Meta used pirated torrents.
Javantea_ · 1h ago
I'm surprised no one in the comments has mentioned overfitting. Perhaps this is too obvious but I think of it as a very clear bug in a model if it asserts something to be true because it has heard it once. I realize that training a model is not easy, but this is something that should've been caught before it was released. Either QA is sleeping on the job or they have intentionally released a model with serious flaws in its design/training. I also understand the intense pressure to release early and often, but this type of thing isn't a warning.
numpad0 · 39m ago
It's apparently known among LLM researchers that the best epoch count for LLM training is one. They go through the entire dataset once, and that makes the best LLMs.

They know. LLM is a novel compression format for text(holographic memory or whatever). The question is whether the rest of the world accept this technology as it is or not.

Tepix · 47m ago
I think part of the problem is that the book is in the training set multiple times
asciisnowman · 2h ago
On the other hand, it’s surprising that Llama memorized so much of Harry Potter and the Sorcerer's Stone.

It's sold 120 million copies over 30 years. I've gotta think literally every passage is quoted online somewhere else a bunch of times. You could probably stitch together the full book quote-by-quote.

davidcbc · 1h ago
If I collect HP quotes from the internet and then stitch them together into a book, can I legally sell access to it?
bitmasher9 · 2h ago
Probably not?

Sure, there are just ~75,000 words in HP1, and there are probably many times that amount in direct quotes online. However, the quotes aren't evenly distributed across the entire text. For every quote of charming the snake in a zoo there will be a thousand of "you're a wizard, Harry", and those are two prominent plot points.

I suspect the least popular of all direct quotes from HP1 aren't using the quotes under fair use, and are just replicating large sections of the novel.

Or maybe it really is just so popular that super nerds have quoted the entire novel arguing about the aspects of wand making, or the contents of every lecture.

tjpnz · 56m ago
How many could do it from memory?
mvdtnz · 2h ago
But also, we know for a fact that Meta trained their models on pirated books. So there's no need to invent a hare-brained scheme of stitching together bits and pieces like that.
dankwizard · 2h ago
I can recall about 12% of the first Harry Potter book so it's interesting to see Llama is only 4x smarter than me. I will catch up.
hsbauauvhabzb · 2h ago
How many r’s are there in strawberry?
jofzar · 2h ago
There are 3 R's in strawberry just like in Harry Potter!
graphememes · 2h ago
I really wish we could get rid of copyright. It's going to hold us back long term.
bitmasher9 · 2h ago
We cannot get rid of it without finding a way to pay the creators who generate copyrighted works.

I'm personally more in favor of significantly reducing the length of the copyright. I think 20-30 years is an interesting range. Artists get roughly a career's length of time to profit off their creations, but there is much less incentive for major corporations to buy and hoard IP.

atrus · 2h ago
We barely pay creators as it is for generating copyrighted works. Nearly every copyrighted work is available on the internet, for free, right now. And creators are still getting paid, albeit poorly, but that's a constant throughout history.
Tepix · 42m ago
How does that favor a longer copyright? It’s not like these old works make a lot of money (with very few exceptions). And making money after 30 years is hardly a motivating factor.
jMyles · 55m ago
I do not think it's creators that are the constituency holding up deprecation.

As a full-time professional musician, I'm convinced I'll benefit much more from its deprecation than continuing to flog it into posterity. I don't think I know any musicians who believe that IP is career-relevant for them at this point.

(Granted, I play bluegrass, which has never fit into the copyright model of music in the first place)

JoshTriplett · 2h ago
I do too. But in the meantime, as long as it continues being used against anyone, it should be applied fairly. As long as anyone has to respect software licenses, for instance, then AIs should too. It doesn't stop being a problem just because it's done at larger scale.
numpad0 · 1h ago
Sure, you just get constantly sued for obstruction of business instead, and there will be no fair use clauses, free software licenses, or right to repair to fight back. It'll be all proprietary under NDA. Is that what you want?
bradley13 · 1h ago
Many people could also produce text snippets from memory. I dispute that reading a book is a copyright violation. Copying and distributing a book, yes, but just reading it - no.

If the book was obtained legitimately, letting an LLM read it is not an issue.

riffraff · 51m ago
It is well reported that Meta (and OpenAI, and basically everyone) trained on content obtained via piracy (LibGen).
bjornsing · 1h ago
It’s well-known that John von Neumann had this ability too:

Herman Goldstine wrote "One of his remarkable abilities was his power of absolute recall. As far as I could tell, von Neumann was able on once reading a book or article to quote it back verbatim; moreover, he could do it years later without hesitation. He could also translate it at no diminution in speed from its original language into English. On one occasion I tested his ability by asking him to tell me how A Tale of Two Cities started. Whereupon, without any pause, he immediately began to recite the first chapter and continued until asked to stop after about ten or fifteen minutes."

Maybe it’s just an unavoidable side effect of extreme intelligence?

htk · 2h ago
Hmm, couldn't this be used as a benchmark for quantization algorithms?
WhatsName · 14h ago
Given the method and how the English language works, isn't that the expected outcome for any text that isn't highly technical?

Guess the next word: Not all heroes wear _____

aspenmayer · 12h ago
As there is no reason to believe that Harry Potter is axiomatic to our culture in the way that other concepts are, it is strange to me that the LLMs are able to respond in this way, and not at all expected. Why do you think this outcome is expected? Are the LLMs somehow encoding the same content in such a way that they can be prompted to decode it? Does it matter legally how LLMs are doing what they do technically? This is pertinent to the court case that Meta is currently party to.

https://en.wikipedia.org/wiki/Artificial_intelligence_and_co...

> See for example OpenAI's comment in the year of GPT-2's release: OpenAI (2019). Comment Regarding Request for Comments on Intellectual Property Protection for Artificial Intelligence Innovation (PDF) (Report). United States Patent and Trademark Office. p. 9. PTO–C–2019–0038. “Well-constructed AI systems generally do not regenerate, in any nontrivial portion, unaltered data from any particular work in their training corpus”

https://copyrightalliance.org/kadrey-v-meta-hearing/

> During the hearing, Judge Chhabria said that he would not take into account AI licensing markets when considering market harm under the fourth factor, indicating that AI licensing is too “circular.” What he meant is that if AI training qualifies as fair use, then there is no need to license and therefore no harmful market effect.

I know this is arguing against the point that this copyright lobbyist is making, but I hope so much that this is the case. The “if you sample, you must license” precedent was bad, and it was an unfair taking from the commons by copyright holders, imo.

The paper this post is referencing is freely available:

https://arxiv.org/abs/2505.12546

evertedsphere · 3h ago
what is that bar (= token span) on the right that's common to the first three models?
deafpolygon · 16h ago
It will generate a correct next token 42% of the time when prompted with a 50 token quote.

Not 42% of the book.

It's a pretty big distinction.

j16sdiz · 2h ago
next _50_ tokens 42% of the time

not just next token.

This is like: tell it a random sentence from the book, and it will give you the next sentence 42% of the time.
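For what it's worth, the paper's bookkeeping over 50-token windows can be sketched in a few lines. This is a minimal illustration of how per-token probabilities compound into a whole-continuation probability; the toy scorer and the 0.97 figure are made up for illustration, not taken from the paper or from Llama's real logits:

```python
import math

# Toy stand-in for a real model's next-token log-probabilities.
# A real measurement would read these off an open-weight model's logits;
# the constant 0.97 here is invented purely to show the compounding.
def toy_token_logprob(prefix_tokens, token):
    return math.log(0.97)

def continuation_probability(prefix_tokens, continuation_tokens):
    """P(exact reproduction of the continuation) is the product of
    per-token probabilities, i.e. exp of the summed logprobs."""
    logp = 0.0
    ctx = list(prefix_tokens)
    for tok in continuation_tokens:
        logp += toy_token_logprob(ctx, tok)
        ctx.append(tok)
    return math.exp(logp)

# Even 97% per-token confidence compounds to only ~22% over 50 tokens,
# so clearing a 50% bar on a 50-token span requires per-token
# probabilities around 0.99 or higher.
p50 = continuation_probability(["tok"] * 50, ["tok"] * 50)
```

The point of the sketch: "reproduces a 50-token excerpt with >50% probability" is a demanding bar, because 50 near-certain token predictions have to line up in a row.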

deviation · 16h ago
A... massive distinction.
asplake · 16h ago
“… well enough to reproduce 50-token excerpts at least half the time”
chiph2o · 11h ago
This means that, for 42% of 50-token passages in the book, the model can reproduce the following 50 tokens (with greater than 50% probability) when given the preceding text.

What is the distinction between understanding and memorization? What is the chance that understanding results in memorization (maybe in the case of humans)?

No comments yet

aspenmayer · 18h ago
https://archive.is/OSQt6

If you've seen as many magnet links as I have, and your subconscious is similarly primed with the foreknowledge that Meta used torrents to download/leech (and possibly upload/seed) the dataset(s) used to train their LLMs, you might scroll down to the first picture in this article from the source paper and find that the chart uncannily resembles a common visual representation of torrent block download status.

Can't unsee it. For comparison (note the circled part):

https://superuser.com/questions/366212/what-do-all-these-dow...

Previously, related:

Extracting memorized pieces of books from open-weight language models - https://news.ycombinator.com/item?id=44108926 - May 2025

No comments yet

giardini · 3h ago
As I've said several times, the corpus is key: LLMs thus far "read" most anything, but should instead have well-curated corpora. "Garbage In, Garbage Out!(GIGO)" is the saying.

While the Harry Potter series may be fun reading, it doesn't provide information about anything that isn't better covered elsewhere. Leave Harry Potter for a different "Harry Potter LLM".

Train scientific LLMs to the level of a good early 20th century English major and then use science texts and research papers for the remainder.

esafak · 3h ago
That's got nothing to do with it. It's all about copyright. Can it reproduce its training data verbatim? If so, Meta is in hot water.
strangescript · 2h ago
I read Harry Potter, you ask me about a page, and I recite it verbatim. Did I just commit copyright infringement?
bitmasher9 · 2h ago
I pay for a service. The service recites a novel to me. The service would need permission to do this or it is copyright infringement.
lucianbr · 2h ago
Are you selling your ability to recite stuff? Then certainly.
strangescript · 2h ago
there are plenty of open source LLMs trained on harry potter, is that fine?
__loam · 2h ago
This is an extremely common strawman argument. We're not discussing human memory.
Jap2-0 · 2h ago
> While the Harry Potter series may be fun reading, it doesn't provide information about anything that isn't better covered elsewhere

To address this point, and not other concerns: the benefits would be (1) pop culture knowledge and (2) having a variety of styles of edited/reasonably good-quality prose.

alephnerd · 3h ago
> While the Harry Potter series may be fun reading, it doesn't provide information about anything that isn't better covered elsewhere

It has copyright implications - if Claude can recollect 42% of a copyrighted product without attribution or royalties, how did Anthropic train it?

> Train scientific LLMs to the level of a good early 20th century English major and then use science texts and research papers for the remainder

Plenty of in-stealth companies approaching LLMs via this approach ;)

For those of us who studied the natural sciences and CS in the 2000s and early 2010s, there was a bit of a trend where certain PIs would simply translate German and Russian papers from the early-to-mid 20th century and attribute them to themselves in fields like CS (especially in what became ML).

epgui · 2h ago
> It has copyright implications - if Claude can recollect 42% of a copyrighted product without attribution or royalties, how did Anthropic train it?

Personally I’m assuming the worst.

That being said, Harry Potter was such a big cultural phenomenon that I wonder to what degree might one actually be able to reconstruct the books based solely on publicly accessible derivative material.

weird-eye-issue · 3h ago
Why are you talking about Claude and Anthropic?
cshimmin · 2h ago
It’s not unreasonable to suspect they are doing the same. The article starts with a description of a lawsuit the NY Times brought against OpenAI for similar reasons. The big difference is that the research presented here is only possible with open-weight models. OAI and Anthropic don’t make the base models available, so it’s easier to hide the fact that you’ve used copyrighted material via instruction post-training. And I’m not sure you can get the logprobs for specific tokens from their APIs either (which is what the researchers did to make the figures and come up with a concrete number like 42%).
alephnerd · 1h ago
Good call! I brain farted and wrote Claude/Anthropic instead of Meta/Llama.
ninetyninenine · 3h ago
So if I memorized Harry Potter the physical encoding which definitely exists in my brain is a copyright violation?
dvt · 2h ago
> the physical encoding which definitely exists in my brain is a copyright violation

First of all, we don't really know how the brain works. I get that you're being a snarky physicalist, but there are plenty of substance dualists, panpsychists, etc. out there. So, some might say, this is a reductive description of what happens in our brains.

Second of all, yes, if you tried to publish Harry Potter (even if it was from memory), you would get in trouble for copyright violation.

ninetyninenine · 2h ago
Right, but the physical encoding already exists in my brain, or how else could I reproduce it in the first place? We may not know how the encoding works, but we do know that an encoding exists, because a decoding is possible.

My question is… is that in itself a violation of copyright?

If not, then as long as LLMs don’t make a publication, it shouldn’t be a copyright violation, right? Because we don’t understand how it’s encoded in LLMs either. It is literally the same concept.

Jaygles · 2h ago
To me the primary difference between the potential "copy" that exists in your brain and a potential "copy" that exists in the LLM, is that you can't make copies and distribute your brain to billions of people.

If you compressed a copy of HP as a .rar, you couldn't read that as is, but you could press a button and get HP out of it. To distribute that .rar would clearly be a copyright violation.

Likewise, you can't read whatever of HP exists in the LLM model directly, but you seemingly can press a bunch of buttons and get parts of it out. For some models, maybe you can get the entire thing. And I'm guessing you could train a model whose purpose is to output HP verbatim and get the book out of it as easily as de-compressing a .rar.

So, the question in my mind is, how similar is distributing the LLM model, or giving access to it, to distributing a .rar of HP? There's likely a spectrum of answers depending on the LLM.
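The .rar analogy can be made concrete with any lossless codec: the compressed blob is unreadable as-is, yet the original is one function call away. A tiny sketch using zlib and a stand-in snippet (the text here is a placeholder, not the book):

```python
import zlib

# Stand-in snippet; imagine the full book text here.
text = b"Mr. and Mrs. Dursley, of number four, Privet Drive, were proud to say..."

# The compressed blob is opaque bytes, like weights holding a memorized passage.
blob = zlib.compress(text)

# "Press a button and get it out": a lossless, byte-exact roundtrip.
restored = zlib.decompress(blob)
```

The open question the comment above raises is where on that spectrum an LLM sits, since extraction from a model is probabilistic and partial rather than a guaranteed byte-exact roundtrip.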

ninetyninenine · 1h ago
> that exists in the LLM, is that you can't make copies and distribute your brain to billions of people.

I can record myself reciting the full Harry Potter book then distribute it on YouTube.

Could do the exact same thing with an LLM. The potential for distribution exists in both cases. Why is one illegal and the other not?

Jaygles · 42m ago
> I can record myself reciting the full Harry Potter book then distribute it on YouTube.

At this point you've created an entirely new copy in an audio/visual digital format and took the steps to make it available to the masses. This would almost certainly cross the line into violating copyright laws.

> Could do the exact same thing with an LLM. The potential for distribution exists in both cases. Why is one illegal and the other not?

To my knowledge, the legality of LLMs are still being tested in the courts, like in the NYT vs Microsoft/OpenAI lawsuit. But your video copy and distribution on YouTube would be much more similar to how LLMs are being used than your initial example of reading and memorizing HP just by yourself.

davidcbc · 1h ago
> I can record myself reciting the full Harry Potter book then distribute it on YouTube

Not legally you can't. Both of your examples are copyright violations

briffid · 23m ago
Recording yourself is not a violation; only publishing it on YouTube is. Content generated with LLMs is not a violation. Publishing the content you generated might be.
numpad0 · 1h ago
copyright is actually not so much about the right to copy as it is about redistribution permissions.

if you trained an LLM on real copyrighted data, benchmarked it, wrote up a report, and then destroyed the weights, that's transformative use and legal in most places.

if you then put up that gguf on HuggingFace for anyone to download and enjoy, well... IANAL. But maybe that's a bit questionable, especially long term.

bitmasher9 · 2h ago
I don’t think the lawyers are going to buy arguments that compare LLMs with human biology like this.
lithiumii · 3h ago
You are not selling or distributing copies of your brain.
harry8 · 2h ago
If you perform it from memory in public without paying royalties then yes, yes it is.

Should it be? Different question.

JKCalhoun · 2h ago
The end of "Fahrenheit 451" set a horrible precedent. Damn you, Bradbury!
beowulfey · 2h ago
Only if you charge someone to reproduce it for them
shrewduser · 3h ago
maybe if you rewrote it from memory.
teaearlgraycold · 3h ago
I think humans get a special exception in cases like this