Meta's Llama 3.1 can recall 42 percent of the first Harry Potter book

14 points by zdw · 5 comments · 6/16/2025, 1:44:58 AM · understandingai.org ↗

Comments (5)

evertedsphere · 6m ago
What is that bar (= token span) on the right that is common to the first three models?
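
(For readers puzzling over the figure: the article measures memorization span by span rather than over the whole book. Below is a minimal sketch of that kind of measurement, assuming a sliding-window prefix/suffix setup; the model name, window sizes, stride, and the 0.5 probability threshold are illustrative assumptions, not the study's exact parameters.)

    # Hypothetical sketch of span-level memorization measurement:
    # slide a window over the book's tokens, condition the model on a
    # prefix, and count suffixes the model assigns high probability to.
    # MODEL, window sizes, STRIDE, and the 0.5 threshold are assumptions.
    import math
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    MODEL = "meta-llama/Llama-3.1-70B"   # stand-in; any causal LM works
    PREFIX_LEN, SUFFIX_LEN, STRIDE = 50, 50, 10

    tok = AutoTokenizer.from_pretrained(MODEL)
    model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16)
    model.eval()

    def suffix_logprob(ids: torch.Tensor, split: int) -> float:
        """Log-probability the model assigns to ids[split:] given ids[:split]."""
        with torch.no_grad():
            logits = model(ids.unsqueeze(0)).logits[0]
        logprobs = torch.log_softmax(logits[:-1].float(), dim=-1)
        # logits[i] predicts token i+1, so target token j lives at row j-1
        per_token = logprobs.gather(1, ids[1:].unsqueeze(1)).squeeze(1)
        return per_token[split - 1:].sum().item()

    def memorized_fraction(text: str, threshold: float = 0.5) -> float:
        """Fraction of sliding windows whose suffix probability exceeds threshold."""
        ids = tok(text, return_tensors="pt").input_ids[0]
        window = PREFIX_LEN + SUFFIX_LEN
        hits = total = 0
        for start in range(0, len(ids) - window, STRIDE):
            span = ids[start : start + window]
            total += 1
            if suffix_logprob(span, PREFIX_LEN) > math.log(threshold):
                hits += 1
        return hits / max(total, 1)

Under a definition like this, "can recall 42 percent" means 42 percent of such spans cleared the probability threshold, not that the model emits the whole book verbatim in one pass.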
giardini · 28m ago
As I've said several times, the corpus is key: LLMs thus far "read" almost anything, but they should instead be trained on well-curated corpora. "Garbage in, garbage out" (GIGO), as the saying goes.

While the Harry Potter series may be fun reading, it doesn't provide information about anything that isn't better covered elsewhere. Leave Harry Potter for a different "Harry Potter LLM".

Train scientific LLMs to the language level of a good early-20th-century English major, then use science texts and research papers for the remainder.

alephnerd · 20m ago
> While the Harry Potter series may be fun reading, it doesn't provide information about anything that isn't better covered elsewhere

It has copyright implications: if Claude can recall 42% of a copyrighted work without attribution or royalties, how did Anthropic train it?

> Train scientific LLMs to the level of a good early 20th century English major and then use science texts and research papers for the remainder

Plenty of in-stealth companies are taking this approach to LLMs ;)

For those of us who studied the natural sciences and CS in the 2000s and early 2010s, there was a bit of a trend in which certain PIs would simply translate early-to-mid-20th-century German and Russian papers and attribute the work to themselves, especially in fields like CS (and in particular what became ML).

ninetyninenine · 1m ago
So if I memorized Harry Potter, would the physical encoding that definitely exists in my brain be a copyright violation?
weird-eye-issue · 4m ago
Why are you talking about Claude and Anthropic?