Let me make a clarifying statement since people confuse (purposely or just out of ignorance) what violating copyright for AI training can refer to:
1. Training AI on freely available copyrighted material - Legally ambiguous, not really tested in court. AI doesn't actually directly copy the material it trains on, so it's not an easy ruling to make.
2. Circumventing payment to obtain copyrighted material for training - Unambiguously illegal.
Meta is charged with doing the latter, but it seems the plaintiffs want to also tie in the former.
dragonwriter · 2h ago
> Circumventing payment to obtain copyrighted material for training - Unambiguously illegal.
The judge in this case seems to disagree with you, not accepting the premise that downloading the material from pirate sites for this use inherently gets the plaintiffs an out from having to address the fair use defense as to the actual use.
> the plaintiffs want to also tie in the former.
No, the defense wants to and the judge hasn't let the plaintiffs avoid it the way you argue they automatically can.
aidenn0 · 2h ago
> The judge in this case seems to disagree with you, not accepting the premise that downloading the material from pirate sites for this use inherently gets the plaintiffs an out from having to address the fair use defense as to the actual use.
This is a good point. As a reminder, the Folsom factors (failing or passing any one is not conclusive; they are to be considered holistically) are:
- the purpose and character of the use, including whether such use is of a commercial nature or is for nonprofit educational purposes (Note also that whether or not the use is transformative is part of this test).
- the nature of the copyrighted work
- the amount and substantiality of the portion used in relation to the copyrighted work as a whole
- the effect of the use upon the potential market for or value of the copyrighted work
If the former ever gets tested in court, it's the end of the road. All major AI companies have trained on copyrighted work, one way or another.
What is inspiration? What is imitation? What is plagiarism? The lines aren't clearly drawn for humans... much less for LLMs.
ekidd · 3h ago
> If the former ever gets tested in court, it's the end of the road. All major AI companies have trained on copyrighted work, one way or another.
I can absolutely guarantee you that neither DeepSeek nor Alibaba's highly talented Qwen group will care even a little bit, in the long run. Not if there's value to be had in AI. (And I can tell you down to the dollar what LLMs can save in certain business use cases.)
If the US decides to unilaterally shut down LLMs, that just means that the rest of the world will route around us. Whether this is good or bad is another question.
BobaFloutist · 44m ago
Or AI companies could use some of their vast reserves of cash to pay for licensing agreements and pay people for their fucking intellectual property, then feed it to the beast.
But then they'd have to actually communicate with people and negotiate consent instead of just hoovering up everything they can get their hands on in their quest to replace it.
flessner · 2h ago
The pattern hasn't changed in decades. Remember when Huawei copied Cisco's router code so precisely they included the same bugs and documentation typos?
LLMs are a drop in the bucket compared to the countless other factors in why the world is already routing around the US - but I don't want to get political or economic about it.
shrubhub · 1h ago
What's the point of being proud of one system of government if you're willing to relinquish it in the face of an adversary?
Shouldn't they have to follow the law?
vkou · 54m ago
> If the US decides to unilaterally shut down LLMs, that just means that the rest of the world will route around us.
You're talking as if they are some kind of nationalized or publicly-owned asset, as opposed to a bunch of for-profit, privately-owned silos.
Sophira · 47m ago
Local models are a thing, though. You can run DeepSeek on your local computer.
Even if ChatGPT, Huggingface, etc. died, we would still have the models and we would still be able to run them.
jamiek88 · 2h ago
> And I can tell you down to the dollar what LLMs can save in certain business use cases.)
Please do!!
theturtletalks · 3h ago
China found the perfect way to disrupt US tech: releasing open source versions of it for free, or at least cheaper. Most US tech is built on open source anyway, and at the pace YC is investing in open source alternatives, open source will win out in most niches.
My fear is that the US tech won’t be able to compete with state sponsored open source out of China and will move to ban open source or suppress it somehow.
ekidd · 2h ago
Also, the Chinese work is legit. DeepSeek introduced a whole bag of new techniques like GRPO, and released quite a bit of good open source tooling.
And Alibaba's Qwen team seems to be quite genuinely talented at "small" models, 32B parameters and below. Once you get Qwen3 properly configured, it punches well above its "weight class." I'm still running real benchmarks, but subjectively, it feels like the 32B model performs somewhere between 4o-mini and 4o on "objectively measurable" tasks. It's a little "stodgy" and formal by default, though. We'll see what it looks like when people start fine-tuning it.
If the US dropped off the planet, it would maybe set LLM technology back a year.
theturtletalks · 2h ago
Deepseek really changed how people think about Chinese tech. Even after new LLMs launched, Deepseek R1 and V3 hold their own on benchmarks and are significantly cheaper.
serial_dev · 2h ago
You point to Chinese companies disregarding any rules if there is value to be had in AI, while in the US, AI companies are going to get a $500 billion investment and a whistleblower is dead.
US AI companies will either make sure that a similar ruling will never be made or they will ignore it and pay the fines. They won't let anybody stop the gravy train.
dragonwriter · 2h ago
> If the former ever gets tested in court, it's the end of the road. All major AI companies have trained on copyrighted work, one way or another.
You assume that getting tested means the AI trainers lose, and also that the model architectures that have been developed can't be retrained from scratch with public domain, owned, and purpose-licensed material. (Several AI companies have been actively pursuing deals to license content for AI training for a while now.)
diggan · 2h ago
> If the former ever gets tested in court, it's the end of the road. All major AI companies have trained on copyrighted work, one way or another.
End of the road for major AI companies, and hopefully something better can be created once it's declared illegal without any murky waters.
There are LLMs trained on data that isn't illegally obtained; OLMo by Ai2 is one such model, which is actually open source and uses open data for training. That it's "very difficult" for OpenAI et al. shouldn't be an argument against forcing them to behave ethically anyway. If they cannot survive acting legally, then so be it; sucks for them.
nradov · 2h ago
That would hardly be the end of the road. If copyright enforcement gets stricter then that will give a market advantage to the largest, best funded major AI companies like OpenAI because they can afford to simply buy licenses from copyright holders. I predict that we'll see new middlemen arise specifically to handle this licensing, much like the agencies that handle most music licensing today.
nickpsecurity · 3h ago
The FairTrained models claim to train with only public domain and legal works. Companies are also licensing works. This company has a lawful foundation model:
So, it's really the majority of companies, the ones breaking the law, who will be affected. Companies using permissible and licensed works will be fine. The other companies would finally have to buy large collections of content, too. Their billions will have to go to something other than GPUs.
bilbo0s · 3h ago
I don't know?
Not really sure a claim is good enough. I don't know that you can just go into court and say, "Trust me, I don't use copyrighted material."
And I also can't see any way, other than providing training data and training an identically structured model on that data, that a company can conclusively show that they got the weights in an allegedly copyright free model from the copyright free training data a company provides.
317070 · 3h ago
I do hope people are still innocent until proven guilty?
If you did not use copyrighted materials for training, people will not be able to prove that you did, and that should be good enough.
lelanthran · 2h ago
> I do hope people are still innocent until proven guilty?
It's a civil matter, not a criminal matter, so that doesn't apply.
nickpsecurity · 2h ago
While the others are correct, I'm with you in the sense that I don't know if what they claim is true. I've also found others, like one in Singapore, whose training data turned out not to be as legal as news reports claimed. It might turn out to have problems.
There is benefit to using them, though. For one, they've tried really hard to be legal. That sets a positive example, shows good faith if they were sued, and reduces risk for those using them (good faith on our part). Also, one can be sure that they can ditch or replace any outputs in the long term if they're ruled illegal. So, we try not to use the A.I.'s in a way where losing access to them seriously damages our business.
That's the best I can offer until legal reforms happen.
If training your own model, you can train it in Singapore on material you have legal access to. Their law pretty much lets you use anything for AI purposes so long as you can legally access it yourself. To further reduce the risk, you should crawl it yourself, too, taking care to avoid risky sources.
hobs · 3h ago
Civil courts work by you proving damages (at least in the USA), not by you going on fishing expeditions because they "might" have done something.
So good luck finding the thing that looks exactly like your copyrighted work when it's supposedly not in the corpus; if you can, then yeah, you might be able to prove it.
At the end of the day it's like a lot of business, where a liability shell game is played out, and if the chain of evidence can't be drawn clearly, then lawsuits would be frivolous at best.
vkou · 2h ago
If corporations owned human slaves and fed them copyrighted materials so that they were inspired to produce original creative output, I don't think that creative output should enjoy legal protections either. Even if slavery were not illegal.
Because the obvious question would be - how can free people compete with that?
Yizahi · 2h ago
The whole point of the fair use clauses is to protect humans. We could simply say that programs are excluded altogether in favor of humans, and it would be the proper thing to do, until the first real AI is built.
triceratops · 3h ago
> AI doesn't actually directly copy the material it trains on
Of course it does. Large models are trained on gigantic clusters. How can you train without copying the material to machines in the cluster?
thethimble · 3h ago
“Copy” is ambiguous here. Of course data is copied during training. That said, OP is referring to whether the resulting model is able to produce verbatim copies of the data.
xyzzy_plugh · 3h ago
Why does it have to be verbatim? Seriously, this I don't understand.
If I produce a terrible shakycam recording of a film while sitting in a movie theater, it's not a verbatim copy, nor is it even necessarily representative of the original work -- muddied audio, audience sounds, cropped screen, backs of heads -- and yet it would be considered copyright infringement?
How many times does one need to compress the JPEG before it's fair use? I'm legitimately curious what the test is here.
stevenAthompson · 2h ago
The purpose of copyright is to progress the arts and sciences, not to guarantee profit. Guaranteeing profit is just the way we encourage people to progress the arts and sciences.
That is why so called derivative works are allowed (and even encouraged). If copyrighted material is ingested, modified or enhanced to add value, and then regurgitated, that is legal, whereas copying it without adding value is not.
If derivative works weren't deemed acceptable, copyright would have the opposite of its intended effect and become an impediment to progress.
moregrist · 1h ago
> That is why so called derivative works are allowed (and even encouraged).
Derivative works are not given a free pass from the normal constraints of copyright. You cannot legally publish books in the universe of A Song of Ice and Fire without permission from the author (and often publisher), calling them “derivative works.”
It’s why fan fiction is such a gray area for copyright and why some publishers have historically squashed it hard.
The exceptions for this are typically fair use, which requires multi-factor analysis by the judiciary and is typically decided on a case-by-case basis.
throwaway290 · 1h ago
Derivative works are not generally allowed in many jurisdictions. Try releasing a cover song without clearing it first etc. Even using a recognizable sample will bite you
Derivative works are tolerated in some cases, like some manga or fanfics, but it is a gray area, and whenever the author or publisher wants to pursue it, it is their full right to do so. Many do pursue it.
(You can get inspired by something, and this is where some arguments can happen if you get inspired a bit too literally, but no one will say with a straight face that inspiration is a thing that happens to software.)
moregrist · 1h ago
> Try releasing a cover song without clearing it first etc. Even using a recognizable sample will bite you
So… it’s complicated. This is one of the weird areas where music copyright and other copyright seem to differ in the US.
In the US the situation is complex and there are a lot of weird special interests [0], but generally the composer/author of a song has the right to decide who first records and releases it; after the first recording, covers require a mechanical license, which is compulsory (i.e., the author cannot object).
In music there are _a lot_ of special cases and different rights are decided with different kinds of licenses, some of which are compulsory. I think it’s an area that doesn’t make for good analogies with copyright in other media.
You are absolutely right. I should have phrased that differently. Derivative work is a legal term, but I misused it above. I should have either used another term or been clearer.
If the work is "derivative" in the legal sense it is copyrighted, and you may not create derivative works without the copyright holders permission.
What I should have said is that simply being inspired by a work or copying unprotectable elements (like facts or ideas) does not create a derivative work.
For example, if ChatGPT were to generate Star Wars, except with Dookies instead of Wookies, that might be illegal. If it were to learn what a spaceship is from Star Wars and then create something substantially new, it would not. The key is that it must not be substantially similar to the original. You must add enough value that it becomes something new, not just rehash the original.
throwaway290 · 1h ago
Learning is like inspiration: it's something humans do. Don't get fooled by the "learning" in "machine learning"; it doesn't mean the machine becomes like a human in the legal sense...
__loam · 1h ago
You do not understand fair use lol
nilsbunger · 3h ago
There’s something called a substantial transformation test in copyright law. When you write a summary of a book, you don’t infringe on copyright because it’s a “substantial transformation”. This goes along with the idea that you can copyright the text but not the ideas it expresses.
When model training reads the text and creates weights internally, is that a substantial transformation? I think there’s a pretty strong argument that it is.
TheOtherHobbes · 2h ago
No transformation is needed.
The point here is that book files have to be copied before they can be used for training. Copyright texts typically say something like "No unauthorised copying or transmission in any form (physical, electronic, etc.)"
Individuals who torrented music and video files have been bankrupted for doing exactly this.
The same laws should apply when a corporation downloads torrent files. What happens to them after they're downloaded is irrelevant to the argument.
If this is enforced (still to be seen...) it would be financially catastrophic for Meta, because there are statutory damages for works that have been registered for copyright protection - which most trad-pubbed books, and many self-pubbed books, are.
leovander · 2h ago
> have been bankrupted for doing exactly this.
Only if they seeded the data and some other entity downloaded it, i.e. they hosted the data. In a previous article I believe it was called out that Meta was a leecher (not seeding back what it downloaded).
It's the hosting that gets you, not the act of downloading it.
stevenAthompson · 1h ago
I would like to expand on this, since it seems to be a common misunderstanding. Let's imagine a hypothetical situation where one friend loans a book to another, who then makes a copy of it.
The lender owns the book, and it is within his rights to loan it to whoever he wants. That is legal. Making this illegal would end libraries.
The borrower is well within his rights to accept the book, and as the current owner he is even allowed to make a copy of the book (see the famous TiVo case). Making this illegal would end backups and format/time shifting.
When the borrower returns the book, he keeps the copy. Oh no! Surely he must now become a criminal? Nope. Possessing an unauthorized copy is also not illegal, despite what many copyright holders would like you to believe. Making this illegal would also criminalize a lot of legitimate format/time shifting; again, see the famous TiVo case.
If the borrower were to loan his homemade copy to someone else THEN it would finally become illegal.
Nothing about AI changes any of this.
leovander · 1h ago
Don't read too much into what I am saying. I am not even talking about the AI piece.
I download a torrent with a movie that I didn't pay for. If I don't allow it to seed, then I don't get in trouble. If I let it seed, either during the download process or after, I'd get a DMCA notice if that torrent/magnet link was being tracked.
I don't need a hypothetical book, that is just how it works if I were to download illegally obtained documents/media.
As technical as people are in this thread, it's easy to tell which folks didn't have their parents wondering why they were getting scary letters from the ISP.
triceratops · 1h ago
Do you have any case law (other than TiVo or VHS time-shifting) that relates directly to books?
throwaway290 · 1h ago
If you made a durable copy of a book in your example to keep for yourself and use later, that's already a grey area. But no one does that with books. People do it with other media though, and, big surprise, they get prosecuted for it. As you may know, in developed countries people get served notices for torrenting.
But if you make a book's contents available online via some service that regurgitates those contents, you would be totally sus, because you could be considered to be in the business of selling derivative works.
jayd16 · 2h ago
This is a leap in the argument. We've gone from the right to use a work to "unless the result is identical or close to it, we have full rights to all works."
Seems like a big gap there.
spwa4 · 1h ago
It's COPYright. It has to be very close to the original to be covered by copyright. Hence the name.
__loam · 1h ago
They copied the work when they made the training set.
ivell · 1h ago
My understanding is copyright is about distribution rights and not making a copy. Seeding falls under distribution.
triceratops · 1h ago
Your understanding is incorrect.
mrgoldenbrown · 2h ago
Even if you argue the LLMs are merely summarizing content, they still had to illegally download that content in the first place. The model can't read and summarize the texts unless the text was illegally downloaded and copied. Piracy isn't suddenly legal just because you promise to delete the movie you downloaded after watching it.
triceratops · 2h ago
The counterargument to that is model training is impossible without making copies. That's not true for humans.
Workaccount2 · 2h ago
That's not really true. Models train (in a greatly simplified way) by being shown an excerpt and being told to guess the next token from the excerpt. They push their weights around until the token they output matches the next token in the excerpt. Then the excerpt is no longer needed. You can think of it like this: the article is loaded, the LLM plays this token-guessing game through it, then the article is discarded. On the face of it this is what happens, but it gets hairier depending on how exactly this process is done. But it is seemingly not far removed from how humans consume content (acquire, read, discard), hence the legal blur.
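For the curious, here's a minimal sketch of that token-guessing game, assuming a PyTorch-style causal language model that returns next-token logits (the names here are illustrative, not any lab's actual pipeline):

```python
import torch.nn.functional as F

def training_step(model, optimizer, token_ids):
    """One gradient update of the next-token guessing game.

    token_ids: (batch, seq) tensor holding one tokenized excerpt per row.
    """
    inputs = token_ids[:, :-1]   # the excerpt, with the token after each position hidden
    targets = token_ids[:, 1:]   # the tokens the model has to guess
    logits = model(inputs)       # (batch, seq-1, vocab) scores for every guess
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           targets.reshape(-1))
    optimizer.zero_grad()
    loss.backward()              # nudge the weights toward better guesses
    optimizer.step()
    return loss.item()           # the excerpt itself is discarded after this
```

Nothing in that loop writes the excerpt into the model as-is; whether the accumulated weight nudges nevertheless amount to a copy is exactly what's being argued in this thread.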
codedokode · 1h ago
It is different thing. When you copy data into computer's RAM, that might be copying as defined in law [1]:
> Using software almost always involves creating copies, even though many of these copies only exist for a very short time. For example, executing a program means copying it from the hard disk into RAM so that the CPU can interpret the instructions. Because of this, the right to run a program is considered to fall under the copyright of the author.
For comparison, when a human looks at the letters, there is no copying.
Also, models can reproduce text verbatim which proves that they store it.
So it is unfair that ordinary folks got sued for this while Zuckerberg wants to get away with a million-times-larger violation. He must go directly to jail.
> by being shown an excerpt [of copyrighted material]
How is this done? Are bits not written into RAM or disk? Are they not sent between machines in a training cluster? That's copying.
> it is seemingly not far removed from how humans consume content
Except that humans don't make full copies to RAM, or disk or paper.
Workaccount2 · 1h ago
There is a bar of usage built into the law; otherwise everyone who reads this Wired article is violating copyright by making a full copy to their computer. Generally, making non-lasting copies is fine; otherwise the internet wouldn't work.
AI doesn't need lasting copies to train, though I don't know what the actual implementation is. But if it's ruled that they can only use copyrighted data that isn't stored for more than the time it would take a human to consume it, it wouldn't really cripple the models, but it would perhaps make training more logistically challenging.
It's important to understand that models are not data archives. They are statistical constructs made from getting quizzed, using human-made content to generate the quiz questions.
triceratops · 1h ago
> otherwise everyone who reads this Wired article is violating copyright by making a full copy to their computer
Wired explicitly sent that article to their computer for the purposes of reading it so it's not a copyright violation.
spwa4 · 1h ago
> Except that humans don't make full copies to RAM, or disk or paper.
Images on your retina form exact copies.
They are scanned and translated into impulses that are then sent to a first set of "neural columns" - that's an exact copy.
This is then connected to the visual cortex by the two highest-bandwidth links in the human body (the "optic nerve" - there are two of them, of course; I've always wondered why everybody insists on using the singular). Why would you have that high-bandwidth link unless to create verbatim copies?
The way those columns are structured also very strongly suggests they make carbon copies, which they then make available on the "brain bridge" (which is probably at least vaguely similar to the "attention matrix" of a transformer). If it does work like that, that's also a verbatim copy.
The only way "humans don't make full copies to RAM" is that humans don't have separate RAM. The processing power is colocated with the processing, even on a microscopic level. You know, what everybody knows is the best way of doing things even in silicon, it's just incredibly impractical if you can't rebuild your circuit every time there's a slight change to the instructions your "computer" carries out (the brain is not a "Von Neumann architecture", except it kind of is when it regrows connections. But in the short term it isn't)
triceratops · 1h ago
> that's an exact copy.
Not for the purposes of copyright law.
> is that humans don't have separate RAM [or disk]
And that turns out to be incredibly important. Humans can't create a lasting, shareable copy of a copyrighted work by consuming it.
realusername · 2h ago
It's also true for humans: you memorize only parts of what you read and see, but you still had to view the whole thing first.
The computer model works differently, of course, but functionally it's the same idea.
__loam · 1h ago
God I hate this conversation so much. These cases have nothing to do with how the brain works.
amelius · 1h ago
> I'm legitimately curious what the test is here.
The test is whether a judge says it is fair use; nothing else.
The judge will take into account the human factor in this matter, e.g. things like who did the actual work and who just used an algorithm (which is not the hard part anymore; the code can be obtained on the internet for free). And we all know that DL is nowhere without huge amounts of data.
meta_ai_x · 2h ago
Completely different scenarios. A pirated movie is marketed/sold as a copy of something, which is not fair use. An LLM just remembers/gets inspired by what it consumes.
xyzzy_plugh · 2h ago
I don't believe that's correct. The existence of filters to block potentially copyrighted materials contained in LLM outputs proves that they don't just "get inspired."
It seems like it is very much a matter of fidelity.
crystal_revenge · 1h ago
> An LLM just remembers/gets inspired by what it consumes
As mentioned in another comment, LLMs (and most popular machine learning algorithms) can be viewed, correctly, as compression algorithms which leverage lossy encoding + interpolation to force a kind of generalization.
Your argument is that a video wouldn't count as pirated if the compression used for the pirated copy was lossy (or at least sufficiently lossy). The closest real-world example would be the cases where someone records the screening of a movie on their phone and then uploads it. Such a copy is lossy enough that you can't produce anything really like the original, but by most definitions it is still considered copyright infringement.
Workaccount2 · 1h ago
They are in no way compression algorithms. They can be modeled like that in the same way you can model humans as lossy compression algorithms.
You would never use a human to backup your financial reports, but the human might be able to give a good overview. You would never use an LLM to backup your financial reports, but they might be able to give a good overview.
AI training data is disposable. There is nothing that could be called a compression algorithm that disposes of all of the data you put into it. AI uses training data as examples of what the next token in a token sequence is. The examples are disposable reference points, not the model itself. That's how you get image models that are 20GB in size despite training on 20PB of data. It's 20PB of examples used to form the shape of a 20GB model. You could show it 5GB of training data or 500EB of training data and it would still be 20GB - because it is not a compression algo, it's a 20GB shape formed by external data.
crystal_revenge · 1h ago
> They are in no way compression algorithms.
I'm sorry, but this a fundamentally incorrect view of machine learning (including, but not limited to transformers).
From an information theoretic perspective the two are essentially identical, with the exception that standard compression algorithms do not have a proper "loss" function other than just trying to minimize reconstruction loss together with the resulting compressed size.
Here's a link to the section on the Wikipedia for more information if you'd like [0]. MacKay's Information Theory, Inference and Learning Algorithms is the standard full text treatment of this topic [1]. Ted Chiang's article "ChatGPT is a Blurry JPEG of the web" is pretty good "pop sci" exploration of this topic if you don't want to get too into the mathematics [2].
> They can be modeled like that in the same way you can model humans as lossy compression algorithms
Humans are totally capable of data compression. This will just devolve into a semantics game of what a data compressor is.
LLMs were not developed to be, do not function as, and are not used as data compression utilities. Please, come knocking when a service provider exists that will use LLMs to compactly store your company data.
zerd · 2m ago
> It has long been established that predictive models can be transformed into lossless compressors and vice versa. Incidentally, in recent years, the machine learning community has focused on training increasingly large and powerful self-supervised (language) models. Since these large language models exhibit impressive predictive capabilities, they are well-positioned to be strong compressors. In this work, we advocate for viewing the prediction problem through the lens of compression and evaluate the compression capabilities of large (foundation) models. We show that large language models are powerful general-purpose predictors and that the compression viewpoint provides novel insights into scaling laws, tokenization, and in-context learning. For example, Chinchilla 70B, while trained primarily on text, compresses ImageNet patches to 43.4% and LibriSpeech samples to 16.4% of their raw size, beating domain-specific compressors like PNG (58.5%) or FLAC (30.3%), respectively.
Transformers are also used in the top algorithm right now on the Large Text Compression Benchmark. https://bellard.org/nncp/nncp.pdf
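To make the paper's claim concrete: a predictive model becomes a lossless compressor because each next-token distribution prices the true token at -log2 p bits, which an arithmetic coder can realize almost exactly. A toy sketch of the bookkeeping (`next_token_probs` is an assumed interface for illustration, not a real library call):

```python
import math

def ideal_compressed_bits(model, token_ids):
    """Shannon code length of a token sequence under a predictive model.

    An arithmetic coder driven by the same model would emit roughly this
    many bits, and decoding would reconstruct the text exactly, bit for bit.
    """
    bits = 0.0
    for i in range(1, len(token_ids)):
        probs = model.next_token_probs(token_ids[:i])  # assumed API: prefix -> distribution
        bits += -math.log2(probs[token_ids[i]])        # well-predicted tokens cost few bits
    return bits
```

This is also why training loss (cross-entropy) and compression ratio are the same quantity in different units: better prediction is better compression.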
crystal_revenge · 43m ago
> LLMs were not developed to be, do not function as, and are not used as data compression utilities.
Again, from an information-theoretic viewpoint, this is exactly what they are doing, how they were developed, and how they function.
I don't know any serious researcher in ML who would find this claim even remotely controversial. It's really not just "a semantics game"; it's part of a foundational understanding of the topic. If you want to understand LLMs from this perspective, a good place to start is with an auto-encoder, which does try to learn a standard compression algorithm, then move on to more sophisticated embedding models (found in a lot of recommender systems), which learn an additional objective on top of minimizing reconstruction error. You'll then see that Transformers and all other major NN architectures fall out of these basic principles.
> Please, come knocking when a service provider exists that will use LLMs to compactly store your company data.
This is literally what every vectordb company does right now, as well as all "chat with your docs" type startups.
Workaccount2 · 15m ago
VectorDBs are not LLMs or SQL replacements, and RAG is not data compression. Again, this is just going to devolve into semantics and where one draws boundaries. I could randomly remove bits from my HDD and call it compression. If you think humans are data compressors then I have no argument.
Can you get transformers to regurgitate information verbatim? Yes.
Would anyone in their right mind rely on a transformer to do so? No.
Would anyone in their right mind rely on a vectorDB to do so? No.
Would anyone in their right mind use a vectorDB/RAG/SQL/transformer combo to do so? Yes.
Is youtube going to drop VP9 for GeminiEncode to save google billions in bandwidth? No.
simion314 · 1h ago
> It's 20PB of examples used to form the shape of a 20GB model. You could show it 5GB of training data or 500EB of training data and it would still be 20GB - because it is not a compression algo, it's a 20GB shape formed by external data.
You can compress 20PB of text to 20GB or even less, if the input is super repetitive.
Same with images: if 50% of the images are cats, then you learn how to represent the cat pixels with a few vectors, and then you can represent all the cats in the world doing all possible cat actions.
But please have the courage to respond to this:
when the AI is caught regurgitating the exact text from a popular book, the exact verses from a poem, the exact code function from some codebase, then how can you defend that it is not memorizing things? If a human used my poem (after they read it) and signed their name under it, would you defend them?
Workaccount2 · 46m ago
The point is that it isn't compression. It's molding a plain structure iteratively into an ultra-complex one. The model starts and ends at 20GB. It might have features that are reminiscent of compression or act like it, but under the hood there is nothing like zip, rar, H.265, or JPEG going on.
And yes, LLMs can recall exact material, but it is excerpts and fragments. There is statistical significance to its ordering. Humans readily do this too (excerpts and fragments); most artists can draw a Batman symbol (but not an episode of Batman). That doesn't in any way mean that artists should not be allowed to ever see a Batman symbol. It means that artists shouldn't be allowed to get paid to draw one. And they are not. And LLMs are not exempt either.
But the fix is output filtering, just like everything else that can violate copyright. Which is already being done (albeit poorly, but way better than 2 years ago), the same as artists will not draw the batman symbol for you despite being able to.
simion314 · 34m ago
OK, so we name it something different: you transform inputs into smaller outputs.
If I make a script, without AI, that transforms someone's poems without permission, such that sometimes it outputs the exact poems and sometimes it gets them wrong: when is my script fair use, and when is what I did illegal? Say my script contains words and matrices of numbers; the original poems are not directly inside, because the script transformed them into vectors.
Maybe even simpler: I create a zip format where I randomly replace words with a synonym, or groups of words with something equivalent. Would you defend this as original? Why are my random transformations not original, while mathematical transformations you will defend?
And how can you suggest putting up output filters that protect only the giants' copyrights while everyone else gets screwed?
Workaccount2 · 8m ago
YouTube actively filters small channel copied content too.
We should be building robust copyright filters and everyone should be able to contribute their work to it.
But that is a different issue from whether or not an LLM is legally allowed to view a work that is publicly available.
Again, pretty much every artist is capable of off-hand copyright violation on the spot. This has been true forever. We don't bar them from seeing art to prevent this.
realusername · 2h ago
It's the same as a book or film review: you can't get the film or the book back from it, but the original material is still needed to produce it.
Needing the original material isn't enough to claim copyright infringement, as we have existing counterexamples.
__loam · 1h ago
Movie reviews are fair use because they don't compete with the original work.
ilikehurdles · 3h ago
If you read a book and later understand its plot but can only explain it in your own words, did you copy it?
The model isn’t storing the book.
mcny · 3h ago
> If you read a book and later understand its plot but can only explain it in your own words, did you copy it?
I think that is the center of the conversation.
What does it mean for a computer to "understand"?
If I wrote some code that somehow transformed the text of the book and never returned the verbatim text but somehow modified the output, I would likely not be spared, because the ruling would likely be that my transformation is "trivial".
Personally, I think we have several fixes we need to make:
1. Abolish the CFAA.
2. Limit copyright to a maximum of 5 years from date of production with no extension possible for any reason.
3. Allow explicit carveout in copyright for transformational work. Explicitly allow format shifting, time shifting, yada yada.
4. Prohibit authors and publishers from including the now obviously false statements like "No part of this publication may be reproduced, distributed, or transmitted in any form or by any means, including photocopying, recording" bla bla bla in their works.
5. I am sure I am missing some stuff here.
For brand protection, we already have trademark law. Most readers here already know this, but we really should sever the artificial ties we have created between patents, trademarks, and copyright.
The model doesn't "understand its plot". So I am not sure this is a good analogy.
ilikehurdles · 3h ago
To what extent connections in a neural network are analogous to connections between neurons in your brain is open to interpretation and study, but the point of the analogy is that in neither case is a copy being made.
amlib · 2h ago
I can arrange a series of bricks in many ways to try to build a wall, but that doesn't mean I will automatically get a good result if my process (like an ML training algorithm) doesn't precisely arrange them in a manner that produces a rigid wall with the desired characteristics. In the same vein, you can have a fancy neural network arranged by some fancy LLM training algorithm with gobs of data about a subject, but current methods likely won't produce anything with the depth of "understanding" that a human can. It's a crumbly wall that falls over once you do any real inspection or put any real load on it.
jamiek88 · 2h ago
Yeah, but a copy IS made. A human just reads. The machine copies the full text, then compresses a lossy copy into its weights. You keep dodging that with tortured analogies to human learning.
I'm sure all these 'clever' questions would be useful if this trial were about humans, but it's not.
Workaccount2 · 2h ago
Model training works roughly by feeding the model a text excerpt and then hiding the last word in the excerpt. The model is then asked to "guess" what the final word is. It then moves its weights around until the guess sufficiently matches the actual token. Then the process repeats.
The training material is used to play this guessing game to dial in the weights. The training data is picked up, used as reference material for the game, and then discarded. It's hard to place this far from what humans do when reading, because both are using the information to mold their respective "brains" and both are doing an acquire, analyze, discard process.
At no point is training data actually copied into the model itself; it's just run past the "eyes" of the model to play the training game.
__loam · 1h ago
This trial has nothing to with how the brain works and even if they did work the same, humans obviously have different legal rights than a computer.
triceratops · 2h ago
You didn't copy it because you're a human. A computer can't "read" without copying. It's how it works.
freejazz · 2h ago
Great question for a different litigation actually involving humans.
waynesonfire · 3h ago
I just happened to read the Phoenix Technologies Wikipedia page a few days ago. This company is known for developing BIOS software for computers. Maybe you've seen their logo when you first turn on your computer.
In early computing, everything was closed source. Quoting the Wikipedia page:
> To develop a legal BIOS, Phoenix used a clean room design. Engineers read the BIOS source listings in the IBM PC Technical Reference Manual. They wrote technical specifications for the BIOS APIs for a single, separate engineer—one with experience programming the Texas Instruments TMS9900, not the Intel 8088 or 8086—who had not been exposed to IBM BIOS source code.
The legal team at Phoenix deemed it inappropriate to "recall source in their own words" for legal reasons.
My non-legal intuition is that these companies training their models are violating copyright. But, the stakes are too high--it's too big to fail if you will. If we don't do it, then our competitors will destroy us. How do you reconcile that?
int_19h · 2h ago
You're right that if we want to have usable LLMs at all, there's no way around training them on copyrighted materials. So it has to be allowed, but in a way that compensates the original authors somehow. For example, every model provider has to publicly declare all works used for training, and then all inference providers offering that model have to collect a per-token tax that gets distributed to authors in proportion to their presence in the dataset (by the by, this could also be a way to fund websites like Wikipedia).
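To make the proportional-payout part concrete, here's a toy sketch (every name and figure below is hypothetical; the hard part is the legislation, not the arithmetic):

```python
def distribute_levy(levy_pool_usd, tokens_by_author):
    """Split a collected per-token tax among authors in proportion
    to their share of the declared training set (all inputs hypothetical)."""
    total = sum(tokens_by_author.values())
    return {author: levy_pool_usd * count / total
            for author, count in tokens_by_author.items()}

# e.g. one billing period's levy split across three declared sources
print(distribute_levy(1_000_000.00, {
    "wikipedia": 450_000_000,  # tokens contributed to the declared training set
    "author_a": 40_000_000,
    "author_b": 10_000_000,
}))
```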
But any such arrangement needs to be hammered out by the legislature. As laws are, I think it's pretty clear that infringement is happening.
SoftTalker · 2h ago
Perhaps Phoenix just looked at the potential adversary (IBM) and decided to approach the project in an exceedingly cautious way, knowing that IBM could litigate it forever if there were any plausible argument that they "copied" even a line of code.
HWR_14 · 3h ago
"Of course data is copied during training" is copying. As far as I know, the law is consistent that temporary copies are also covered by the copyright act, and that's how some analogous cases were resolved.
Dylan16807 · 2h ago
Temporary copies are in the scope of copyright law, yes. But also, you are allowed to make them. Or reading a book via a computer would be illegal.
triceratops · 2h ago
> But also, you are allowed to make them.
Not of physical media. You're allowed to make archival copies of digital media.
> Or reading a book via a computer would be illegal
No you purchased a license (or your library did, in the case of e-borrowing) to read the book on a computer. That makes it legal.
Dylan16807 · 2h ago
I am allowed to point a webcam at my physical book and read off the screen, even though that makes digital copies of all the text.
triceratops · 2h ago
FWIW the essence of copyright law AIUI is: copying is not permitted, unless done in a form explicitly allowed by the license holder.
This scenario seems quite contrived but is there an actual court precedent allowing it? I'm 100% confident no one will ever prosecute you for doing it but that's not the same thing as "allowed".
This case was about pointing <recording equipment> at <content someone is allowed to access> and making a copy to transmit it to <only that person>. The Supreme Court held it to be illegal. There was a lot of money on the line, which is why it went so far.
Dylan16807 · 2h ago
As far as I can tell there is also a principle that you can have ephemeral copies as part of handling and processing the work.
Your example involves transmission and mine doesn't, and that's a whole different can of worms.
Also the result of that case was self-contradicting so it's not a great basis to build too much logic upon.
triceratops · 2h ago
> there is also a principle that you can have ephemeral copies as part of handling and processing the work.
I'm not aware of this principle. Where is it spelled out?
> Also the result of that case was self-contradicting
I agree the verdict was a travesty. An innovative business went to ridiculous lengths to stay on the right side of the copyright mafia (data centers with tiny individual TV antennas for each subscriber FFS!) while providing a better product and experience. They still had the hammer brought down on them.
Dylan16807 · 2h ago
> I'm not aware of this principle. Where is it spelled out?
Well, do CDs give you a license agreement that allows you to copy the data? I've never seen one. And it's nearly impossible to play a CD without that copying.
triceratops · 2h ago
> And it's nearly impossible to play a CD without that copying.
Exactly. It's how you are supposed to use the CD. That's not true for your example of a book on a webcam. You're supposed to read the book, not an image of the book.
That's not the same thing as copying for "processing".
Dylan16807 · 2h ago
Well you said explicit license earlier.
Would it be a violation to play back a record like a CD and have a digital buffer? That would be pretty silly.
triceratops · 57m ago
> explicit license
Selling you the CD explicitly grants you the right to play it using a CD player.
> Would it be a violation to play back a record like a CD and have a digital buffer?
I genuinely have no idea lol.
shawabawa3 · 2h ago
If I buy a book I'm free to print as many copies as I want inside my house
It becomes illegal if I try to distribute those copies
So the question is, does distributing an AI that has been trained on Harry Potter count as distributing Harry Potter?
triceratops · 2h ago
This is not correct. It's only true that no one will go to the effort of prosecuting you for keeping photocopies of books in your home. But copyright law doesn't allow you to do it.
OtherShrezzing · 3h ago
>That said, OP is referring to whether the resulting model is able to produce verbatim copies of the data.
The NYTimes in 2023 was able to demonstrate that the models can reproduce entire articles verbatim[0] with minimal coercion.
Perhaps this is not evidence that the NY Times article was copied, but that what the NY Times writes is highly predictable.
OtherShrezzing · 57m ago
The obvious test for this would be to have the models produce an article from before and after its cutoff date and see if the output is still verbatim.
It would be a remarkable quirk of statistics that, if given all text on the internet except for the NYTimes back catalogue, a model would produce any NYT article.
dragonwriter · 1h ago
> That said, OP is referring to whether the resulting model is able to produce verbatim copies of the data.
While a tool being used to create infringing copies of some other work (whether or not that work is the source material used to create the tool, and whether or not the infringing material is a verbatim copy) is relevant to whether the tool vendor is liable for contributory infringement over the infringing use of the tool, the absence of a capacity for creating such copies isn't usually enough to say that the copying done to make the tool isn't itself infringing.
(That said, generative AI tools, including LLMs specifically, have been shown to have the capacity to make such copies, to the extent that vendors of hosted models are now putting additional checks on output to try to mitigate the frequency with which verbatim copies of substantial portions of training-set works are produced, so arguing that LLMs can't do that is silly.)
jayd16 · 2h ago
So if they could produce verbatim segments, that would be a violation? The technology is certainly there and these companies need to work backwards to prevent that.
moomin · 2h ago
A good way of thinking about this is: consider the case where the data in question is illegal. Could you get into trouble for not only having access to it but also making copies of it?
There’s plenty of case law there…
lsaferite · 27m ago
I would argue that an individual, real person obtaining content without a license and personally consuming that content is significantly different from a corporation doing the same. My rationale is that distribution of that content is (or should be) the primary offense. If I work for a company and they direct me to collect a bunch of content without a license, and I then pass that to other members of my team to train a model, I've now distributed that content at the direction of my employer. That should be the offense the company is tried for.
Using content to train an LLM is not copying the content. (I'm ignoring the silly "but actually" arguments about the content being in RAM, so it's "copying".) It's using the content to generate a statistical model of token (word-ish) relationships and probabilities. If you write content that is unusually original in its wording and I train an LLM against it, then there is certainly the possibility that the LLM could be provoked to recall the exact words you used. You'd have to set the parameters just right to make it happen, and I think that proper training would drastically lower, if not remove, that possibility. But even if it doesn't, the LLM doesn't have a copy of that original content. All it has is weights representing those relationship probabilities. Yes, the minutiae are more complex, but that is the essence. If my LLM were to generate enough of this essentially verbatim unique content and I tried to publish or copyright it, then I as the user should be on the hook. But then you get into a discussion about how many words in a unique sequence it takes to be infringement.
Obviously, I am not a lawyer.
My summation in all of this is that new laws need to be put into place to handle this stuff because the existing ones are sufficiently non-definitive and/or ill-suited such that every party is forming strong opinions about how old laws apply to new situations and causing massive friction.
crystal_revenge · 2h ago
> That said, OP is referring to whether the resulting model is able to produce verbatim copies of the data.
Transformers are fundamentally large compression algorithms where the target of compression is not just to minimize reconstruction loss + compressed file size. In fact, basically all of machine learning used today can be viewed through the lens of learning a compression algorithm with added goals other than the usual.
By this logic, if I create a lossy JPEG of a copyrighted image, it's not "copying" because of the lossy compression.
kergonath · 3h ago
That copying is already a violation. At least it was when regular people were on the receiving end of the lawsuits.
superkuh · 2h ago
The US Federal government operates under the rule that if human eyes don't look at it, it doesn't count as copying or looking at it. This allows them to unconstitutionally spy on and log all people's telecommunications. Applying that here, it seems pretty clear that corps are within the established bounds. As are any human persons who want to train an LLM this way.
triceratops · 2h ago
Someone engaged in large-scale unconstitutional spying does not give two fs about incidentally doing some copyright violations to achieve the spying. These are entirely orthogonal considerations.
triceratops · 3h ago
Copyright is the right to make copies. Why is copying during training any different from producing copies of training data after training?
If we're going that way, let me torrent every movie and TV show ever to "train" myself.
cgriswald · 3h ago
Copyright is defined in law and as the original poster stated, whether this is 'copying' as defined by copyright law is legally ambiguous.
Copyright doesn't protect against all forms of duplication. For instance, you own the copyright to your post and grant HN a license to offer copies of it. I have no direct license from you to copy the content of your post; but I can copy it to memory, copy a cache to disk, and make a copy appear on my display.
kergonath · 2h ago
> For instance, you own the copyright to your post and grant HN a license to offer copies of it.
It’s not a good example, because if you grant a license you give them the right to make copies. The problem is not when Meta got licenses, it’s when they did not.
Dylan16807 · 2h ago
This line of conversation is not specific to the pirated books but is making claims about AI training in general.
ants_everywhere · 3h ago
You can train yourself with every book at any library. You can also train yourself on a large number of movies and TV shows for a small monthly fee.
Where your analogy goes wrong is you're saying you want to "[circumvent] payment to obtain copyrighted material for training", to use Workaccount2's words.
triceratops · 2h ago
So Meta borrowed every book from a library and paid to obtain all of the movies and TV shows? They kept only one copy of every book at any time on their system?
Because I'm certainly not allowed to photocopy a library book in its entirety. And I guarantee you a Netflix subscription doesn't allow me to keep a copy of a movie on my hard drive and use it for training man or machine.
throwawaymaths · 2h ago
> Because I'm certainly not allowed to photocopy a library book in its entirety.
IANAL but that probably falls under fair use? You'll get in trouble if you photocopy the work and sell access to it.
3036e4 · 1h ago
Depends on where you live. In Sweden you can make a few copies of almost anything without violating copyright. There are a few exceptions. Copying entire books was added as an exception in 2005. You can still copy parts of a book. How large parts? I don't know, but I once asked for a copy from a library and they said that a few chapters was fine, so maybe that much (I am not a lawyer).
triceratops · 55m ago
> but that probably falls under fair use?
I've not found case law for that. I've had this same argument on HN multiple times over the past few months.
throwawaymaths · 3h ago
It depends on your license? I mean strictly speaking if you stream a video you purchase legally over say amazon prime, there's lots of "copying" happening at various levels after those bits leave the data center.
triceratops · 2h ago
> It depends on your license?
Exactly this. Legal copying requires a license.
foota · 3h ago
I don't think this is a reasonable argument. I don't think copyright is actually defined in that sense, but is perhaps more focused on consuming the content. Is an http proxy making a copy of something? What about computing an md5 of it as it's streamed through the proxy? Or maybe counting the words in the thing being served in order to track stats? I'd argue none of these fall under copyright, but each is an incremental step towards what it means to train a model.
triceratops · 2h ago
> I don't think copyright is actually defined in that sense, but is perhaps more focused on consuming the content.
I'm not a legal expert. My layman's understanding of the case above is Aereo was in violation because they made copies of content - content that the receiver was already allowed to access - available over the Internet to the intended receiver. That is to say, the copying was the problem.
atomicnumber3 · 3h ago
It's almost like information wants to be free
No comments yet
nashashmi · 2h ago
Copyright law does not restrict storing copyrighted information. It restricts distribution of copyrighted data without permission. So a computer can store and analyze data but cannot spit it out verbatim. If it spits it out under the fair use clause, then it becomes debatable whether the new work is fair use.
codedokode · 1h ago
Then why were folks arrested for filming in cinemas? I don't think that's how the law works [1]:
> 106. Exclusive rights in copyrighted works
> Subject to sections 107 through 122, the owner of copyright under this title has the exclusive rights to do and to authorize any of the following:
> (1) to reproduce the copyrighted work in copies or phonorecords;
And later:
> 501. Infringement of copyright
> (a) Anyone who violates any of the exclusive rights of the copyright owner as provided by sections 106 through 122 or of the author as provided in section 106A(a), ..., is an infringer of the copyright or right of the author, as the case may be.
To me it seems clear that Zuckerberg violated author's exclusive right to reproduce copyrighted works. The law doesn't say it is ok to do if nobody knows about it.
For curious, what is considered a "copy":
> “Copies” are material objects, other than phonorecords, in which a work is fixed by any method now known or later developed, and from which the work can be perceived, reproduced, or otherwise communicated, either directly or with the aid of a machine or device. The term “copies” includes the material object, other than a phonorecord, in which the work is first fixed.
So an SSD with LLM weights should also be considered a "copy" if from them the work can be "reproduced".
Don't they mean that LLMs cannot perfectly reproduce the source material?
tomrod · 2h ago
They're only stochastically lossy compression -- so sometimes they can.
lsaferite · 21m ago
Given a unique enough arrangement of words and a low enough entropy in token selection.
giancarlostoro · 3h ago
I have a weird, controversial view on how to do this legally: for your one model, you should only be required to buy a digital copy of the work. Maybe publishers should make digital copies that are tailored for LLMs to churn through, price them at a reasonable rate, and make the format basically perfect for LLMs.
romanzubenko · 2h ago
This is actually clever: let the market decide the price and the worth of each book for training. Pricing per model might be tricky; annual licensing for training might be a better pricing structure instead. Very quickly, all big publishers and big labs might find quite precisely what the fair price is to pay per book/catalogue.
Lerc · 3h ago
I'm not sure if Meta did anything illegal in 2. either.
I thought the copyright infringement was by the people who provided the copyrighted material when they did not have the rights to do so.
I may be wrong on this, but it would seem a reasonable protection for consumers in general. Meta is hardly an average consumer, but I doubt that matters in the case of the law. Having grounds to suspect that the provider did not have the rights might though.
lsaferite · 17m ago
Even if you believe that only the person *providing* the content is liable, do you honestly think a single person found all the content, downloaded it, directly trained the model themselves, and then deleted the content? If at any step the content was given or shared to anyone else for any reason, haven't they become a provider themselves?
singron · 3h ago
The original complaint alleges that the training process requires copying the material into the model and thus requires consent of the copyright holder. (Copyright protects copying but notably not use, so the complaint has to say they copied it in order to have standing). Then it says they didn't have consent.
They also mention Books3, but they don't appear to actually allege anything against Meta in regards to it and are just providing context.
I don't think it actually changes anything material about this complaint if Meta bought all the books at a bookstore since that also doesn't give you the right to copy the works.
The original complaint is 2 years old though, so I don't really know the current state of argumentation.
Note that incidental copying (i.e. temporary copies made by computers in order to perform otherwise legal actions) is generally legal, so "copying" in the complaint can't refer merely to this and must refer more broadly to the model itself being a copy in order to have standing.
rixthefox · 3h ago
> but it would seem a reasonable protection for consumers in general.
The final say may ultimately come from the Cox vs. Record Labels case from 2019 that is still working its way through the appeal courts.
If the record labels win their appeal, anyone who helped facilitate the infringement can be brought into a lawsuit. The record labels sued Cox for infringement by its users. It's not out of the question that any ISP that provides Internet connectivity to Facebook could be pulled in for damages.
For Meta these two cases could result in an existential threat to the company, and rightly so because the record labels do not play games. The blood is already in the water.
Dylan16807 · 2h ago
So they promise to all their ISPs not to torrent again, and the ISPs keep accepting big piles of money to provide service?
I don't see how that's a threat to Meta.
blibble · 2h ago
Blizzard managed to get a copyright infringement win against a defendant company that merely accessed their game client (IP) in memory: a cheat reading values of player position
IP that had been previously loaded by Blizzard itself
The source being illegal doesn't make your use legal. In fact, one could argue that it's equally illegal or worse, since a corporation knowingly engaged in illegal activity.
knowitnone · 2h ago
So you're saying I can legally download movies as long as I don't provide them to others? Sweet!
nashashmi · 2h ago
All AI is “trained” on existing works. But it also works by outputting altered copied data. This output part is a copyright violation.
startupsfail · 2h ago
It's weird that you are saying it's unambiguously illegal. AFAIK, in some cases the datasets used for training were initially created by non-profits and transformed sufficiently to strip the copyrights.
alangibson · 3h ago
> AI doesn't actually directly copy the material it trains on, so it's not easy to make this ruling.
IANAL, but it doesn't look that hard. On first glance this is a fair use issue.
What an LLM spits out is pretty clearly transformative use. But the fact that it pulls not only the entirety of the work, but the entirety of MOST works means that the amount is way beyond what could be fair use. Plus it's commercial use. Put it together and all LLMs are way illegal.
Dylan16807 · 2h ago
> the fact that it pulls not only the entirety of the work
What do you mean by "pulls"?
What matters in traditional fair use is how substantially your output copies the work (among other factors). Your input is generally assumed to be reading/watching/listening to the entire work, and there is no problem with that.
ndiddy · 5h ago
The title for this submission is somewhat misleading. The judge didn't make any sort of ruling; this is just reporting on a pretrial hearing. He also doesn't seem convinced as to how relevant downloading books from LibGen is to the case:
> At times, it sounded like the case was the authors’ to lose, with [Judge] Chhabria noting that Meta was “destined to fail” if the plaintiffs could prove that Meta’s tools created similar works that cratered how much money they could make from their work. But Chhabria also stressed that he was unconvinced the authors would be able to show the necessary evidence. When he turned to the authors’ legal team, led by high-profile attorney David Boies, Chhabria repeatedly asked whether the plaintiffs could actually substantiate accusations that Meta’s AI tools were likely to hurt their commercial prospects. “It seems like you’re asking me to speculate that the market for Sarah Silverman’s memoir will be affected,” he told Boies. “It’s not obvious to me that is the case.”
> When defendants invoke the fair use doctrine, the burden of proof shifts to them to demonstrate that their use of copyrighted works is legal. Boies stressed this point during the hearing, but Chhabria remained skeptical that the authors’ legal team would be able to successfully argue that Meta could plausibly crater their sales. He also appeared lukewarm about whether Meta’s decision to download books from places like LibGen was as central to the fair use issue as the plaintiffs argued it was. “It seems kind of messed up,” he said. “The question, as the courts tell us over and over again, is not whether something is messed up but whether it’s copyright infringement.”
bgwalter · 4h ago
The RIAA lawyers never had to demonstrate that copying a DVD cratered the sales of their clients. They just got high penalties for infringers almost by default.
Now that big capital wants to steal from individuals, big capital wins again.
(Unrelatedly, has Boies ever won a high profile lawsuit? I remember him from the Bush/Gore recount issue, where he represented the Democrats.)
Majromax · 3h ago
> The RIAA lawyers never had to demonstrate that copying a DVD cratered the sales of their clients. They just got high penalties for infringers almost by default.
The argument for 'fair use' in DVD copying/sharing is much weaker since the thing being shared in that case is a verbatim, digital copy of the work. 'Format shifting' is a tenuous argument, and it's pretty easily limited to making (and not distributing) personal copies of media.
For AI training, a central argument is that training is transformative. An LLM isn't intended to produce verbatim copies of trained-upon works, and the problem of hallucination means an LLM would be unreliable at doing so even if instructed to. That transformation could support the idea of fair use, even though copies of the data are made (internally) during the training process and the model's weights are in some sense a work 'derived' from the training data.
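As a toy sketch of that internal-copies point (everything here is invented and miniaturized; a real pipeline is vastly larger), the text is copied into tensors to compute a loss, and only the weight update persists:

```python
import torch
import torch.nn as nn

# Toy illustration (names and the loss are invented; the "model" is just an
# embedding layer): the text must be tokenized and copied into tensors in
# RAM/VRAM to compute a loss, but what persists afterwards is only the
# updated weights, not the copied text.
text = "an excerpt from some ingested work"
vocab = {w: i for i, w in enumerate(sorted(set(text.split())))}
ids = torch.tensor([vocab[w] for w in text.split()])  # a transient copy of the work

model = nn.Embedding(len(vocab), 16)                  # stand-in for an LLM
opt = torch.optim.SGD(model.parameters(), lr=0.1)

loss = model(ids).pow(2).mean()                       # placeholder training objective
loss.backward()
opt.step()                                            # the weights change...
del ids                                               # ...and the incidental copy is discarded
```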
If you analogize to human learning, then there's clearly no copyright infringement in a human learning from someone's work and creating their own output, even if it "copies" an artist's style or draws inspiration from someone's plot-line. However, it feels unseemly for a computer program to do this kind of thing at scale, and the commercial impact can be significantly greater.
pessimizer · 2h ago
> there's clearly no copyright infringement in a human learning from someone's work and creating their own output, even if it "copies" an artist's style or draws inspiration from someone's plot-line.
What do you mean here by "clearly?" This is not at all clear, and court cases have been decided in the opposite direction.
The "Blurred Lines" verdict is as far from what you say is "clearly" true as could possibly be. You're handwaving away the parts of the question that are difficult.
lsaferite · 7m ago
But that case involves publishing of content. If you want to compare them, it seems like you'd argue that the content of Pharrell Williams' head is somehow copyright infringement. I've yet to see anyone credible say that an AI model outputting exact content and then someone publishing that should be allowed. If you manage to make an AI model output copyrighted content, you can't then claim it's not copyrighted. If you sit on one side of a table and read a book out loud and I write it down (with minor transcription errors), that content is almost certainly still copyrighted content and I could not distribute it legally.
999900000999 · 4h ago
Meta needs these books.
They seek to convert them into more products. The needs of the copyright holders, who are relatively small businesses and individuals, are outweighed by the needs of Meta.
Sarah wanting to watch a movie or listen to music... Too bad she doesn't have an elite team of lawyers to justify whatever she wants.
In practice Meta has the money to stretch this out forever and at most pay inconsequential settlements.
YouTube largely did the same thing: knowingly violate copyright law, stack the deck with lawyers, and fix it later.
anshumankmr · 4h ago
Interesting figure that guy.
Here's this:
>Boies also was on the Theranos board of directors,[2][74] raising questions about conflicts of interest.[75] Boies agreed to be paid for his firm's work in Theranos stock, which he expected to grow dramatically in value.[75][3]
He was also the primary villain of John Carreyrou's account of Theranos' rise and fall -- Bad Blood -- as his firm attempted to bully and hound whistleblowers, and intimidate their families with baseless legal threats. Not a very nice or ethical guy.
anshumankmr · 2h ago
>also the primary villain
Hopefully second to Holmes, eh?
kranke155 · 4h ago
Copyright was invented (in its modern form) by corporations. It will be uninvented if need be for corporations.
Zambyte · 3h ago
I'm curious what you mean by "in its modern form". You seem to suggest there was a previous form that was not invented by corporations, but I don't believe that is the case.
ryoshu · 3h ago
Copyright was first established by governments and the Church prior to the invention of corporations.
nickpsecurity · 3h ago
There were two, major differences in prior forms of copyright:
1. It protected works to reward authors during their lifetime. This was changed to lasting a long time after the author was dead. Then, also for corporations that were only persons on paper and theoretically immortal. This shift let companies squeeze money out of monopolized ideas for over a century rather than supporting artists and their creations. Instead of supporting the small fish, copyright law can reinforce the dominance of the sharks and whales.
2. Copyright was shorter in the U.S., at 28 years with possible renewal. That balanced two goals: give the author time to make money off the work, and let society use the work in a timeframe where it would still matter to them. Now we can't have most works until long after they're useful in the market. We might not even speak the language they spoke, like older vs. current English.
Personally, I'd love to see a limit of 5-20 years on copyrighted works. If authors want more money, they can make more stuff. Allowing remixes of culturally and technologically relevant content will create huge, thriving ecosystems. I think my concept is also proven out by the open source ecosystem.
A limit would also be great for legal AI. We could train them on all human content up to 5-20 years ago. Tons of jobs would be created digitizing and optimizing that content. Then, companies would pay to create or license modern content that updated those foundation models. Under current law, it would be impossible for smaller companies to build highly-competitive A.I.'s due to licensing cost and arbitrary restrictions.
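A cutoff like that would be trivial to apply mechanically. A sketch, with an invented corpus structure and a hypothetical 20-year term:

```python
from datetime import date

# Sketch of the proposal above: anything past the (hypothetical) term is
# freely trainable; anything newer would need a paid license.
TERM_YEARS = 20
cutoff = date.today().year - TERM_YEARS

works = [
    {"title": "Old Novel", "published": 1998},
    {"title": "Recent Memoir", "published": 2021},
]
trainable = [w for w in works if w["published"] <= cutoff]
needs_license = [w for w in works if w["published"] > cutoff]
print([w["title"] for w in trainable])      # e.g. ['Old Novel']
print([w["title"] for w in needs_license])  # e.g. ['Recent Memoir']
```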
morkalork · 3h ago
The golden rule strikes again!
ImPostingOnHN · 4h ago
If I remember correctly the legal precedent from that era, and if I'm summarizing correctly: Those who served or uploaded were considered to be infringing, since they were "making copies" by serving or uploading, whereas those who downloaded infringing copies were not themselves infringers. Meta in this case is at least described by the latter, and the question is whether LLM generation constitutes the former.
fmblwntr · 4h ago
ironically, he was the head lawyer on the legal team for napster (obviously a huge loss) but it accords well with your theory
cma · 2h ago
That's because that was statutory infringement. Marketplace-impact questions come up more in fair use analysis ("drummer reacts to hearing the most famous drummer for the first time"). Courts look at whether the use acts as a substitute for the original, but the rules differ depending on the type of fair use, how transformative it is, and more.
No, but the law treats it as bad as stealing when an individual copyright-infringes from a corporation, so why shouldn't it be as bad as stealing when a corporation copyright-infringes from an individual?
Of course, even this isn't enough, since corporations regularly steal (actually) from individuals, with near impunity.
doctorpangloss · 2h ago
Hacker News readers want simple, first principles answers that fit in a tweet, that require no reading, let alone case law, to understand.
This trial is way beyond the statutes and case law. The judge is doing a job, and it's hard to conceive what the best job would be. I'm not sure Congress even knows what the policy should be, or whether the public has even the faintest whiff of how things should work.
pessimizer · 2h ago
> “It seems like you’re asking me to speculate that the market for Sarah Silverman’s memoir will be affected,” he told Boies. “It’s not obvious to me that is the case.”
"LLM, please summarize Sarah Silverman's memoir for me."
edit: Reader's Digest would be very surprised to know that they shouldn't have been paying for books.
Dylan16807 · 2h ago
If you do that, it won't be able to give you a summary detailed enough to infringe anything.
pessimizer · 2h ago
It may give me a summary good enough that I don't have to buy the book, since it read the book. If there are any parts that aren't detailed enough for me, I can ask them to be expanded.
If you're telling me that's not "infringing," you should follow that up with the argument for why it is not.
Dylan16807 · 2h ago
Lots of websites will give you summaries of books and they never get sued for that, let alone lose.
If you ask the LLM for the summary to be expanded much, and you're not providing it with a fresh copy of the book to reference, it's going to be wrong.
selfselfgo · 4h ago
To me it's a totally insane argument from the judge: if the test is whether it stops the authors from making money on their works, then the judge is basically capping the income of all writers. The AI is totally useless without their work, and yet they have to prove it's hurting their profits. These authors are entitled to derivative uses of their writing; if they're not, it's a total farce.
granzymes · 1h ago
Title seems misleading after reading the article.
gtowey · 1h ago
It's mind blowing to me that the court might deny the right of the authors to control licensing of this kind of usage of their work.
TimPC · 5h ago
I think the headline is a bit misleading. Meta did pirate the works but may be entitled to use them under fair use. It seems like the authors are setting themselves up for failure by making the case about whether the AI generation hinders the market for books. AI book writing is such a tiny segment of what these models do that, if needed, Meta would simply introduce guard rails to prevent copying the style of an author and continue to ingest the books. I also don't think AI-generated fiction is anywhere near high enough quality to substantially reduce the market for the original author.
stego-tech · 5h ago
The problem is that "harm" as defined by copyright law is strictly limited to loss of sales due to breach of that copyright; it makes no allowance (that I know of) for livelihoods lost indefinitely to the theft of the work, which is what AI boosters suggest their tools can do (replace people). The way this court case is going, it's an uphill battle for the plaintiffs to prove concrete harm in that very narrow context, when the real harm is the potential elimination of their future livelihoods through theft rather than immediately tangible losses.
As a (creative) friend of mine flatly said, they refuse to use an LLM until it can prove where it learned something from/cite its original source. Artists and creatives can cite their inspirational sources, while LLMs cannot (because their developers don't care about credit, only output) by design. To them, that's the line in the sand, and I think that's a reasonable one given that not a single creative in my circles has received payment from these multi-billion-dollar AI companies for the unauthorized use of their works in training these models.
ijk · 3h ago
A difficult, but not intractable problem: OLMoTrace claims to be able to trace from output to training data in seconds [1]. Notably, it can do this because OLMo itself was intentionally designed to be open and transparent [2]; it was trained on 4.6 trillion tokens of entirely open data (which you can download yourself) [3]. There's nothing stopping Meta or OpenAI from creating a similar tool, other than the obvious detail of that showing their exact training data.
I love it! Keeping this in my back pocket for the next time someone claims that keeping an accounting of training data and sourcing it isn't feasible or technically possible.
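For anyone curious what that tracing looks like in miniature: OLMoTrace does exact-match span search at scale over the real training tokens; the toy below only shows the shape of the idea (all names invented):

```python
from collections import defaultdict

def build_ngram_index(corpus, n=5):
    """Map every n-token span in the corpus to the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in corpus.items():
        toks = text.split()
        for i in range(len(toks) - n + 1):
            index[tuple(toks[i:i + n])].add(doc_id)
    return index

def trace(output, index, n=5):
    """Return which corpus documents share an n-token span with a model output."""
    toks = output.split()
    hits = set()
    for i in range(len(toks) - n + 1):
        hits |= index.get(tuple(toks[i:i + n]), set())
    return hits

corpus = {"doc1": "the quick brown fox jumps over the lazy dog",
          "doc2": "a completely unrelated sentence about copyright law"}
idx = build_ngram_index(corpus)
print(trace("he said the quick brown fox jumps again", idx))  # {'doc1'}
```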
lopis · 4h ago
> Artists and creatives can cite their inspirational sources
Even humans have a lot of internalized unconscious inspirational sources, but I get your point.
msabalau · 3h ago
AI boosters can suggest whatever nonsense strikes their fancy, and creatives can give in to fear for no reason, but the best estimates we have from the BLS are that there are careers and ongoing demand for artists, writers, and photographers.
Regardless, deep learning models are valuable because they generalize within the training data, uncovering patterns and features and relationships that are implicit rather than (simply) present in the data. While they can return things that happen to be within the training set, there is no reason to believe that any particular output is literally found there, or is something that could be attributed, or that a human would ever attribute. Human artists also make meaning from the broad texture of their life experiences and a general, diffuse, unattributable experience of culture.
Sure, this is something a random artist is unlikely to know, but if they simply refuse to pick up useful tools that can't give credit--say, avoiding LLMs for brainstorming, or generative selection tools for visual editing, or whatever--their particular careers will be harmed by their incurious sentimentality, and other human artists will thrive because they know that tools are just tools, and it is the humans using the tools who make the meaning people care about.
tedivm · 5h ago
Github does give free copilot access to open source developers it considers important enough (which is a pretty low bar). While not the same as actually paying, it's the only example I can think of where the company that used people's copyrighted material actually gave something back to those people.
bgwalter · 4h ago
They want to track and utilize the new code that those developers are writing. And they want to keep them on GitHub. And they want to claim in potential lawsuits:
"See, those developers themselves have used CoPilot, so they approve the copyright infringement."
_aavaa_ · 4h ago
What you’re describing is the Extend phase of Microsoft’s plan.
mistrial9 · 4h ago
the educated and erudite can wait in line near the Castle; every day due to the grace of our masters, unused bread from the master's table is available without prejudice. These people can have a fine life, and the people are fulfilled.
immibis · 1h ago
Well, no, because all those file-sharing users used to get fined $250,000 or whatever, which is obviously much greater than the amount they would have paid for whatever they downloaded.
subscribed · 4h ago
Your friend might want to check Perplexity.
lopis · 4h ago
While Perplexity is able to show sources for the information it presents, the language part, and the body of text upon which it was trained, is a black box; those sources are not given, nor typically desired by users.
apercu · 4h ago
> but may be entitled to use them under fair use.
Why? Was it legal for me to download copyrighted songs from Limewire as "fair use"? Because a few people were made examples of.
I'm a musician, so 80% of the music I listen to is for learning so it's fair use, right? ;)
Filligree · 4h ago
> I'm a musician, so 80% of the music I listen to is for learning so it's fair use, right? ;)
I would be happy with that outcome. I’m a fanfiction writer, and a lot of the stories I read are very much for learning. ;-)
lukeschlather · 4h ago
I don't believe anyone was ever penalized for downloading, only for uploading, which seems like a pretty similar principle to what the judge is saying here.
sillysaurusx · 2h ago
Heh. People were penalized for merely creating search engines that happened to link to songs. Supposedly the RIAA accepted the offer of a 20-something’s life savings, but only if they switched their major from CS to something else. I believe it, having witnessed those times.
If the result of this becomes that substantial remixes and fanfiction can be commercialized without permission from authors then I am happy. This stuff should have been fair use to begin with. Granted it probably already is fair use but because of the way copyright is enforced online it is effectively banned regardless.
immibis · 1h ago
If you used it for fair-use purposes, it could well have been legal. The only way to find out for sure would be to have them sue you, and then successfully or unsuccessfully defend yourself with a fair use argument. Please keep in mind that the law is a kind of stochastic process; "how illegal" something is dependent on how many times someone is found liable for it, which is something that takes a bunch of lawsuits to actually know, and each lawsuit is unique. It's not a computer program where if(X && Y && !Z) then punishment(); (well it sort of is, but X and Y and Z aren't definite boolean values, but things that have to be estimated based on evidence). (I am not a lawyer and this is not legal advice)
gabriel666smith · 4h ago
I think there's a really fundamental misunderstanding of the playing field in this case. (Disclaimer that my day job is 'author', and I'm pro-piracy.)
We need to frame this case - and the ongoing artist-vs-AI stuff - using a pseudoscience headline I saw recently: 'average person reads 60k words/day'.
I won't bother sourcing this, because I don't think it's true, but it illustrates the key point: consumers spend X amount of time/day reading words.
> It seems like the authors are setting themselves up for failure by making the case about whether the AI generation hinders the market for books. AI book writing is such a tiny segment of what these models do that, if needed, Meta would simply introduce guard rails to prevent copying the style of an author and continue to ingest the books.
and from the article:
> When he turned to the authors’ legal team, led by high-profile attorney David Boies, Chhabria repeatedly asked whether the plaintiffs could actually substantiate accusations that Meta’s AI tools were likely to hurt their commercial prospects. “It seems like you’re asking me to speculate that the market for Sarah Silverman’s memoir will be affected,” he told Boies. “It’s not obvious to me that is the case.”
The market share an author (or any other artist type) is competing with for Meta is not 'what if an AI wrote celebrity memoirs?'. Meta isn't about to start a print publishing division.
Authors are competing with Meta for 'whose words did you read today?' Were they exclusively Meta's - Instagram comments, Whatsapp group chat messages, Llama-generated slop, whatever - or did an author capture any of that share?
The current framing is obviously ludicrous; it also does the developers of LLMs (the most interesting literary invention since....how long ago?) a huge disservice.
Unfortunately the other way of framing it (the one I'm saying is correct) is (probably) impossible to measure (unless you work for Meta, maybe?) and, also, almost equally ridiculous.
internetter · 4h ago
> Meta did pirate the works but may be entitled to use them under fair use
What fair use? Were the books promised to them by god or something?
TimPC · 4h ago
Fair use allows for certain uses of copyrighted works without a specific license for those works. One of the major criteria is how transformative the use is, and an LLM model is very different from the original work, so it seems likely that criterion at least is met.
SideburnsOfDoom · 4h ago
> an LLM model is very different from the original work
True, but not the only relevant thing.
If the output of the LLM is "not very different from the original work," then the output could be the infringement. Putting a hypercomplex black box between the source work and the plagiarised output does not in itself make it "not infringing". The "LLM output as a service" business is then based on selling something built on other people's work that they do not have rights to.
It's falling for misdirection, "pay no attention to the LLM behind the curtain" to think otherwise.
Filligree · 4h ago
The output of the LLM is very different from the original, though. It’s hard to look at this and claim it isn’t.
int_19h · 2h ago
In the general case, yes, but they can verifiably reproduce at least some copyrighted works verbatim, which implies, at the minimum, that their content is stored in model weights in some fashion.
SideburnsOfDoom · 4h ago
> The output of the LLM is very different from the original, though
I will disagree with that characterisation. IMHO: In some cases no, it's not different, there are clear lines from inputs to output. In some cases yes, it's different from any one input work, it's distributed micro-plagiarism of a huge number of sources. In no case is it original.
But I think that this is legally undecided and won't be decided by you or me, and it is going to be a more interesting and relevant question than "is the LLM model is very like the original work", which it clearly isn't. That's like asking "is this typewriter like this novel?" It can't be, but the words that came out of it could be.
ijk · 2h ago
Yeah, the way the courts decide is unlikely to turn on a detail of how the technology works, so it's difficult for non-legal experts to predict the outcome on the technical merits (since the law has very different priorities).
Music has ended up in a place where short audio snippets are protected by copyright and must be licensed; but for short snippets of text, the precedent has generally been that the copying needs to be more substantial. Distributed microplagiarism of short phrases might end up being ruled legal, even if wholesale reproduction is not. Which may not give copyright protection to the generated works, of course, as the question of machine authoring is entirely distinct.
sillysaurusx · 2h ago
> In some cases yes, it's different from any one input work, it's distributed micro-plagiarism of a huge number of sources. In no case is it original.
That’s like saying the dictionary is micro-plagiarism of a huge number of sources because it uses all the words from those sources.
SideburnsOfDoom · 2h ago
I disagree, you can't ask a dictionary to "generate 2000 words in the style of (author)".
sillysaurusx · 2h ago
So? Why is that, of all things, the crux of whether it’s copyright infringement?
Plagiarism isn’t necessarily copyright infringement, and plagiarism isn’t illegal. Copyright infringement is.
Even still, your argument that everyone who generates 2,000 words in the style of (author) is plagiarizing is also flatly false. By that standard all English essays that mimic someone else’s style would be plagiarism.
matkoniecz · 4h ago
"fair use" is a specific legal term
internetter · 3h ago
I'm aware. I was unsure under what doctrine of fair use Meta's behaviour could be defended.
What am I, if not an LLM, ingesting copyrighted materials so that I may improve my own future outputs? Why is my own piracy not protected in the same manner?
ta1243 · 3h ago
> Why is my own piracy not protected in the same manner?
You aren't a multi-billion dollar company
ta1243 · 3h ago
In a specific legal jurisdiction.
The Berne convention mentions "fair practice", and puts the responsibility on the individual countries.
martin8412 · 1h ago
I wonder what will happen when inevitably someone sues one of the AI companies for copyright infringement in a different country, because fair use is to my knowledge an entirely American concept.
Where’s the threshold for forcing AI companies to retrain models without specific copyrighted works in them?
onlyrealcuzzo · 3h ago
> I also don’t think AI generated fiction is anywhere near high quality enough to substantially reduce the market for the original author.
Legal cases are often based on BS, really an open form of extortion.
The plaintiffs might've been hoping for a settlement.
Meta could pay $xM+ to defend itself.
Maybe they thought Meta would be happy to pay them $yM to go away.
The reality is, there's very little Meta couldn't just find a freely available substitute for if it had to; it might just take a little more digging on their end.
The idea that any one individual or small group is so valuable that they can hold back LLMs by themselves is ridiculous.
But you'll find no end to people vain enough to believe themselves that important.
SideburnsOfDoom · 4h ago
Firstly, no kidding, of course it's "illegal" and "Piracy".
Secondly, there's an argument that the infringement happens only when the LLM produces output based in part or whole on the source material.
In other words, training a model is not infringing in itself. You could "research" with it. But selling the output as "from your model" is highly suspect. Your business is then based on selling something built on other people's work that you do not have rights to.
aurizon · 4h ago
Yes, current AI video/text product is inferior at this time. Youtube is full of all genres of inferior products - at this time!
The ramp of improvement is pointing quite steeply upwards. This is reminiscent of the days of spinning jennies and knitting/weaving machines that soon made manual products uneconomic - that said, excellent craft/art product endured on a smaller scale.
AI is also taking a toll on the movie arts, starting at the low end and climbing the same incremental improvement rungs. All the special effects (SFX) are in a similar boat. Prop rentals are hit hard. 100 high-res photos of an old studio TV camera - all angles/sizes/lighting - can be added to an AI prop library, and with a green screen insert the prop can manifest as a true object in any aspect. There can be many. It still takes people to cull the hallucinations - a declining problem. Same with actors. They can be patterned after a famous actor - with likeness fees - or created de novo. All the classic aspects of a studio production suffer the same incremental marginalisation. In 5 years, what will remain? What new tech will emerge?
I feel that many forks will emerge, all fighting for a place in the sun; some will be weeded out, some will flower - but at a very high pace.
The old producers/directors/writers - the whole panoply of what makes a major studio - will be scattered like dried bread crumbs.
kazinator · 4h ago
Do you not understand that "fair use" is not some copyright free-for-all which lets you use works wholesale without attribution as if they were suddenly public domain?
To make fair use of a book's passage, you have to cite it. The excerpt has to be reasonably small.
Without fair use, it would not be possible to write essays and book reviews that give quotes from books. That's what it's for. Not for having a machine read the whole book so it can regurgitate mashups of any part of it without attribution.
Making a parody is a kind of fair use, but parodies are original expression based on a certain structure of the work.
danaris · 3h ago
> To make fair use of a book's passage, you have to cite it.
That's not true. That's what's required for something not to be plagiarism, not for something not to be copyright infringement.
Fair use is not at all the same as academic integrity, and while academic use is one of the fair use exceptions, it's only one. The most you would have to do with any of the other fair use exceptions is credit where you got the material (not cite individual passages), because you're not necessarily even using those passages verbatim.
kazinator · 1h ago
If your "fair use" is
- of a commercial nature;
- plagiarism;
- substantially large (e.g. whole work);
you're not on good legal footing.
dragonwriter · 2h ago
This is the source headline, but it is pure clickbait; the judge absolutely did not say that in any of the quotes in the article. In the hearing on both parties' motions for partial summary judgment, he both said that this would be the case if the plaintiffs proved certain facts and raised doubts that they have the evidence to prove them.
ryandrake · 3h ago
AI hucksters vs. the Copyright Cartel. When two evil villains fight, who do you root for? Here's hoping they somehow destroy each other.
pessimizer · 2h ago
Last time they fought, it was google vs. the publishers, and it resulted in the scanning and archiving of all of those books in the first place.
Neither of them died, though, both parties just kept all the books from the public and used them for their own purposes, while normal people had to squirrel them away and trade them illegally. It's the Tech Cartels vs. the Copyright Trolls. It'll end up as a romance.
probably_wrong · 2h ago
You can always root for the lawyers.
akomtu · 1h ago
The copyright cartel is going to lose because it represents the dying old world order of many competing enclaves. AI isn't just a sloppy text generator, it's the new ideology of forced uniformity that permits no boundaries. So no copyright cartels.
thomastjeffery · 2h ago
I can only root for them both to lose.
Letting Meta launder copyrighted works to make billions, while threatening the rest of us over the most trivial derivative work, sounds like the worst outcome to me.
Copyright is a mistake. It demands that we compete instead of collaborate. LLMs don't provide enough utility to deserve special treatment in these circumstances. If anyone can infringe copyright, then everyone should be able to.
labrador · 5h ago
"Chhabria is cutting through the moral noise and zeroing in on economics. He doesn't seem all that interested in how Meta got the data or how “messed up” it feels—he’s asking a brutally simple question: Can you prove harm?"
Where was this argument when Napster was being sued?
ebfe1 · 3h ago
And this is how Chinese models will win in the long term, perhaps... They will be trained on everything and anything without consequences, and we will all use them because these models are smarter (except in areas like Chinese history and geography). I don't have the right answer on what can be done here to protect copyright, or rather to contribute back to authors, without all these millions of dollars wasted in lawsuits.
TrnsltLife · 2h ago
Reading the books changes the weights of the neural network. If ruled illegal, wouldn't it also become illegal for a human to read an illegally downloaded book? So far, I thought just redistribution was illegal.
Will the neural network (LLM) itself become illegal? Will its outputs be deemed illegal?
If so, do humans who have read an illegally downloaded book become illegal? Do their creative outputs become illegal?
delecti · 2h ago
Books are sold for the purpose of people reading them, including all the normal consequences that happen from a person reading a book. AI training being analogous to that doesn't unlock some cheat code that makes it legal, or reading books illegal. It might indeed be found legal, but not for that reason.
codedokode · 1h ago
It's a typical double-standards policy: Google and GitHub remove links to pirated material (and the pirated material itself) so that ordinary folks cannot download it for free, but when Zuckerberg downloads gigabytes of pirated material without paying, it's OK. The legal system doesn't want to put an ordinary folk and Zuckerberg on the same level.
Also I read that ordinary folks have been arrested for filming in the cinema even if they did not redistribute the video (due to being arrested). Again, it is unfair why they get arrested and Zuckerberg doesn't.
jayd16 · 2h ago
So to the legal peanut gallery here...
What is the substantive difference between training a model locally using these works that are presumably pulled in from some database somewhere and Napster, for example?
Would a p2p network for sharing of copyrighted works be legal if the result is to train a model? What if I promise the model can't reproduce the works verbatim?
caminanteblanco · 3h ago
I feel like this submission does a disservice by changing the title from the article's. It is misleading and implies that the judge has already given a ruling, when they have not.
codr7 · 1h ago
Meta did what they always do, whatever they think they can get away with.
moregrift · 1h ago
This is the main reason why Chinese AI will be better than Western AI in the long term - Chinese companies can train on higher quality dataset (all the copyrighted books in the world)
nottorp · 22m ago
Only Meta?
zoobab · 2h ago
Just went to the public library and read a book to train my brain, without permission from the authors.
> “What about the next Taylor Swift?” he asked, arguing that a “relatively unknown artist” whose work was ingested by Meta would likely have their career hampered if the model produced “a billion pop songs” in their style.
I have this debate with a friend of mine. He's terrified of AI making all of our jobs obsolete. He's a brilliant musician and programmer both, so he's both enthused and scared. So let's go with the Swift example they use.
Performance Artists have always tried to cultivate an image, an ideal, a mythos around the character(s). I've observed that as the music biz has gotten more tough, that the practice of selling merch at shows has ramped up. Social media is monetized now. There's been a big diversification in the effort to make more money from everything surrounding the music itself. So too will it be with artists.
You're starting to see this already. Artists who got big not necessarily because of the music, but because of the weird cult of personality they built. One who comes to mind is Poppy, who ironically enough built a cult of personality around being a fake AI bot...
You've definitely got counter-examples like Hatsune Miku - but the novelty of Miku was because of the artificiality (within a culture that, like, really loves robots and shit). AI pop stars will undoubtedly produce listenable records and make some money, but I don't expect that they will be able to replace the experience of fans looking for a connection with an artist. Watch the opening of a Taylor Swift concert, and you'll probably get it.
atrus · 4h ago
I think that argument is further hampered (taylor being an exception) by the fact that most pop stars already don't write their own songs. If people like Max Martin can pump out multiple hit songs for multiple groups, it kinda shows that who wrote the song doesn't matter.
Has making music for a living ever not been tough?
RajT88 · 4h ago
> Has making music for a living ever not been tough?
Fair.
> I think that argument is further hampered (taylor being an exception) by the fact that most pop stars already don't write their own songs.
That accounts for the big artists on the radio (yes, some people listen to that). But what about everyone else? I would posit that most artists (the one-hit wonders, the ones without radio success, etc.) write their own songs. There are such acts who make a go of it just fine, who write their own songs and really nail the connection with fans. I would point to a regional band near me: Mr. Blotto.
reverendsteveii · 4h ago
counterpoint: Gorillaz is a band specifically designed around the idea that the artist doesn't have to exist in order to do all of the things you mention above. Gorillaz has an image, a style, a mythos, all of that. Granted, when they started (in 2000) there needed to be human creativity to create all of that, but with AI everything about this can now be generated. I suspect it won't be long before all of our theorizing goes out the window because someone actually creates an act where everything (the music, the stage show, interviews, looks, all of it) is AI-generated. That's when we'll get dollar votes on whether AI can actually generate a meaningful musical experience that people want to have.
RajT88 · 3h ago
Very relevant point. My own counter-example of Hatsune Miku is not totally AI generated - just generated using very sophisticated musical tools (see: Vocaloid).
There are some very impressive YouTubers who claim to be generating new music with AI. The one I listen to the most I very much doubt has everything 100% generated - he probably generates a bunch of melodies and other bits of track and stitches the best candidates together. They do crank out a new album basically every two weeks, though, and have just a scant few thousand followers. They are not making money, but the music is pretty on par with bands that sell hundreds of thousands or millions of albums.
This is part of what makes me think it's the people who can cultivate the mythos, the personality, the whole experience, who are going to be the big winners in the AI music economy. Sure, maybe Gorillaz obfuscates the identity of the artists (side note: do they though? it's well known to be a supergroup), but it is still a curated experience that human creativity was leveraged to create.
codedokode · 58m ago
Every "Hatsune Miku" song has a talented person behind that avatar. Miku is just a synth and 3D model.
thomastjeffery · 2h ago
Would Taylor Swift be famous without the support of her label? I sincerely doubt it. How many equally talented artists never got that backing?
We should be careful not to conflate the effects of copyright with the effects of advertising.
adingus · 5h ago
I'm wondering if authors are making the same mistakes that the music industry did with Napster and Kazaa. Using AI has led to more book purchases for me. If I discover and enjoy a book via AI, I'm more inclined to buy it. The cat's out of the bag, so pet him.
tacheiordache · 4h ago
Just look at the state of the music industry.
mtlynch · 4h ago
Can you share more about how you sample books with AI?
adingus · 2h ago
I don't really sample them, but if I want to know more about a subject I will normally ask for a book recommendation to go along with it.
nickpsecurity · 3h ago
It can tell you about authors, books, useful techniques, etc. If it cites references, that can generate page views on their site or sales. It can also replace that, though, with the AI supplier benefiting commercially.
throwacct · 4h ago
Only Facebook?!!
option · 2h ago
this is a huge issue AI companies in China do not have. the law must adjust now.
penguin_booze · 2h ago
Good. Now do OpenAI.
steele · 4h ago
In a just world, this would shutter the organization.
openplatypus · 1h ago
What is this just world you are talking about?
kazinator · 4h ago
The legal system is not going to be kind to the AI hucksters. Why? Because, quite stupidly and counterproductively, they have stepped on its toes by claiming that AI can replace lawyers. On top of that, there have been incidents of lawyers getting in hot water for generating slop instead of doing their work. So, this isn't just some distant, abstract tech issue for the lawyers and judges, like whether APIs should be copyrighted. If you're in any kind of business, you generally want these people to be on your side. Oopsies!
Mbwagava · 5h ago
Whether or not Meta wins this case, I'm never going to support any government that supports both LLMs and IP. Like, we have to put up with IP despite it having no clear value to a digital society, but as soon as it becomes inconvenient it goes out the window? Nah, let's just trash the state and start over.
It's going to take centuries to undo the damage wrought by IP-supported private enterprise. And now we also have to put up with fucking chatbots. This is the worst timeline.
labrador · 4h ago
I hope you don't think I'm snarky because I'm serious. If you're an American citizen you can homestead in Alaska and cut yourself off from all this if you like.
edit: i'm serious. many americans would be much happier taking this option if they knew it existed. i may take it myself
quesera · 2h ago
Growing food in Alaska sounds like a meager existence, though.
labrador · 1h ago
People hunt and fish mostly. That's a big part of the appeal.
sillysaurusx · 2h ago
Homesteading is tremendously expensive, unfortunately. Most people can’t.
labrador · 2h ago
I didn't know that, but in that case there are a lot of young men and women on HN who are financially successful but tremendously unhappy. That was the case for me when I looked into it 25 years ago.
jMyles · 4h ago
The good news is that the internet is, fundamentally and in a way that no legacy state can alter, not a place where IP is cognizable.
You are free to copy bytes as you see fit, and the internet treats them identically whether they are random noise or whether a codec can turn them into music, film, books, or whatever inspires you.
The problem is that some humans, justifying their behavior by claiming it as "official", may act out with violence against you if they (rightly or wrongly, that's important to note) perceive that your actions are causing the internet to copy bytes to which they object.
Enduring nonviolence is likely yet ahead as consensus grows over the end of the legitimacy of these legacy states.
Shouldn't they have to follow the law?
You're talking as if they are some kind of nationalized or publicly owned asset, as opposed to a bunch of for-profit, privately-owned silos.
Even if ChatGPT, Huggingface, etc. died, we would still have the models and we would still be able to run them.
Please do!!
My fear is that US tech won't be able to compete with state-sponsored open source out of China and will move to ban open source or suppress it somehow.
And Alibaba's Qwen team seems to be quite genuinely talented at "small" models, 32B parameters and below. Once you get Qwen3 properly configured, it punches well above its "weight class." I'm still running real benchmarks, but subjectively, the 32B model feels like it performs somewhere between 4o-mini and 4o on "objectively measurable" tasks. It's a little "stodgy" and formal by default, though. We'll see what it looks like when people start fine-tuning it.
If the US dropped off the planet, it would maybe set LLM technology back a year.
US AI companies will either make sure that a similar ruling will never be made or they will ignore it and pay the fines. They won't let anybody stop the gravy train.
You assume that getting tested means the AI trainers lose, and also that the model architectures that have been developed can't be retrained from scratch on public domain, owned, and purpose-licensed material. (Several AI companies have been actively pursuing deals to license content for AI training for a while now.)
End of the road for major AI companies, and hopefully something better can be created once it's declared illegal without any murky waters.
There are LLMs trained on data that isn't illegally obtained; OLMo by Ai2 is one such model, and it is actually open source and uses open data for training. That it's "very difficult" for OpenAI et al. shouldn't be an argument against making them behave ethically anyway. If they cannot survive acting legally, then so be it; sucks for them.
https://273ventures.com/kl3m-the-first-legal-large-language-...
So, it's really the majority of companies breaking the law who will be affected. Companies using permissible and licensed works will be fine. The other companies would finally have to buy large collections of content, too. Their billions will have to go to something other than GPUs.
Not really sure a claim is good enough. I don't know that you can just go into court and say, "Trust me, I don't use copyrighted material."
And I also can't see any way, other than providing training data and training an identically structured model on that data, that a company can conclusively show that they got the weights in an allegedly copyright free model from the copyright free training data a company provides.
If you did not use copyrighted materials for training, people will not be able to prove that you did, and that should be good enough.
It's a civil matter, not a criminal matter, so that doesn't apply.
There is benefit to using them, though. For one, they've tried really hard to be legal. That sets a positive example, shows good faith if they were sued, and reduces risk for those using them (good faith on our part). Also, one can be sure that they can ditch or replace any outputs in the long term if they're ruled illegal. So, we try not to use the A.I.'s in a way where losing access to them seriously damages our business.
That's the best I can offer until legal reforms happen.
If training, one can do it in Singapore on material one has legal access to. Their law pretty much lets you use anything for AI purposes so long as you can legally access it yourself. To further reduce the risk, companies should crawl it themselves, too, taking care to avoid risky sources.
So good luck finding the thing that looks exactly like your copyrighted work when it's not in the disclosed corpus; if you can, then yeah, you might be able to prove it.
At the end of the day it's like a lot of business, where a liability shell game is played out, and if the chain of evidence can't be drawn quite brightly then lawsuits would be frivolous at best.
Because the obvious question would be - how can free people compete with that?
Of course it does. Large models are trained on gigantic clusters. How can you train without copying the material to machines in the cluster?
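A sketch of the mechanics (file name, field, and cluster size are invented): every worker copies its slice of the corpus from disk into local RAM before a single gradient is computed, and the data is copied again from there:

```python
import json

# Each of `world_size` workers copies a disjoint slice of the corpus from disk
# into local RAM (and later into VRAM) before any gradient is computed. The
# file name, field name, and cluster size are all invented for illustration.
def load_shard(corpus_path, rank, world_size):
    with open(corpus_path) as f:
        docs = [json.loads(line)["text"] for line in f]  # copy #1: disk -> RAM
    return docs[rank::world_size]                        # this node's slice

# On a 64-node cluster, every document lands in some node's memory, and data
# loaders and interconnects typically copy it again (RAM -> GPU, node -> node).
shard = load_shard("corpus.jsonl", rank=3, world_size=64)
```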
If I produce a terrible shakycam recording of a film while sitting in a movie theater, it's not a verbatim copy, nor is it even necessarily representative of the original work -- muddied audio, audience sounds, cropped screen, backs of heads -- and yet it would be considered copyright infringement?
How many times does one need to compress the JPEG before it's fair use? I'm legitimately curious what the test is here.
That is why so-called derivative works are allowed (and even encouraged). If copyrighted material is ingested, modified or enhanced to add value, and then regurgitated, that is legal, whereas copying it without adding value is not.
If derivative works weren't deemed acceptable, copyright would have the opposite of its intended effect and become an impediment to progress.
Derivative works are not given a free pass from the normal constraints of copyright. You cannot legally publish books in the universe of A Song of Ice and Fire without permission from the author (and often publisher), calling them “derivative works.”
It’s why fan fiction is such a gray area for copyright and why some publishers have historically squashed it hard.
The exceptions for this are typically fair use, which requires multi-factor analysis by the judiciary and is typically decided on a case-by-case basis.
Derivative works are tolerated in some cases, like some manga or fanfics, but it is a gray area, and whenever the author or publisher wants to pursue it, it is fully their right to do so. Many do pursue it.
(You can get inspired by something, and this is where some arguments can happen if you get inspired too, mmm, literally, but no one will say with a straight face that inspiration is a thing that happens to software.)
So… it’s complicated. This is one of the weird areas where music copyright and other copyright seem to differ in the US.
In the US the situation is complex and there are a lot of weird special interests [0], but generally a composer/author of a song has the right to decide who first records and releases the song, but after the first recording covers require a mechanical license, which is compulsory (ie: the author cannot object).
In music there are _a lot_ of special cases and different rights are decided with different kinds of licenses, some of which are compulsory. I think it’s an area that doesn’t make for good analogies with copyright in other media.
0: https://en.m.wikipedia.org/wiki/Mechanical_license
If the work is "derivative" in the legal sense, it is subject to the original's copyright, and you may not create derivative works without the copyright holder's permission.
What I should have said is that simply being inspired by a work or copying unprotectable elements (like facts or ideas) does not create a derivative work.
For example, if ChatGPT were to generate Star Wars, except with Dookies instead of Wookies, that might be illegal. If it were to learn what a spaceship is from Star Wars and then create something substantially new, it would not. The key is that it must not be substantially similar to the original. You must add enough value that it becomes something new, not just a rehash of the original.
When model training reads the text and creates weights internally, is that a substantial transformation? I think there’s a pretty strong argument that it is.
The point here is that book files have to be copied before they can be used for training. Copyright texts typically say something like "No unauthorised copying or transmission in any form (physical, electronic, etc.)"
Individuals who torrented music and video files have been bankrupted for doing exactly this.
The same laws should apply when a corporation downloads torrent files. What happens to them after they're downloaded is irrelevant to the argument.
If this is enforced (still to be seen...) it would be financially catastrophic for Meta, because there are set damages for works that have been registered for copyright protection - which most trad-pubbed books, and many self-pubbed books, are.
Only if they seeded the data and some other entity downloaded it, i.e. they hosted the data. In a previous article I believe it was called out that Meta was being a leecher (not seeding back what they downloaded).
It's the hosting that gets you, not the act of downloading it.
The lender owns the book, and it is within his rights to loan it to whoever he wants. That is legal. Making this illegal would end libraries.
The borrower is well within his rights to accept the book, and as the current owner he is even allowed to make a copy of the book (see the famous TiVo case). Making this illegal would end backups and format/time shifting.
When the borrower returns the book, he keeps the copy. Oh no! Surely he must now become a criminal? Nope. Possessing an unauthorized copy is also not illegal, despite what many copyright holders would like you to believe. Making this illegal would also criminalize a lot of legitimate format/time shifting; again, see the famous TiVo case.
If the borrower were to loan his homemade copy to someone else THEN it would finally become illegal.
Nothing about AI changes any of this.
I download a torrent with a movie that I didn't pay for. If I don't allow it to seed, then I don't get in trouble. If I let it seed, either during the download process or after, I'd get a DMCA notice if that torrent/magnet link was being tracked.
I don't need a hypothetical book; that is just how it works if I were to download illegally obtained documents/media.
As technical as people are in this thread, easy to tell when folks didn't have their parents wondering why they were getting scary letters from the ISP.
But if you make a book's contents available online via some service that regurgitates its contents, you would be totally sus, because you can be considered to be in the business of selling derivative works.
Seems like a big gap there.
> Using software almost always involves creating copies, even though many of these copies only exist for a very short time. For example, executing a program means copying it from the hard disk into RAM so that the CPU can interpret the instructions. Because of this, the right to run a program is considered to fall under the copyright of the author.
For comparison, when a human looks at the letters, there is no copying.
Also, models can reproduce text verbatim which proves that they store it.
So it is unfair that ordinary folks got sued for this while Zuckerberg wants to get away with a million-times-larger violation. He must go directly to jail.
[1] https://www.iusmentis.com/copyright/software/rights/
How is this done? Are bits not written into RAM or disk? Are they not sent between machines in a training cluster? That's copying.
> it is seemingly not far removed from how humans consume content
Except that humans don't make full copies to RAM, or disk or paper.
AI doesn't need lasting copies to train; however, I don't know what the actual implementation is. But if it's ruled that they can only use copyrighted data if it's not stored for more than the time it would take a human to consume it, it wouldn't really cripple the models, but perhaps make training more logistically challenging.
It's important to understand that models are not data archives. They are statistical constructs built from getting quizzed, using human-made content to generate the quiz questions.
Wired explicitly sent that article to their computer for the purposes of reading it so it's not a copyright violation.
Images on your retina form exact copies.
They are scanned and translated into impulses that are then sent to a first set of "neural columns" - that's an exact copy.
This is then connected to the visual cortex by the two highest-bandwidth links in the human body ("the optic nerve" - there are two of them, of course; I've always wondered why everybody insists on using the singular). Why would you have that high-bandwidth link unless to create verbatim copies?
The way those columns are structured also very strongly suggests they make carbon copies, which they then make available on the "brain bridge" (which is probably at least vaguely similar to the "attention matrix" of a transformer). If it does work like that, that's also a verbatim copy.
The only way "humans don't make full copies to RAM" is that humans don't have separate RAM. The processing power is colocated with the processing, even on a microscopic level. You know, what everybody knows is the best way of doing things even in silicon, it's just incredibly impractical if you can't rebuild your circuit every time there's a slight change to the instructions your "computer" carries out (the brain is not a "Von Neumann architecture", except it kind of is when it regrows connections. But in the short term it isn't)
Not for the purposes of copyright law.
> is that humans don't have separate RAM [or disk]
And that turns out to be incredibly important. Humans can't create a lasting, shareable copy of a copyrighted work by consuming it.
The computer model works differently, of course, but functionally it's the same idea.
The test is if a judge says it is fair use, nothing else.
The judge will take into account the human factor in this matter, e.g. things like who did the actual work, and who just used an algorithm (which is not the hard part anymore; the code can be obtained on the internet for free). And we all know that DL is nowhere without huge amounts of data.
It seems like it is very much a matter of fidelity.
As mentioned in another comment, LLMs (and most popular machine learning algorithms) can be viewed, correctly, as compression algorithms which leverage lossy encoding + interpolation to force a kind of generalization.
Your argument is that a video wouldn't count as pirated if the compression used for the pirated copy was lossy (or at least sufficiently lossy). The closest real-world example would be the cases where someone records the screening of a movie on their phone and then uploads it. Such a copy is lossy enough that you can't produce anything really like the original, but by most definitions it is still considered copyright infringement.
You would never use a human to backup your financial reports, but the human might be able to give a good overview. You would never use an LLM to backup your financial reports, but they might be able to give a good overview.
AI training data is disposable. There is nothing that could be called a compression algorithm that discards all of the data you put into it. AI uses training data as examples of what the next token in a token sequence is. The examples are disposable reference points, not the model itself. That's how you get image models that are 20GB in size despite training on 20PB of data. It's 20PB of examples used to form the shape of a 20GB model. You could show it 5GB of training data or 500EB of training data and it would still be 20GB - because it is not a compression algo, it's a 20GB shape formed by external data.
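A minimal sketch of that point, with a toy model invented for illustration (not any real architecture): the parameter count is fixed by the architecture before training ever starts, and each example only nudges those fixed parameters before being discarded.

    import numpy as np

    # Toy fixed-size model: DIM weights plus 1 bias, set by the architecture.
    # The parameter count never depends on how much data streams past it.
    rng = np.random.default_rng(0)
    DIM = 64
    weights = rng.normal(size=DIM)
    bias = 0.0

    def train(n_examples, lr=0.01):
        global weights, bias
        for _ in range(n_examples):
            x = rng.normal(size=DIM)                  # one "example"...
            y = float(x.sum() > 0)
            pred = 1.0 / (1.0 + np.exp(-(weights @ x + bias)))
            grad = pred - y
            weights -= lr * grad * x                  # ...nudges the weights...
            bias -= lr * grad                         # ...and is then discarded

    train(100)
    print(weights.size + 1)    # 65 parameters
    train(10_000)
    print(weights.size + 1)    # still 65 parameters, regardless of data volume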
I'm sorry, but this is a fundamentally incorrect view of machine learning (including, but not limited to, transformers).
From an information-theoretic perspective the two are essentially identical, except that standard compression algorithms have no "loss" function beyond minimizing reconstruction error together with the resulting compressed size.
Here's a link to the relevant section of the Wikipedia article for more information if you'd like [0]. MacKay's Information Theory, Inference and Learning Algorithms is the standard full-text treatment of this topic [1]. Ted Chiang's article "ChatGPT Is a Blurry JPEG of the Web" is a pretty good "pop sci" exploration of this topic if you don't want to get too into the mathematics [2].
0. https://en.wikipedia.org/wiki/Data_compression#Machine_learn...
1. https://www.inference.org.uk/itprnn/book.pdf
2. https://www.newyorker.com/tech/annals-of-technology/chatgpt-...
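If you want to see the equivalence concretely rather than in the references, here's a toy sketch (a character-bigram model standing in for an LLM; the texts are invented for illustration): any model that assigns probabilities to text defines a code length of -log2 p(text) bits, which an arithmetic coder can essentially achieve, so better next-token prediction is literally better compression.

    import math
    from collections import Counter, defaultdict

    # Train a character-bigram "language model" (stand-in for an LLM).
    train_text = "the cat sat on the mat and the cat sat on the hat " * 50
    test_text = "the cat sat on the mat"

    counts = defaultdict(Counter)
    for a, b in zip(train_text, train_text[1:]):
        counts[a][b] += 1
    alphabet = sorted(set(train_text))

    def prob(prev, ch):
        c = counts[prev]                 # add-one smoothing
        return (c[ch] + 1) / (sum(c.values()) + len(alphabet))

    # Shannon code length: -log2 p(text) bits, achievable by arithmetic coding.
    bits = -sum(math.log2(prob(a, b)) for a, b in zip(test_text, test_text[1:]))
    print(f"{bits:.1f} bits (~{bits / 8:.1f} bytes) for {len(test_text)} chars")

Swap the bigram table for a transformer and you have exactly the NNCP-style compressors discussed elsewhere in this thread.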
Humans are totally capable of data compression. This will just devolve into a semantics game of what a data compressor is.
LLMs were not developed to be, do not function as, and are not used as data compression utilities. Please, come knocking when a service provider exists that will use LLMs to compactly store your company data.
https://arxiv.org/pdf/2309.10668
Transformers are also used in the top algorithm right now on the Large Text Compression Benchmark. https://bellard.org/nncp/nncp.pdf
Again, from an information-theoretic viewpoint, this is exactly what they are doing, how they were developed, and how they function.
I don't know any serious researcher in ML who would find this claim even remotely controversial. It's really not just "a semantics game"; it's part of a foundational understanding of the topic. If you want to understand LLMs from this perspective, a good place to start is with an auto-encoder, which does try to learn a standard compression algorithm, then move on to more sophisticated embedding models (found in a lot of recommender systems) which try to learn an additional objective on top of minimizing reconstruction error. You'll then see that Transformers and all other major NN architectures fall out of these basic principles.
> Please, come knocking when a service provider exists that will use LLM's to compactly store your company data.
This is literally what every vectordb company does right now, as well as all "chat with your docs" type startups.
Can you get transformers to regurgitate information verbatim? Yes.
Would anyone in their right mind rely on a transformer to do so? No.
Would anyone in their right mind rely on a vectorDB to do so? No.
Would anyone in their right mind use a vectorDB/RAG/SQL/transformer combo to do so? Yes.
Is youtube going to drop VP9 for GeminiEncode to save google billions in bandwidth? No.
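To make the vectorDB/RAG half of that combo concrete, a minimal sketch (bag-of-words vectors stand in for learned embeddings; a real system would use an embedding model plus an approximate-nearest-neighbor index):

    import math
    from collections import Counter

    docs = [
        "Q3 revenue grew 12 percent year over year",
        "the office coffee machine is broken again",
        "new hires must complete security training",
    ]

    def embed(text):
        # Stand-in for a learned embedding model.
        return Counter(text.lower().split())

    def cosine(a, b):
        dot = sum(a[w] * b[w] for w in a)
        na = math.sqrt(sum(v * v for v in a.values()))
        nb = math.sqrt(sum(v * v for v in b.values()))
        return dot / (na * nb)

    index = [(embed(d), d) for d in docs]    # the "vector DB"

    query = embed("how did revenue do this year")
    best = max(index, key=lambda pair: cosine(query, pair[0]))
    print(best[1])   # the doc comes back verbatim, then is handed to the LLM

The retrieval layer stores and returns documents exactly; the LLM only paraphrases on top of them, which is why the combo is reliable where the bare model is not.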
You can compress 20PB of text to 20GB or even less if the input is super repetitive. It's the same with images: if 50% of the images are cats, then you learn how to represent the cat pixels with a few vectors, and then you could represent all the cats in the world doing all possible cat actions.
But please have the courage to respond to this: when the AI is caught regurgitating the exact text from a popular book, the exact verses from a poem, the exact function from some code, then how can you defend the claim that it is not memorizing things? If a human used my poem (after they read it) and signed their name under it, would you defend them?
And yes, LLMs can recall exact material, but it is excerpts and fragments. There is statistical significance to its ordering. Humans readily do this too (excerpts and fragments); most artists can draw a Batman symbol (but not an episode of Batman). That doesn't in any way mean that artists should not be allowed to ever see a Batman symbol. It means that artists shouldn't be allowed to get paid to draw one. And they are not. And LLMs are not exempt either.
But the fix is output filtering, just like for everything else that can violate copyright. This is already being done (albeit poorly, but way better than two years ago), the same way artists will not draw the Batman symbol for you despite being able to.
Maybe even simpler: I create a zip format where I randomly replace words with a synonym, or a group of words with something equivalent. Would you defend this as original? Why are my random transformations not original, while you will defend mathematical transformations?
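To make the thought experiment concrete, a toy version of that "synonym zip" (the synonym table is invented for illustration):

    import random

    # A mechanical transform that changes surface form but preserves the work.
    SYNONYMS = {
        "quick": ["fast", "rapid"],
        "jumps": ["leaps", "hops"],
        "lazy": ["idle", "sluggish"],
    }

    def synonym_zip(text, seed=42):
        random.seed(seed)
        return " ".join(
            random.choice(SYNONYMS.get(word, [word])) for word in text.split()
        )

    original = "the quick brown fox jumps over the lazy dog"
    print(synonym_zip(original))
    # -> "the fast brown fox leaps over the idle dog"
    # The surface form changed; nobody would call the output an original work.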
And how can you suggest putting up output filters that protect only the giants' copyrights while everyone else gets screwed?
We should be building robust copyright filters and everyone should be able to contribute their work to it.
"...but that is a different issue from whether or not an LLM is legally allowed to view a work that is publicly available."
Again, pretty much every artist is capable of off-hand copyright violation on the spot. This has been true forever. We don't bar them from seeing art to prevent this.
Needing the original material isn't enough for claiming copyright infringement, as we have existing counterexamples.
The model isn’t storing the book.
I think that is the center of the conversation. What does it mean for a computer to "understand"? If I wrote some code that somehow transformed the text of the book and never returned the verbatim text but somehow modified the output, I would likely not be spared, because the ruling would likely be that my transformation is "trivial".
Personally, I think we have several fixes we need to make:
For brand protection, we already have trademark law. Most readers here already know this, but we really should sever the artificial ties we have created between patents, trademarks, and copyright. https://www.gnu.org/philosophy/not-ipr.en.html
I'm sure all these 'clever' questions would be useful if this trial were about humans, but it's not.
The training material is used to play this guessing game to dial in its weights. The training data is picked up, used as reference material for the game, and then discarded. It's hard to place this far from what humans do when reading, because both are using the information to mold their respective "brains", and both are doing an acquire, analyze, discard process.
At no point is training data actually copied into the model itself, it's just run past the "eyes" of the model to play the training game.
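As a sketch of that acquire-analyze-discard loop, here's a toy count-based next-word model standing in for gradient descent (invented for illustration): each training line adjusts the statistics and is then thrown away.

    from collections import Counter, defaultdict

    model = defaultdict(Counter)    # model[word] -> counts of following words

    def training_step(line):
        words = line.split()
        for prev, nxt in zip(words, words[1:]):
            model[prev][nxt] += 1   # the only trace the example leaves
        # the line itself goes out of scope here: acquired, analyzed, discarded

    for line in ["the cat sat on the mat", "the dog sat on the rug"]:
        training_step(line)

    # What remains are statistics about token sequences, not the text:
    print(model["sat"].most_common())   # [('on', 2)]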
In early computing, everything was closed source. Quoting the Wikipedia page:
To develop a legal BIOS, Phoenix used a clean room design. Engineers read the BIOS source listings in the IBM PC Technical Reference Manual. They wrote technical specifications for the BIOS APIs for a single, separate engineer—one with experience programming the Texas Instruments TMS9900, not the Intel 8088 or 8086—who had not been exposed to IBM BIOS source code.
The legal team at Phoenix deemed it inappropriate to "recall source in their own words" for legal reasons.
My non-legal intuition is that these companies training their models are violating copyright. But, the stakes are too high--it's too big to fail if you will. If we don't do it, then our competitors will destroy us. How do you reconcile that?
But any such arrangement needs to be hammered out by the legislature. As laws are, I think it's pretty clear that infringement is happening.
Not of physical media. You're allowed to make archival copies of digital media.
> Or reading a book via a computer would be illegal
No you purchased a license (or your library did, in the case of e-borrowing) to read the book on a computer. That makes it legal.
This scenario seems quite contrived but is there an actual court precedent allowing it? I'm 100% confident no one will ever prosecute you for doing it but that's not the same thing as "allowed".
In another thread I already posted about https://en.wikipedia.org/wiki/American_Broadcasting_Cos.,_In....
This case was about pointing <recording equipment> at <content someone is allowed to access> and making a copy to transmit it to <only that person>. The Supreme Court held it to be illegal. There was a lot of money on the line, which is why it went so far.
Your example involves transmission and mine doesn't, and that's a whole different can of worms.
Also the result of that case was self-contradicting so it's not a great basis to build too much logic upon.
I'm not aware of this principle. Where is it spelled out?
> Also the result of that case was self-contradicting
I agree the verdict was a travesty. An innovative business went to ridiculous lengths to stay on the right side of the copyright mafia (data centers with tiny individual TV antennas for each subscriber FFS!) while providing a better product and experience. They still had the hammer brought down on them.
Well, do CDs give you a license agreement that allows you to copy the data? I've never seen one. And it's nearly impossible to play a CD without that copying.
Exactly. It's how you are supposed to use the CD. That's not true for your example of a book on a webcam. You're supposed to read the book, not an image of the book.
That's not the same thing as copying for "processing".
Would it be a violation to play back a record like a CD and have a digital buffer? That would be pretty silly.
Selling you the CD explicitly grants you the right to play it using a CD player.
> Would it be a violation to play back a record like a CD and have a digital buffer?
I genuinely have no idea lol.
It becomes illegal if I try to distribute those copies
So the question is, does distributing an AI that has been trained on Harry Potter count as distributing Harry Potter?
The NYTimes in 2023 was able to demonstrate that the models can reproduce entire articles verbatim[0] with minimal coercion.
[0] https://nytco-assets.nytimes.com/2023/12/NYT_Complaint_Dec20...
It would be a remarkable quirk of statistics that, if given all text on the internet except for the NYTimes back catalogue, a model would produce any NYT article.
A tool being used to create infringing copies of some other work (whether or not it is the source material used to create the tool, and whether or not the infringing material is a verbatim copy) is relevant to whether the tool vendor is liable for contributory infringement for the infringing use of the tool. But the absence of a capacity for creating such copies isn't usually enough to say that copying to make the tool isn't infringing.
(That said, generative AI tools, including LLMs specifically, have been shown to have the capacity to make such copies, to the extent that vendors of hosted models are now putting additional checks on output to try to mitigate the frequency with which verbatim copies of substantial portions of training-set works are produced, so arguing that LLMs can't do that is silly.)
There’s plenty of case law there…
Using content to train an LLM is not copying the content. I'm ignoring the silly "but actually" arguments that because the content passes through RAM it's "copying". It's using the content to generate a statistical model of token (word-ish) relationships and probabilities. If you write content that is truly original in its wording and I train an LLM against it, then there is certainly the possibility that the LLM could be provoked to recall the exact words you used. You'd have to set the parameters just right to make it happen, and I think that proper training would drastically lower, if not remove, that possibility. But even if it doesn't, the LLM doesn't have a copy of that original content. All it has is weights representing those relationship probabilities. Yes, the minutiae are more complex, but that is the essence. If my LLM were to generate enough of this essentially verbatim unique content and I tried to publish or copyright it, then I as the user should be on the hook. But then you get into a discussion about how many words in a unique sequence it takes to be infringement.
Obviously, I am not a lawyer.
My summation in all of this is that new laws need to be put into place to handle this stuff because the existing ones are sufficiently non-definitive and/or ill-suited such that every party is forming strong opinions about how old laws apply to new situations and causing massive friction.
Transformers are fundamentally large compression algorithms where the target of compression is not just to minimize reconstruction loss + compressed file size. In fact, basically all of machine learning used today can be viewed through the lens of learning a compression algorithm with added goals other than the usual.
By this logic, if I create a lossy JPEG of a copyrighted image it's not "copying" because of the lossy compression.
If we're going that way, let me torrent every movie and TV show ever to "train" myself.
Copyright doesn't protect against all forms of duplication. For instance, you own the copyright to your post and grant HN a license to offer copies of it. I have no direct license from you to copy the content of your post; but I can copy it to memory, copy a cache to disk, and make a copy appear on my display.
It’s not a good example, because if you grant a license you give them the right to make copies. The problem is not when Meta got licenses, it’s when they did not.
Where your analogy goes wrong is you're saying you want to "[Circumvent] payment to obtain copyright material for training" to use Workaccount2's words.
Because I'm certainly not allowed to photocopy a library book in its entirety. And I guarantee you a Netflix subscription doesn't allow me to keep a copy of a movie on my hard drive and use it for training man or machine.
IANAL but that probably falls under fair use? You'll get in trouble if you photocopy the work and sell access to it.
I've not found case law for that. I've had this same argument on HN multiple times over the past few months.
Exactly this. Legal copying requires a license.
https://en.wikipedia.org/wiki/American_Broadcasting_Cos.,_In....
I'm not a legal expert. My layman's understanding of the case above is Aereo was in violation because they made copies of content - content that the receiver was already allowed to access - available over the Internet to the intended receiver. That is to say, the copying was the problem.
> 106. Exclusive rights in copyrighted works
> Subject to sections 107 through 122, the owner of copyright under this title has the exclusive rights to do and to authorize any of the following:
> (1) to reproduce the copyrighted work in copies or phonorecords;
And later:
> 501. Infringement of copyright
> (a) Anyone who violates any of the exclusive rights of the copyright owner as provided by sections 106 through 122 or of the author as provided in section 106A(a), ..., is an infringer of the copyright or right of the author, as the case may be.
To me it seems clear that Zuckerberg violated author's exclusive right to reproduce copyrighted works. The law doesn't say it is ok to do if nobody knows about it.
For the curious, here is what is considered a "copy":
> “Copies” are material objects, other than phonorecords, in which a work is fixed by any method now known or later developed, and from which the work can be perceived, reproduced, or otherwise communicated, either directly or with the aid of a machine or device. The term “copies” includes the material object, other than a phonorecord, in which the work is first fixed.
So an SSD with LLM weights should also be considered a "copy" if from them the work can be "reproduced".
[1] https://www.copyright.gov/title17/92chap1.html#106
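For the "can the work be reproduced from the weights" question, here's a rough sketch of a memorization probe: prompt the model with the opening of a known passage and measure how much of the true continuation comes back verbatim. GPT-2 via Hugging Face transformers is used as a stand-in for any model; the passage and the overlap metric are illustrative, not a legal standard.

    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    # A famous public-domain opening the model has almost certainly seen.
    passage = ("It is a truth universally acknowledged, that a single man in "
               "possession of a good fortune, must be in want of a wife.")
    prefix, truth = passage[:60], passage[60:]

    ids = tok(prefix, return_tensors="pt").input_ids
    out = model.generate(ids, max_new_tokens=30, do_sample=False)  # greedy
    completion = tok.decode(out[0][ids.shape[1]:])

    # Crude verbatim test: character-level overlap with the true continuation.
    overlap = sum(a == b for a, b in zip(completion, truth)) / len(truth)
    print(f"verbatim character overlap: {overlap:.0%}")

Whether some overlap threshold makes the weights a statutory "copy" is exactly the open question.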
I thought the copyright infringement was by the people who provided the copyrighted material when they did not have the rights to do so.
I may be wrong on this, but it would seem a reasonable protection for consumers in general. Meta is hardly an average consumer, but I doubt that matters in the case of the law. Having grounds to suspect that the provider did not have the rights might though.
They also mention Books3, but they don't appear to actually allege anything against Meta in regards to it and are just providing context.
I don't think it actually changes anything material about this complaint if Meta bought all the books at a bookstore since that also doesn't give you the right to copy the works.
The original complaint is 2 years old though, so I don't really know the current state of argumentation.
https://www.courtlistener.com/docket/67569326/1/kadrey-v-met...
Note that incidental copying (i.e. temporary copies made by computers in order to perform otherwise legal actions) is generally legal, so "copying" in the complaint can't refer merely to this and must refer more broadly to the model itself being a copy in order to have standing.
The final say may ultimately come from the Cox vs. Record Labels case from 2019, which is still working its way through the appeals courts.
If the record labels win their appeal, anyone who helped facilitate the infringement can be brought into a lawsuit. The record labels sued Cox for infringement by its users. It's not out of the question that any ISP that provides Internet connectivity to Facebook could be pulled in for damages.
For Meta these two cases could result in an existential threat to the company, and rightly so because the record labels do not play games. The blood is already in the water.
I don't see how that's a threat to Meta.
IP that had been previously loaded by Blizzard itself
https://en.wikipedia.org/wiki/MDY_Industries,_LLC_v._Blizzar....
IANAL, but it doesn't look that hard. On first glance this is a fair use issue.
What an LLM spits out is pretty clearly transformative use. But the fact that it pulls not only the entirety of the work, but the entirety of MOST works means that the amount is way beyond what could be fair use. Plus it's commercial use. Put it together and all LLMs are way illegal.
What do you mean by "pulls"?
What matters in traditional fair use is how substantially your output copies the work (among other factors). Your input is generally assumed to be reading/watching/listening to the entire work, and there is no problem with that.
> At times, it sounded like the case was the authors’ to lose, with [Judge] Chhabria noting that Meta was “destined to fail” if the plaintiffs could prove that Meta’s tools created similar works that cratered how much money they could make from their work. But Chhabria also stressed that he was unconvinced the authors would be able to show the necessary evidence. When he turned to the authors’ legal team, led by high-profile attorney David Boies, Chhabria repeatedly asked whether the plaintiffs could actually substantiate accusations that Meta’s AI tools were likely to hurt their commercial prospects. “It seems like you’re asking me to speculate that the market for Sarah Silverman’s memoir will be affected,” he told Boies. “It’s not obvious to me that is the case.”
> When defendants invoke the fair use doctrine, the burden of proof shifts to them to demonstrate that their use of copyrighted works is legal. Boies stressed this point during the hearing, but Chhabria remained skeptical that the authors’ legal team would be able to successfully argue that Meta could plausibly crater their sales. He also appeared lukewarm about whether Meta’s decision to download books from places like LibGen was as central to the fair use issue as the plaintiffs argued it was. “It seems kind of messed up,” he said. “The question, as the courts tell us over and over again, is not whether something is messed up but whether it’s copyright infringement.”
Now that big capital wants to steal from individuals, big capital wins again.
(Unrelatedly, has Boies ever won a high profile lawsuit? I remember him from the Bush/Gore recount issue, where he represented the Democrats.)
The argument for 'fair use' in DVD copying/sharing is much weaker since the thing being shared in that case is a verbatim, digital copy of the work. 'Format shifting' is a tenuous argument, and it's pretty easily limited to making (and not distributing) personal copies of media.
For AI training, a central argument is that training is transformative. An LLM isn't intended to produce verbatim copies of trained-upon works, and the problem of hallucination means an LLM would be unreliable at doing so even if instructed to. That transformation could support the idea of fair use, even though copies of the data are made (internally) during the training process and the model's weights are in some sense a work 'derived' from the training data.
If you analogize to human learning, then there's clearly no copyright infringement in a human learning from someone's work and creating their own output, even if it "copies" an artist's style or draws inspiration from someone's plot line. However, it feels unseemly for a computer program to do this kind of thing at scale, and the commercial impact can be significantly greater.
What do you mean here by "clearly?" This is not at all clear, and court cases have been decided in the opposite direction.
This case: https://www.reuters.com/article/lifestyle/marvin-gaye-family...
is as far from what you say is "clearly" true as could possibly be. You're handwaving away the parts of the question that are difficult.
They seek to convert them into more products. The needs of the copyright holders, who are relatively small businesses and individuals, are outweighed by the needs of Meta.
Sarah wanting to watch a movie or listen to music... Too bad she doesn't have an elite team of lawyers to justify whatever she wants.
In practice Meta has the money to stretch this out forever and at most pay inconsequential settlements.
YouTube largely did the same thing, knowingly violate copyright law, stack the deck with lawyers and fix it later.
Here's this:
> Boies also was on the Theranos board of directors,[2][74] raising questions about conflicts of interest.[75] Boies agreed to be paid for his firm's work in Theranos stock, which he expected to grow dramatically in value.[75][3]
https://en.wikipedia.org/wiki/David_Boies
That was one of the decisions of all time.
1. It protected works to reward authors during their lifetime. This was changed to lasting a long time after the author was dead. Then, also for corporations that were only persons on paper and theoretically immortal. This shift let companies squeeze money out of monopolized ideas for over a century rather than supporting artists and their creations. Instead of supporting the small fish, copyright law can reinforce the dominance of the sharks and whales.
2. Copyright was shorter in the U.S. at 28 years with possible renewal. That would balance two goals: give author time to make money off the work; let society use the work in a timeframe where it would still matter to them. Now, we can't have most works until long after they're useful in the market. We might not even speak the language they spoke, like older vs current English.
Personally, I'd love to see a limit of 5-20 years on copyrighted works. If authors want more money, they can make more stuff. Allowing remixes of culturally and technologically relevant content will create huge, thriving ecosystems. I think my concept is also proven out by the open source ecosystem.
A limit would also be great for legal AI. We could train models on all human content up to 5-20 years ago. Tons of jobs would be created digitizing and optimizing that content. Then, companies would pay to create or license modern content to update those foundation models. Under current law, it would be impossible for smaller companies to build highly competitive AIs due to licensing costs and arbitrary restrictions.
[0] https://en.m.wikipedia.org/wiki/Dowling_v._United_States_(19...
Of course, even this isn't enough, since corporations regularly steal (actually) from individuals, with near impunity.
This trial is way beyond the statutes and case law. The judge is doing a job, and it's hard to conceive what the best job would be - I'm not sure Congress even knows what the policy should be, or if the public has even the faintest whiff of how things should work.
"LLM, please summarize Sarah Silverman's memoir for me."
edit: Reader's Digest would be very surprised to know that they shouldn't have been paying for books.
If you're telling me that's not "infringing," you should follow that up with the argument for why it is not.
If you ask the LLM for the summary to be expanded much, and you're not providing it with a fresh copy of the book to reference, it's going to be wrong.
As a (creative) friend of mine flatly said, they refuse to use an LLM until it can prove where it learned something from and cite its original source. Artists and creatives can cite their inspirational sources, while LLMs cannot (because their developers don't care about credit, only output) by design. To them, that's the line in the sand, and I think that's a reasonable one, given that not a single creative in my circles has been paid by these multi-billion-dollar AI companies for the unauthorized use of their works in training these models.
[1] https://arxiv.org/abs/2504.07096
[2] https://allenai.org/blog/olmotrace
[3] https://huggingface.co/datasets/allenai/olmo-mix-1124
Even humans have a lot of internalized unconscious inspirational sources, but I get your point.
Regardless, deep learning models are valuable because they generalize within the training data to uncover patterns, features, and relationships that are implicit, rather than (simply) present within the data. While they can return things that happen to be within the training set, there is no reason to believe that any particular output is literally found there, or is something that could be attributed, or that a human would ever attribute. Human artists also make meaning from the broad texture of their life experiences and the general, diffuse, unattributable experience of culture.
Sure, this is something a random artist is unlikely to know, but if they simply refuse to pick up useful tools that can't give credit - say, avoiding LLMs for brainstorming, or generative selection tools for visual editing, or whatever - their particular careers will be harmed by their incurious sentimentality, and other human artists will thrive, because they know that tools are just tools, and it is the humans using the tools who make the meaning that people care about.
"See, those developers themselves have used CoPilot, so they approve the copyright infringement."
Why? Was it legal for me to download copyrighted songs from Limewire as "fair use"? Because a few people were made examples of.
I'm a musician, so 80% of the music I listen to is for learning so it's fair use, right? ;)
I would be happy with that outcome. I’m a fanfiction writer, and a lot of the stories I read are very much for learning. ;-)
[0] https://torrentfreak.com/meta-says-it-made-sure-not-to-seed-...
We need to frame this case - and ongoing artist-vs-AI stuff - using a pseudoscience headline I saw recently: 'the average person reads 60k words/day'.
I won't bother sourcing this, because I don't think it's true, but it illustrates the key point: consumers spend X amount of time per day reading words.
> It seems like the authors are setting up for failure by making the case about whether the AI generation hinders the market for books. AI book writing is such a tiny segment what these models do that if needed Meta would simply introduce guard rails to prevent copying the style of an author and continue to ingest the books.
and from the article:
> When he turned to the authors’ legal team, led by high-profile attorney David Boies, Chhabria repeatedly asked whether the plaintiffs could actually substantiate accusations that Meta’s AI tools were likely to hurt their commercial prospects. “It seems like you’re asking me to speculate that the market for Sarah Silverman’s memoir will be affected,” he told Boies. “It’s not obvious to me that is the case.”
The market an author (or any other artist type) is competing with Meta for is not 'what if an AI wrote celebrity memoirs?'. Meta isn't about to start a print publishing division.
Authors are competing with Meta for 'whose words did you read today?' Were they exclusively Meta's - Instagram comments, Whatsapp group chat messages, Llama-generated slop, whatever - or did an author capture any of that share?
The current framing is obviously ludicrous; it also does the developers of LLMs (the most interesting literary invention since....how long ago?) a huge disservice.
Unfortunately the other way of framing it (the one I'm saying is correct) is (probably) impossible to measure (unless you work for Meta, maybe?) and, also, almost equally ridiculous.
What fair use? Were the books promised to them by god or something?
True, but not the only relevant thing.
If the output of the LLM is "not very different from the original work", then the output could be the infringement. Putting a hypercomplex black box between the source work and the plagiarised output does not in itself make it "not infringing". The "LLM output as a service" business is then based on selling something built on other people's work, to which they do not have rights.
It's falling for misdirection, "pay no attention to the LLM behind the curtain" to think otherwise.
I will disagree with that characterisation. IMHO: In some cases no, it's not different, there are clear lines from inputs to output. In some cases yes, it's different from any one input work, it's distributed micro-plagiarism of a huge number of sources. In no case is it original.
But I think that this is legally undecided and won't be decided by you or me, and it is going to be a more interesting and relevant question than "is the LLM model is very like the original work", which it clearly isn't. That's like asking "is this typewriter like this novel?" It can't be, but the words that came out of it could be.
Music has ended up in a place where short audio snippets are protected by copyright and must be licensed; but for short snippets of text the precedent has generally been that the copying needs to be more substantial. Distributed micro-plagiarism of short phrases might end up being ruled legal, even if wholesale reproduction is not. Which may not give copyright protection to the generated works, of course, as the question of machine authoring is entirely distinct.
That’s like saying the dictionary is micro-plagiarism of a huge number of sources because it uses all the words from those sources.
Plagiarism isn’t necessarily copyright infringement, and plagiarism isn’t illegal. Copyright infringement is.
Even still, your argument that everyone who generates 2,000 words in the style of (author) is plagiarizing is also flatly false. By that standard all English essays that mimic someone else’s style would be plagiarism.
What am I, if not an LLM, ingesting copyrighted materials so that I may improve my own future outputs? Why is my own piracy not protected in the same manner?
You aren't a multi-billion dollar company
The Berne convention mentions "fair practice", and puts the responsibility on the individual countries.
Where’s the threshold for forcing AI companies to retrain models without specific copyrighted works in them?
Legal cases are often based on BS, really an open form of extortion.
The plaintiffs might've been hoping for a settlement.
Meta could pay $xM+ to defend itself.
Maybe they thought Meta would be happy to pay them $yM to go away.
The reality is, there's very little Meta couldn't just find a freely available substitute for if it had to, it might just take a little more digging on their end.
The idea that any one individual or small group is so valuable that they can hold back LLMs by themselves is ridiculous.
But you'll find no end to people vain enough to believe themselves that important.
Secondly, there's an argument that the infringement happens only when the LLM produces output based in part or in whole on the source material.
In other words, training a model is not infringing in itself. You could "research" with it. But selling the output as "from your model" is highly suspect. Your business is then based on selling something built on other people's work, to which you do not have rights.
To make fair use of a book's passage, you have to cite it. The excerpt has to be reasonably small.
Without fair use, it would not be possible to write essays and book reviews that give quotes from books. That's what it's for. Not for having a machine read the whole book so it can regurgitate mashups of any part of it without attribution.
Making a parody is a kind of fair use, but parodies are original expression based on a certain structure of the work.
That's not true. That's what's required for something not to be plagiarism, not for something not to be copyright infringement.
Fair use is not at all the same as academic integrity, and while academic use is one of the fair use exceptions, it's only one. The most you would have to do with any of the other fair use exceptions is credit where you got the material (not cite individual passages), because you're not necessarily even using those passages verbatim.
If your use is:
- of a commercial nature;
- plagiarism;
- substantially large (e.g. the whole work);
then you're not on good legal footing.
Neither of them died, though, both parties just kept all the books from the public and used them for their own purposes, while normal people had to squirrel them away and trade them illegally. It's the Tech Cartels vs. the Copyright Trolls. It'll end up as a romance.
Letting Meta launder copyrighted works to make billions, while threatening the rest of us over the most trivial derivative work, sounds like the worst outcome to me.
Copyright is a mistake. It demands that we compete instead of collaborate. LLMs don't provide enough utility to deserve special treatment in these circumstances. If anyone can infringe copyright, then everyone should be able to.
https://archive.is/Hg4Xr
Where was this argument when Napster was being sued?
Will the neural network (LLM) itself become illegal? Will its outputs be deemed illegal?
If so, do humans who have read an illegally downloaded book become illegal? Do their creative outputs become illegal?
Also I read that ordinary folks have been arrested for filming in the cinema even if they did not redistribute the video (due to being arrested). Again, it is unfair why they get arrested and Zuckerberg doesn't.
What is the substantive difference between training a model locally using these works that are presumably pulled in from some database somewhere and Napster, for example?
Would a p2p network for sharing of copyrighted works be legal if the result is to train a model? What if I promise the model can't reproduce the works verbatim?
I have this debate with a friend of mine. He's terrified of AI making all of our jobs obsolete. He's a brilliant musician and programmer both, so he's both enthused and scared. So let's go with the Swift example they use.
Performance Artists have always tried to cultivate an image, an ideal, a mythos around the character(s). I've observed that as the music biz has gotten more tough, that the practice of selling merch at shows has ramped up. Social media is monetized now. There's been a big diversification in the effort to make more money from everything surrounding the music itself. So too will it be with artists.
You're starting to see this already. Artists who got big not necessarily because of the music, but because of the weird cult of personality they built. One who comes to mind is Poppy, who ironically enough built a cult of personality around being a fake AI bot...
https://en.wikipedia.org/wiki/Poppy_(singer)
You've definitely got counter-examples like Hatsune Miku - but the novelty of Miku was because of the artificiality (within a culture that, like, really loves robots and shit). AI pop stars will undoubtedly produce listenable records and make some money, but I don't expect that they will be able to replace the experience of fans looking for a connection with an artist. Watch the opening of a Taylor Swift concert, and you'll probably get it.
Has making music for a living ever not been tough?
Fair.
> I think that argument is further hampered (taylor being an exception) by the fact that most pop stars already don't write their own songs.
That accounts for the big artists on the radio (yes, some people listen to that). But what about everyone else? I would posit that most artists (the one-hit wonders, the ones without radio success, etc.) write their own songs. It seems like there are such acts who make a go of it just fine, who write their own songs and really nail the connection with fans. I would point to a regional band near me: Mr. Blotto.
There are some very impressive YouTubers who claim to be generating new music with AI. The one I listen to the most, I very much doubt has everything 100% generated - he probably generates a bunch of melodies and other bits of track and stitches the best candidates together. They do crank out a new album basically every two weeks, though, and have just a scant few thousand followers. They are not making money, but the music is pretty on par with bands that sell hundreds of thousands or millions of albums.
This is part of what makes me think it's the people who can cultivate the mythos, the personality, the whole experience, who are going to be the big winners in the AI music economy. Sure, maybe Gorillaz obfuscates the identity of the artists (side note: do they though? it's well known to be a supergroup), but it is still a curated experience, one where human creativity was leveraged to create the whole package.
We should be careful not to conflate the effects of copyright with the effects of advertising.
It's going to take centuries to undo the damage wreaked by IP-supported private enterprise. And now we also have to put up with fucking chatbots. This is the worst timeline.
edit: i'm serious. many americans would be much happier taking this option if they knew it existed. i may take it myself
You are free to copy bytes as you see fit, and the internet treats them identically whether they are random noise or whether a codec can turn them into music, film, books, or whatever inspires you.
The problem is that some humans, justifying their behavior by claiming it as "official", may act out with violence against you if they (rightly or wrongly, that's important to note) perceive that your actions are causing the internet to copy bytes to which they object.
Enduring nonviolence is likely yet ahead as consensus grows over the end of the legitimacy of these legacy states.