(1) Google has been slowly killing the web for 15 years with a completely unfair situation. See http://www.seobook.com/blog for a blog that has been telling this story... for 15 years.
To me, getting worried about it now is like somebody in the UK today getting worried about the American Revolution, or being shocked that Kennedy got shot, or a Jets fan being excited that they won the Super Bowl in 1969. If anything, LLMs are more democratic in that you can get them to work for you, but you'll never get to touch (short of being a shareholder) the money that Google makes from adtech.
(2) With the growth of reasoning models, I'm not so sure that LLMs really need copyrighted data. If you were going to train a model with 100x as much data, it would have to be synthetic anyway. It will still have a blind spot in the "dark ages" of 1930-2001, between the public domain and Creative Commons, but that blind spot affects everything.