Reddit blocks Internet Archive to end sneaky AI scraping (arstechnica.com)

Japan's copyright laws are extremely favorable to rights holders. My understanding is that there is no fair use without explicit permission, so any reproduction or modified work is tolerated only as long as the holder doesn't request a takedown.

beepbooptheory · 1h ago

From tfa:

> Japan’s copyright law allows AI developers to train models on copyrighted material without permission. This leeway is a direct result of a 2018 amendment to Japan’s Copyright Act, meant to encourage AI development in the country’s tech sector. The law does not, however, allow for wholesale reproduction of those works, or for AI developers to distribute copies in a way that will “unreasonably prejudice the interests of the copyright owner.”

Alex4386 · 21m ago

tl;dr: If you are not directly affecting the "sales" of the product, you are good to go. But it seems Perplexity did, and is (as they might put it) directly trying to compete as a news source.

Personally, about their news service: their news summarization is kind of misleading, with AI hallucinations in some places.

SilverElfin · 57m ago

I don’t understand why corporations can violate copyright laws at hyper scale but individuals are banned from small scale piracy through authoritarian internet governance.

ranyume · 2m ago

It's because people are allowing corporations, the elite, and governments to do as they please. In this hedonistic, shallow era nobody wants to sacrifice themselves for a cause, except in some rare cases like Luigi.

mlinhares · 15m ago

The law only exists for those without enough money and influence to control the enforcers.

ants_everywhere · 2h ago

If they are copying and pasting news articles on their site, that's a pretty straightforward copyright case I would think.

In the US at least this should be pretty well covered by the case law on news aggregators.

totetsu · 1h ago

The Japan Newspaper Publishers & Editors Association is very active in lobbying on this issue: https://www.pressnet.or.jp/english/

ujkhsjkdhf234 · 2h ago

Before someone mentions Japan effectively making all data fair use for AI training: Japan specifically forbids direct recreation, which is what this lawsuit is about.

aspenmayer · 3h ago

Original title edited for length:

> Japan’s largest newspaper, Yomiuri Shimbun, sues AI startup Perplexity for copyright violations

charcircuit · 1h ago

It's best not to crawl Japanese newspapers. Japan does not have the same kind of fair use. Even reproducing facts from a newspaper can be infringing.

ronsor · 2h ago

I don't know why Perplexity in particular gets everyone in a snit. It's not even particularly special: a user inputs a query, an AI model does a web search and fetches some pages on the user's behalf, and then it serves the result to the user.

Putting aside that other products, such as OpenAI's ChatGPT and modern Google Search, have the same "AI-powered web search" functionality, I can't see how this is meaningfully different from a user doing a web search and pasting a bunch of webpages into an LLM chat box.

> But what about ad revenue?

The user could be using an ad blocker. If they're using Perplexity at all, they probably already are. There's no requirement for a user agent to render ads.

> But robots.txt!!!11

`robots.txt` is for recursive, fully automated requests. If a request is made on behalf of a user, through direct user interaction, then it may not be followed and IMO shouldn't be followed. If you really want to block a user agent, it's up to you to figure out how to serve a 403.
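To make that distinction concrete, here is a minimal sketch (not anything taken from Perplexity or the article). It assumes a made-up crawler name, `ExampleBot`, a made-up set of rules, and a hypothetical server-side blocklist. The first function is the voluntary check a well-behaved crawler runs against `robots.txt` using Python's standard `urllib.robotparser`; the second is the kind of decision a server would have to make itself if it actually wants to refuse a user agent with a 403.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules; "ExampleBot" is a made-up crawler name.
ROBOTS_TXT = """\
User-agent: ExampleBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def crawler_may_fetch(user_agent: str, url: str) -> bool:
    """Voluntary check: a polite crawler asks this before fetching a URL."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical blocklist of User-Agent substrings (lowercase).
BLOCKED_AGENT_SUBSTRINGS = {"examplebot"}


def status_for_request(user_agent_header: str) -> int:
    """Enforcement: the HTTP status a server could send for this User-Agent."""
    ua = user_agent_header.lower()
    if any(blocked in ua for blocked in BLOCKED_AGENT_SUBSTRINGS):
        return 403  # refused by the server, whatever the client intended
    return 200
```

Nothing forces a client to call the first function, and a client that spoofs its User-Agent header would also slip past the second, so this only illustrates where each check lives, not a reliable defense.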

> It's breaking copyright by reproducing my content!

Yes, so does the user's browser. The purpose of a user agent is to fetch and display content how the user wants. The manner in which that is done is irrelevant.

Alex4386 · 17m ago

Well, some bots even spoof User-Agents and send tons of requests without proper rate limiting (looking at you, ByteSpider).

People weren't playing fair even before LLMs, which is why we now get PoW challenges everywhere.
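For anyone unfamiliar with the idea, a PoW (proof-of-work) challenge can be sketched roughly like this. It is a generic hashcash-style illustration, not the scheme any particular site or tool uses; the `challenge` string format and `difficulty_bits` parameter are assumptions made for the example.

```python
import hashlib
from itertools import count


def _hash_value(challenge: str, nonce: int) -> int:
    """SHA-256 of "challenge:nonce", interpreted as a big-endian integer."""
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
    return int.from_bytes(digest, "big")


def verify(challenge: str, nonce: int, difficulty_bits: int) -> bool:
    """Server side: one hash checks that the top difficulty_bits are zero."""
    return _hash_value(challenge, nonce) < (1 << (256 - difficulty_bits))


def solve(challenge: str, difficulty_bits: int) -> int:
    """Client side: brute-force the smallest nonce that passes verify()."""
    for nonce in count():
        if verify(challenge, nonce, difficulty_bits):
            return nonce
```

The asymmetry is the point: the client needs about 2^difficulty_bits hashes on average, the server needs one, so a single page view is cheap while bulk crawling gets expensive.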

And what kind of conclusion is that? Since ad blockers are used everywhere, it's OK for corporations to skip licensing the content and just yank it into a curation service? I really don't think that argument is made in good faith.

Also, calling what the browser itself does "reproducing"? Yes, the data might be copied in memory (though I wouldn't call that reproducing the material; it's more like a transfer from the server to the client), but redistribution is the main point here.

It's like saying, "part of a variable gets copied from the L2 cache into a register, so reproducing the whole file in DRAM must be authorized too." The kind of "reproduction" you're pointing to can't be prevented in the first place unless you bring in non-Turing computers that don't use active memory.

jaredwiener · 41m ago

There's a difference between what is technically feasible and what is allowed, legally or even morally.

Just because it is possible -- or even easy -- to essentially steal from newspapers/other media outlets, doesn't make it right, or legal. The people behind it put in labor, financial resources, and time to create a product that, like almost every other service, has terms attached -- and those usually come with some form of monetization. Maybe it is a paywall, maybe it is advertisements -- but it is there.

Using an adblocker, or finding some loophole around a paywall, etc, are all very easy to do technically, as any reader of this site knows. That said, the media outlet doesn't have to allow it. And when it is violated on an industrial scale, as with Perplexity, then they can be understandably upset and take legal action. And that includes any AI (or other technology, for that matter) that is a wrapper around plagiarism.

Sites opted in to Google originally because it fed them traffic. They most likely did not opt in to an AI rewriter that takes their work and republishes it without any compensation.