X changes its terms to bar training of AI models using its content

122 bundie 114 6/5/2025, 4:52:10 PM techcrunch.com ↗

Comments (114)

zombot · 4h ago

Right, stealing training data from others is OK, having it stolen from you is not. What else is new?

ivape · 1m ago

X/Twitter has became extremely prohibitive with just about everything since Elon took over. Their API pricing was antagonistic toward even indie developers. Elon is not a generous guy.

keyle · 3h ago

New logo every couple of years and Bob's your uncle.

threeseed · 3h ago

Almost certainly the easter egg found in the Trump "Big Beautiful Bill" which prevents states from enacting AI regulations also came from Musk.

That way he can continue to steal from others and lock competitors out whilst being comfortable knowing that no laws will be enacted to prevent it.

labster · 1h ago

Yep, Musk saying he’s going to fund primary campaigns against congressmembers who vote for the Big Beautiful Bill is all just a brilliant bit of reverse psychology.

Or more likely, Congress is super worried about Roko’s Basilisk.

tetris11 · 1h ago

That's a wild reference!

https://en.wikipedia.org/wiki/Roko's_basilisk

> Roko's basilisk is a thought experiment which states there could be an otherwise benevolent artificial superintelligence (AI) in the future that would punish anyone who knew of its potential existence but did not directly contribute to its advancement or development, in order to incentivize said advancement.

stuaxo · 55m ago

And some of the CEOs of LLM companies seem to believe in it, and that "AGI" will come from their LLM work - both of which are utterly insane points of view.

ilyagr · 16m ago

An intelligence that reasons this way would be, in human terms, batshit insane and completely immoral. So, it seems unlikely that many or maybe any humans would experience it as "otherwise benign" if it had power over their lives.

And if we do get an all-powerful dictator, we will be screwed regardless of whether their governing intelligence is artificial or composed of a group of humans or of one human (with, say, powerful AIs serving them faithfully, or access to some other technology).

BoxOfRain · 54m ago

It's Pascal's Wager with a sci-fi reskin, and all the objections that go along with that.

mgoetzke · 3h ago

why do you think he is so evil but all others are benign ?

littlestymaar · 2h ago

None of them are benign. He's the only one to have been in a government office though, and he's also batshit crazy, which makes him even more dangerous than the other oligarchs.

HenryBemis · 52m ago

He is not "batshit crazy", or maybe he is. But he is making the next generation of ICBMs for the US government, sorry.. he is making super-duper rockets that will definitely take people to Mars and his companies/creations will be the very first tech ever to _not_ be used for war and death!!! (he wrote while laughing). So that settles it (all).

thih9 · 1h ago

I think the rules should be stricter.

I’d prefer an explicit opt in from the content author being required for anyone to perform any model training with any given data.

Alternatively, require all weights, prompts and chat logs to have the same visibility as the original datasets.

None of this is going to happen and current decisions about uncopyrightable ai[1] are already good; but still, it feels like there is room for abuse.

[1]: https://en.m.wikipedia.org/wiki/Th%C3%A9%C3%A2tre_D%27op%C3%...

visarga · 3h ago

Copyright is not going well. The rights of millions of people are trampled by companies, both the content we post on social networks and our private AI chats. Our voice doesn't matter.

Copyright was supposed to protect expression and keep ideas freely circulating. But now it protects abstractions (see the Abstraction-Filtration-Comparison test). It is much more difficult to be sure you are not infringing.

eviks · 14m ago

It seems like it was supposed to do the exact opposite per cursory wiki reading:

> The concept of copyright first developed in England. In reaction to the printing of "scandalous books and pamphlets", the English Parliament passed the Licensing of the Press Act 1662,[16] which required all intended publications to be registered with the government-approved Stationers' Company, giving the Stationers the right to regulate what material could be printed.[20]

> The Statute of Anne, enacted in 1710 in England and Scotland, provided the first legislation to protect copyrights (but not authors' rights)

lesuorac · 15h ago

Who's training an AI on the "Tweet" button text?

Or are they trying to forgo section 230 protection and claim ownership of content uploaded to the site?

lambertsimnel · 3h ago

Perhaps they want the prohibition on using the site content for AI training to be considered based on something other than their ownership of it, like bandwidth usage or users' rights

HenryBemis · 50m ago

They will get paid to share our (your) data and they will use the money for infra and new yachts.

lambertsimnel · 20m ago

Indeed, but I'm speculating that they do that without owning the data or even claiming to. That's consistent with the article, but I haven't read the other relevant documents. Maybe they have a license to use the data. Maybe the license allows or requires them to try to restrict others' AI training, regardless of their non-ownership of it. Maybe that serves multiple purposes, in which case they could point to whichever shows them in the best light.

cameldrv · 15h ago

Naturally I'm sure Grok reads the terms of service on every website it scrapes and doesn't use content from sites that prohibit it.

Animats · 15h ago

It would be interesting to have a "classical AI model", trained on the contents of the Harvard libraries before 1926 and now out of copyright.

gausswho · 15h ago

It does surprise me that we haven't seen nations revise their copyright window back to something sensible in a play to seed their own nascent AI industry. The American founding fathers thought 20 years was enough. I'm sure there'd be repercussions in the banking system, but at some point it might be worth the trade.

blibble · 15h ago

they can't

a 50 year minimum is part of the berne convention, which itself is as close to a universal law as humanity has

(even North Korea is a signatory)

loudmax · 13h ago

The current US copyright duration is 70 years after the life of the author. This is absolutely bonkers. 50 years from publication would be a significant improvement.

50 years ago was 1975. If copyright were limited to 50 years, we'd be looking at all of the Beatles works being in the public domain. We'd be midway though Led Zeppelin, and a lot of the best work from Pink Floyd and the Rolling Stones.

Also, Superman, Batman, and Spider-Man. Disney would still profit from the MCU films which they produced in the 2010's, but they couldn't stop you from releasing your own Batman vs Spider-Man story.

The Harry Potter books would still belong to JK Rowling, but the Narnia stories would be available for all.

The Godfather 1 and 2 would be in the public domain, as would be original Star Trek TV show, and we'd be coming up on Star Wars pretty soon.

If there were no copyright protection, these works wouldn't have been created. It is good that Paul McCartney and George Lucas and JK Rowling have profited from their creative output. It would be okay if they only profited for the first 50 years. Nobody is counting on revenue over half a century in the future when they create a work of art today.

This is our culture. It should belong to all of us.

jfim · 12h ago

> Disney would still profit from the MCU films which they produced in the 2010's, but they couldn't stop you from releasing your own Batman vs Spider-Man story.

Wouldn't they still have a trademark on those characters though?

ncallaway · 4h ago

The trademark on characters is related to selling goods, if the character is used as a way of identifying an authentic seller.

So, if Disney is using mickey mouse on t-shirts to identify it as a Disney manufactured t-shirt, you wouldn't be allowed to use mickey mouse on t-shirts in a similar fashion in a way that might cause consumer confusion about who manufactured the t-shirt.

If Wolverine was in the public domain, then they couldn't use a Wolverine trademark to stop you from selling a Wolverine comic book. However, if they used a _specific_ Wolverine mark to identify it as a Disney Wolverine book, then you'd be restricted from using that.

Basically, trademark exists to prevent consumer confusion about who is the creator that is selling a good.

tpxl · 4h ago

> If there were no copyright protection, these works wouldn't have been created.

Citation needed. You can freely copy and distribute linux and it still got made.

simiones · 20m ago

I think Linus Torvalds has been very explicit that he believes the GPL has been critical to the success of Linux - specifically, the copyright-enforced obligation to contribute back any modifications you make. In a world without copyright, companies would be free to make their own modifications and keep them secret, making it more or less impossible to integrate them into a cohesive whole the way they are more or less forced to do today.

lmm · 4h ago

Linux is generally a functional tool, and struggles with overall coherence. There are far fewer success stories of artworks being made in this style. (E.g. there are successful multiplayer open-source games or clones of existing games, but very few original single-player games, and those that there are are largely the work of a single individual)

mattkevan · 2h ago

Most of the classic Disney films are based on public domain stories.

If there were copyright, those works wouldn’t have been created.

pastage · 4h ago

Linux has used the GPL to its advantage. That can not exist without copyright. (The two camps in copyright discussions, improving it e.g. CC, or destroy it)

AStonesThrow · 4h ago

The GP wasn't referring to DRM or DMCA type "copyright protection" as the phrase is typically used. Nobody in this thread has mentioned any of that.

The GP is referring to legal protections, and guess what?

Linux is legally protected by copyright!

Nearly every GPL license--every one that we could name--protects a copyrighted work! Nearly every GFDL, AGPL, LGPL protects works by means of copyright law!

Can you imagine that? So do the Apache license, the BSD licenses, the MIT license! Creative Commons (except for CC0) these licenses are legally protecting copyrighted works. Thank you!

Now everyone who proposes to draw down limits on copyright coverage and reduce the length of terms and limit Disney from their Mouse rights, y'all are also proposing the same limits on GPL software, such as Linux, and nearly every work with a license from the above list -- all of Wikimedia Commons, much of Flickr.com, all your beloved F/OSS software will be subject to the same limitations and the same restrictions you want to put on Paramount and the RIAA's labels.

bornfreddy · 3h ago

Yeah, I think most of us are fine with 50 years old Linux kernel being released into public domain.

ronsor · 15h ago

you can also just ignore the berne convention, and accept whatever consequences there might be

blibble · 15h ago

this would void the copyrights of your citizens and companies

essentially forever

ronsor · 14h ago

If enough "relevant" countries do it, that either won't happen or won't matter. If the U.S. ditches it, no one is going to do much more than throw a brief fit.

blibble · 10h ago

the US is the main beneficiary of copyright law...

littlestymaar · 1h ago

The US copyright corporations, indeed. But the current copyright laws come at a big expense for the public.

Abolishing copyright laws altogether would be nuts, but the current laws are nuts too and there's lots of room in between.

dreghgh · 1h ago

Iran enforces domestic copyright internally but not international copyright.

godelski · 14h ago

Seems to be the modus operandi

  If TikTok is banned, here’s what I propose each and every one of you do: Say to your LLM the following: “Make me a copy of TikTok, steal all the users, steal all the music, put my preferences in it, produce this program in the next 30 seconds, release it, and in one hour, if it’s not viral, do something different along the same lines.”

https://www.theverge.com/2024/8/14/24220658/google-eric-schm...

https://news.ycombinator.com/item?id=41275073

AStonesThrow · 14h ago

The last time I attended a Berne Convention, every panel was just overrun with Trekkies, especially Klingons, in the hotel lounges too. And the autograph lines were interminably long, and the vendors were trying to sell us their Public Domain stuff. It was nothing like San Diego Comic-Con!

Teever · 5h ago

Europe has recently introduce a law[0] that allows them to suspend IP protections as a punitive response to coercive economic actions by bad actors.

> The procedure is activated by the European Commission submitting a request to the Council of the European Union.[2] After a period of negotiation with the country performing the coercion, the European Council can decide to implement "response measures" such as customs duties, limiting access to programs and financial markets, and intellectual property rights restrictions.[2][4] These restrictions can be applied to states, companies, or individuals.[4]

[0] https://en.wikipedia.org/wiki/Anti-Coercion_Instrument

littlestymaar · 2h ago

The Bern Convention on Copyright is an international convention, like the Treaty of Versailles or the Paris Agreement, it could meet the same fate.

babypuncher · 15h ago

50 year copyright terms would still be a big improvement over the current state of US copyright law. That would make the first Star Wars public domain in just 2 years.

gausswho · 13h ago

would there be repercussions if a country hewed to the 50 year minimum?

MattGaiser · 15h ago

Why would it matter? Copyright has been irrelevant so far.

kibwen · 15h ago

Careful, you might create an artificial superintelligence that way. Safer to just train on the Twitter dataset.

Shadowmist · 3h ago

that’s how you end up with an Artificial Idiot.

mbg721 · 15h ago

If you thought AI now had out-of-control racism...

carlio · 14h ago

It'd look like this: https://www.smbc-comics.com/comic/copyright

nickpsecurity · 15h ago

I wish someone would update and use PG19 for 7-30B+ model:

https://github.com/google-deepmind/pg19

That gives us a model that's 100% open and reproducible with low, legal risk. It would also be a nice test of how much AI's generalize from or repeat behavior in their pretraining data.

Then, a new model using that, The Stack, and FreeLaw's stuff (by paying them to open source it). No Github Issues or anything with questionable licenses or terms of service violations. That could be the next baseline for lawful models with coding ability, too. Research in coding AI's might use it.

murph-almighty · 14h ago

I've similarly wondered if I could get a pre-2024 Wikipedia if just for the "fact based" flavor LLM

landl0rd · 4h ago

Do you think Wikipedia starting in '24 was polluted by AI slop? This is certainly possible, I'm just not aware of it happening.

Wikipedia periodically publishes database dumps and the Internet Archive stores old versions: https://archive.org/search?query=subject%3A%22enwiki%22%20AN...

Plus you could also grab the latest and just read the 12/31/23 revisions.

malinens · 5h ago

What happened to wikipedia in 2024?

blibble · 15h ago

wish I could change my terms to bar training of AI models on my content

unstablediffusi · 15h ago

if that is any consolation, no one gives a shit about xitter's ToS either. it will continue to be scrapped by every major player.

Capricorn2481 · 2h ago

How exactly is it being scraped? My understanding is Twitter and LinkedIn are both huge pains in the ass to scrape right now.

No comments yet

vouaobrasil · 15h ago

Same here! It should be a default. Unfortunately, the very openness of the internet is now working against us.

soulofmischief · 15h ago

Why should it be a default? Can you prove that training a model on data you wrote is not fair use?

We're already seeing precedent that it might be.

https://www.ecjlaw.com/ecj-blog/kadrey-v-meta-the-first-majo...

The openness of the internet is a good thing, but it doesn't come without a cost. And the moment we have to pay that cost, we don't get to suddenly go, "well, openness turned out to be a mistake, let's close it all up and create a regulatory, bureaucratic nightmare". This is the tradeoff. Freedom for me, and thee.

shakna · 2h ago

Yeah, I don't think downloading my paid-for books, from an illegal sharing site, to scrape and make use of, is in any way fair use.

From the decision in 1841, in the US (Folsom vs Marsh):

> reviewer may fairly cite largely from the original work, if his design be really and truly to use the passages for the purposes of fair and reasonable criticism. On the other hand, it is as clear, that if he thus cites the most important parts of the work, with a view, not to criticize, but to supersede the use of the original work, and substitute the review for it, such a use will be deemed in law a piracy

Further, to be "transformative", it is required that the new work is for a new purpose. It has to be done in such a way that it basically is not competing with the original at all.

Using my creative works, to create creative works, is rather clearly an act of piracy. And the methods engaged, to enable to do so, are also clearly piracy.

Where would training a model here, possibly be fair use?

baseballdork · 15h ago

The burden is on the user to show that it is fair use, no? Not everyone else's responsibility to prove that it's _not_ fair use.

soulofmischief · 14h ago

It is definitely the responsibility of anyone suing someone who trained a model on copyrighted data to prove that it isn't fair use, they have to show how it violated law, and while it's in the best interest of those organizations to make things easier for the court by showing why it is fair use, they are technically innocent until proven guilty.

Accordingly, anyone on the internet who wants to make comments about how they should be able to prevent others from training models on their data needs to demonstrate competence with respect to copyright by explaining why it's not fair use, as currently it is undecided in law and not something we can just take for granted.

Otherwise, such commenters should probably just let the courts work this one out or campaign for a different set of protection laws, as copyright may not be sufficient for the kind of control they are asking over random developers or organizations who want to train a statistical model on public data.

lmm · 4h ago

> It is definitely the responsibility of anyone suing someone who trained a model on copyrighted data to prove that it isn't fair use, they have to show how it violated law, and while it's in the best interest of those organizations to make things easier for the court by showing why it is fair use, they are technically innocent until proven guilty.

No, fair use is an affirmative defense for conduct that would otherwise be infringing. The onus is on the defendant to show that their use was fair.

SAI_Peregrinus · 14h ago

You've got it backwards. It's on the defendant to prove that their use is fair. The plaintiff has to prove that they actually own the copyright, and that it covers the work they're claiming was infringed, and may try to refute any fair-use arguments the defense raises, but if the defense doesn't raise any then the use won't be found fair.

soulofmischief · 12h ago

It's true that the process is copyright strike/lawsuit -> appeal, but like I said, it's in their best interests to just prove that it's fair use because otherwise the judge might not properly consider all facts, only hear one side of the story and thus make a bad judgement about whether or not it is fair use. If anything, I'm just being pedantic, but we do ultimately agree here I think.

petesergeant · 2h ago

> It is definitely the responsibility of anyone suing someone who trained a model on copyrighted data to prove that it isn't fair use

Morally, perhaps, but not under US law: https://en.wikipedia.org/wiki/Affirmative_defense#Fair_use

bamboozled · 26m ago

This guy is just painful

michaelcampbell · 16h ago

"its content" indeed.

lcnmrn · 2h ago

I allow all robots and even provide a sitemap on Subreply, a social network I created.

kyle-rb · 14h ago

I've never signed up for the X developer program, so I'm not bound by these terms. But I did download an archive of my data last week. Do I have implicit permission to use that data (~150k liked tweets) to train AI models?

Or is there stuff in the user agreement that separately prohibits this?

Obviously barring normal copyright law which is still up in the air.

josefritzishere · 14h ago

If you live in the EU, GDPR dictates that you own your data generally speaking. If you're in the US it varies by state if you have any rights at all.

MoonGhost · 14h ago

If you own your face that doesn't mean nobody can take a picture on the street.

matwood · 16h ago

Weird this just happened. I assumed all sites with any sort of content changed their terms soon after ChatGPT hit the scene.

nailer · 15h ago

Yep, from https://the-decoder.com/reddit-ends-its-role-as-a-free-ai-tr... :

You must not, and must not allow those acting on your behalf to:

...use the Data APIs to encourage or promote illegal activity or violation of third party rights (including using User Content to train a machine learning or AI model without the express permission of rightsholders in the applicable User Content);

soulofmischief · 15h ago

In my eyes that is considered fair use, and I think the courts will come to agree unless they are financially incentivized to look the other way and thus create a moat for existing players at the expense of newcomers.

seydor · 4h ago

VAT for content should be a thing. Ultimately all users should be getting paid

guywithahat · 4h ago

So I get to use the platform for free, but I also get paid to post on the platform? I'm not sure that makes sense. Like I hate to take the side of big tech, but they can't literally be paying users to use their platform. Just use something else, there are a million social media sites

MonkeyClub · 3h ago

> I get to use the platform for free

You actually get to generate content for the platform for free.

Without you (all of the X users), the platform would be devoid of content, just botspeak and corporate promos.

Plus, as the sibling mentioned, they monetize your visit through ads (and data use).

jaoane · 3h ago

Most posts are ignored and are an absolute loss to the company. Which is why platforms like Twitter only allow you to make money from posting once you reach a certain threshold.

seydor · 4h ago

Google indexes your website for free, and it will pay you to put ads in it.

That's also what all social media do , they put ads on your thoughts. They dont even need to index your thoughts because you submit them directly. It has nothing to do with being free, it's about incentives. Users are so foolish , they give everything for free, unlike webmasters.

Reason077 · 4h ago

You don’t use the platform for free, unless you’re using an ad blocker. But that’s also, probably, against the TOS?

threeseed · 3h ago

We really need LLMs for music to become more advanced.

Then maybe the recording companies will start defending artist rights.

Because not sure what all the other industry bodies are doing.

mk_stjames · 2h ago

I wanted to do some quick math on this idea- supposed we trained a vanilla transformer model from scratch, as GPT2/GPT3 was done- the number of seen input tokens is known perfectly, as is the sources of those training tokens (since then, everyone has either kept quiet about the sources post-Books3-fiasco, or have been finetuning on top of previous models making this more difficult of a calculation)

GPT-3 was trained on approximately 300 billion tokens. An small sized technical textbook might contain something like... 130,000 tokens? (1 token ~= 0.75 words, ~100k words in the book).

Thus, say you wrote a textbook on quantum mechanics that was included in the training corpus. A naive computation of the fraction of your textbook's contribution to the total number of training tokens would be 300B/130K = 0.0000004333333333, or 0.000043%.

If our hypothetical AI company here reported, say $500M in yearly profit, if all of that was distributed 100% based on our naive training token ratio (notice I say naive because it isn't as simple to say that every training token contributes equally to the final weights of a model. That is part of the magic.) then $500M * 0.000043% = $215.

You could imagine a simpler world where it was required by law that any such profitable company redistribute, say, %20 (taking the 'anti-VAT' idea) back to the copyright holders / originators of the training tokens. So, our fictitious QM textbook author would receive a check in the mail for $43 for that year of $500M in revenue. Not great, but not zero.

Since then, training corpuses are much, much larger, and most people's contributions would be much smaller. Someone who writes witty tweets? Maybe 1/100th the length of our above example in am model with now 100x the training corpus.

So fractions of a penny for your tweets. Maybe that is fitting after all...

petesergeant · 2h ago

The only story here is that it took 2 months for them to do this after being "bought" by xAI.

delichon · 16h ago

> “You shall not and you shall not attempt to (or allow others to) […] use the X API or X Content to fine-tune or train a foundation or frontier model,” it reads.

If I have a service where a user enters any URL, like a tweet from X, and the service translates it, then if the user approves of the translation I train a translation model on that, does that violate this term?

yandie · 16h ago

Per my experience with GenAI legal teams, that’s a no go.

It’s not been tested in court though

dyauspitr · 5h ago

If you don’t want an LLM to view it don’t put it on the public internet.

ronsor · 15h ago

I'm not sure how this will work as crawlers don't read or accept ToS.

voidUpdate · 54m ago

This refers to the API, which you would have to manually attach a bot to so that it could scrape things

MoonGhost · 14h ago

It will not as long as search engines have access. Which means Google and OpenAI through MS Bing, that's at least.

Without search engines what the point in posting it on open net if nobody can find.

archagon · 11h ago

Oh, that must be nice. And what should I do as a blogger to get the same privilege for my content?

We are in an age of corporate “piracy for me, but not for thee.”

MonkeyClub · 3h ago

> We are in an age of corporate “piracy for me, but not for thee.”

Rather, we are back to that age of state- (now corporate-) backed privateering.

echelon · 16h ago

If an artist or author can't do this, social media shouldn't be able to do it either.

If Xai wants to train on public corpus, it shouldn't be allowed to prevent its own corpus from being used.

We need regulations to limit the power grabs. Train all you like, but don't dare try to constrain to your walled gardens.

We should also probably nip the "foundation model company / also a social media company" conglomeration in the bud.

teeray · 16h ago

> If an artist or author can't do this, social media shouldn't be able to do it either.

Even if this is done, the case of starving artist v. megacorp will probably go to whoever wields the most money and lawyers. To add insult to injury, the artist’s opponent is fueled by their ill-gotten gains.

yndoendo · 15h ago

This is dependent on country. USA, yes with their draconian methods. Countries like the UK, the looser of the suit pays all the cost. UK layers have no problem taking low wealth client cases they know will win. UK allows for David vs Goliath and David to win. US up lifts Goliath as a God.

bonoboTP · 15h ago

Also in many countries legal costs are just generally lower than in the US.

mgraczyk · 16h ago

Artists can do this, and they do

loudmax · 15h ago

Yes, but do artists have the ability to actually monitor and enforce this? You have to have the capacity and the wherewithal and to test these models to even know that your data is being ingested into AI.

Big companies like the New York Times and Twitter/X have the funds to pay for this. Miscellaneous artists probably don't.

jimbokun · 15h ago

If social media can do this, an artist or author should be able to do it, too.

vouaobrasil · 15h ago

Social media should do it to set a legal precedent.

> We need regulations to limit the power grabs. Train all you like, but don't dare try to constrain to your walled gardens.

No, no one should train, period.

echelon · 14h ago

> No, no one should train, period.

I get that you have your own opinion, but I'm personally tired of living in the butter-churning era and would prefer that this all went a bit faster.

I want my real time super high fidelity holo sim, all of my chores to be automatically done, protein folding, drug discovery. The life extension, P = NP future. No more incrementalism.

If the universe only happens once, and we're only awake for a geological blink of an eye, I'd rather we have an exciting time than just be some paper-pushing animals that pay taxes and vanish in a blip.

I'd be really excited if we found intelligent aliens, had advanced cloning for organ transplants and longevity, developed a colony on Mars, and invented our robotic successor species. Xbox and whatever most normal people look forward to on a day to day basis are boring.

vouaobrasil · 12h ago

There is already a beautiful, exciting world out there full of animals and plants and we don't need AI or some computer crap to experience it. The problem is, creating all this AI and advanced technology is directly crushing that world.

risyachka · 2h ago

Good luck with that. Pretty sure at this point no one cares.

Literally every AI model is trained on copyrighted etc data. And without any consequences.

add-sub-mul-div · 15h ago

How useful is low-quality content like Youtube comments and tweets anyway? Is it a common/important use case to generate tweet-length, tweet-quality content? Are most use cases of generating tweet-type content spam/fraud? Would a model be better off if it was unable to perform those use cases?

redox99 · 15h ago

Even if SNR is low, there is some information that only exists on X, or at least is the primary source. Just look at how many submissions on HN are X posts.

add-sub-mul-div · 15h ago

Before Musk bought it Twitter was broadly disliked here and there were regularly calls in the comments to disallow submissions from there. Given how it's degraded in completely non-partisan ways (blocking of alternative clients, features removed from free tier, paid subscription tiers below $40/month still have ads, proliferation of spam from paid placement bots in comments) I can't understand how positive sentiment comes from a place other than virtue signaling alignment with Musk and his values.

vouaobrasil · 15h ago

There needs to be a worldwide standard, such as an HTML tag, that says "no training". And a few countries need to make it a punishable offense to violate the tag. The punishment should be exceptionally severe, not just a fine. For example: any company that violates the tag should be completely barred from operating, forever.

kiratp · 15h ago

That will play out exactly like the "Do not track" bit did.

vouaobrasil · 15h ago

Perhaps we should try anyway, in case you are wrong.

anigbrowl · 15h ago

That will just lead to situations where one company scrapes the site, cleans the content of tags, and sells the data, and another does the training on the precleaned data. The first one hasn't trained and the second one never saw the tag.

vharuck · 15h ago

This isn't a new concept in law. It's similar to buying goods that were stolen or procured through illegal means. Here's the US law that applies when it happens across state lines:

https://www.law.cornell.edu/uscode/text/18/2315

Note that it requires the defendant to know the goods were illegally taken. Can be hard to prove, but not impossible for companies with email trails. The fun question is, what will the analog be for the government confiscating the illegally "taken" data? A guarantee of deletion and requirement to retrain the model from scratch?

vouaobrasil · 15h ago

Companies who are found guilty of this should also be rendered bankrupt then.

twostorytower · 15h ago

It needs to be incorporated into the robots.txt standard.

logicchains · 15h ago

>There needs to be a worldwide standard, such as an HTML tag, that says "no training"

Any country that seriously implemented this would just end up being completely dominated by the autonomous robot soldiers of another country that didn't, because it effectively bans the development of embodied AGI (which can learn live from seeing/reading something, like a human can).

Effectiveness Beats Accuracy (dynomight.net)

The Sage, the Kid, and the Psychopath (articles.akatski.com)

MimeTypeCore – All the MIME/file extension pairs you will ever need (github.com)

Plans you're not supposed to talk about (dynomight.net)

HPE Uses AI to Drive the Business, Which Is Increasingly AI (nextplatform.com)

Endangered classic Mac plastic color returns as 3D-printer filament (arstechnica.com)

Show HN: Camus – The World's First Truly Useless AI Agent (camus.im)

Magic Namerefs (gist.github.com)

Commanding Your Claude Code Army (steipete.me)

More than a hundred backdoored malware repos traced to single GitHub user (theregister.com)

WaveGuessr – GeoGuessr for Waves (waveguessr.com)

The Qwen3 Embedding Model (huggingface.co)

Z/OS Metal I/O – Making Developers' Lives Better (makingdeveloperslivesbetter.wordpress.com)

Have LLMs Mastered Geolocation? (bellingcat.com)

A Proposed Mechanism for Me/CFS Invoking Macrophage FcγRI and Interferon Gamma (qeios.com)

Freight rail fueled a new luxury overnight train startup (freightwaves.com)

The bromance is over – no one will miss it (German) (surplusmagazin.de)

FL Woman Fined $165K for Trivial Code Violations Takes Case to FL Supreme Court (reason.com)

Using Generative AI to Create a Digital Doppelgänger (rishimodha.substack.com)

From Endeavouros to Pop!_OS (aumont.fr)

AI Agent Friday Finds Post, 2025-06-06 (sebgnotes.substack.com)

A Sketch of Reversible Deterministic Concurrency for Distributed Protocols (replica-io.dev)

Faster remainder by multiplication, with applications to compilers and software (arxiv.org)

Ask HN: Anyone else feeling increasingly alienated from the industry?

What do you all think of the latest Apple paper on LLM capabilities? [pdf] (ml-site.cdn-apple.com)

Want to Create Professional Charts Fast? Try the Free AI Graph Maker (aigraphmaker.net)

Tesseral: Open-source auth infrastructure for B2B SaaS (tesseral.com)

Algebra Unveils Deep Learning – An Invitation to Neuroalgebraic Geometry (arxiv.org)

I've Soured on Go (nickblow.tech)

Falsehoods Programmers Believe About Aviation (flightaware.engineering)

Dual RTX 5060 Ti 16GB vs. RTX 3090 for Local LLMs (hardware-corner.net)

A Programming System (2023) (andreyor.st)

The People's Republic of iPhone (newstatesman.com)

xAI co-founder Christian Szegedy joins Morph (twitter.com)

Is Datastream MongoDB to BigQuery a solid offering?

Cognee+Memgraph: How to Build Intelligent Knowledge Graph Using Hacker News Data (memgraph.com)

Pornhub pulls out of France over age verification law (bbc.com)

Apps shouldn't let users enter OpenSSL cipher-suite strings (00f.net)

Show HN: A scriptable text editor for LLMs (github.com)

How Anthropic Uses Claude Code (twitter.com)

George Legrady – Artist Exploring AI Aesthetics (2025) [video] (vimeo.com)

Captured a Summer Vacation on a PowerBook (jasoneckert.github.io)

Global Building Atlas (github.com)

Show HN: Find your next cybersecurity job (cyber-security.careers)

The golden era of flying is now (theupwing.com)

Ask HN: What are your fav/goto decision making hacks/heuristics?

Crayfish Plague (en.wikipedia.org)

Silicon Valley wants to help me make a superbaby. Should I let it? (sfstandard.com)

Freedesktop team member closes all open xserver merge requests (gitlab.freedesktop.org)

I accidentally hacked a hotel switchboard (myit.substack.com)

X changes its terms to bar training of AI models using its content

Comments (114)