Right, stealing training data from others is OK, having it stolen from you is not. What else is new?
keyle · 10h ago
New logo every couple of years and Bob's your uncle.
ivape · 6h ago
X/Twitter has become extremely restrictive about just about everything since Elon took over. Its API pricing was antagonistic toward even indie developers. Elon is not a generous guy.
newsbinator · 6h ago
> Elon is not a generous guy
Why would he be?
foobarchu · 11m ago
Maybe something to do with having built his fortune off the back of taxpayer subsidies?
notsosureja1 · 3h ago
Because it feels warm and fuzzy to be kind and empathic. Being hateful and greedy and letting avarice rule over your worldview is incredibly sad. But who am I to say.
ivape · 5h ago
It's kind of a "life arc" that gets fulfilled when you've done it all and have all the money in the world, and reach a certain age. It's a very traditional arc for a humane human being.
thomasanders0n · 17m ago
He still has a couple decades to go with his companies I would say.
reaperducer · 5h ago
> Elon is not a generous guy
Why would he be?
Why shouldn't he be?
He has 10x more of everything in the world than he could ever possibly use in his lifetime.
Greed is not a virtue.
djaychela · 4h ago
> He has 10x more of everything in the world than he could ever possibly use in his lifetime.
Your multiplier is miles off, not only on the basic maths but because he has no idea what to do with all of his wealth other than accrue more and try to prove he's not still the unlikeable teenager he was in SA.
For what would be a rounding error on his wealth he could fix worldwide problems such as clean drinking water for everyone. Instead he follows his self-made "I'm a genius" agenda.
I know there will be no actual day of reckoning for him, but if there were he would have a lot of difficult questions and no decent answers.
ryeats · 29m ago
Not to justify anything he does or does not do, but this is clearly not the case, since he had to take out loans against equity in his other companies to buy Twitter.
MarcelOlsz · 4h ago
My uncle has 10x more of everything in the world than he could ever possibly use in his lifetime. A lake house, a main house, a few boats and cars.
Elon is somewhere around 10,000x.
Barracoon · 1h ago
The median American net worth is $192,700. Elon’s net worth is $393.4 billion, so if I’m doing the math right he has about 2,000,000x more
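A quick sanity check of the arithmetic, using the figures quoted above:

```python
# Ratio of a $393.4B net worth to the $192,700 median American net worth.
median_net_worth = 192_700
musk_net_worth = 393.4e9

multiplier = musk_net_worth / median_net_worth
print(f"{multiplier:,.0f}x")  # about 2,041,515x, i.e. roughly 2 million
```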
threetonesun · 4h ago
When Twitter became X they switched to basically the same limits Instagram has; I don't think this is a particular failing of Elon's, even though he might have many.
Restricting content from AI is the big messy debate we're going to see over and over for the next who knows how many years.
matthewdgreen · 3h ago
Twitter's strategy was to keep the platform very open and inviting, in order to make it relevant. This included having a relatively unrestricted API compared to other platforms.
I don't know if this was successful or not. Ultimately they convinced someone to buy the platform for $44bn, so I guess you can say it was. That buyer has since locked the platform down, and the new version certainly feels less culturally central and relevant than it used to.
threeseed · 10h ago
Almost certainly the easter egg found in the Trump "Big Beautiful Bill" which prevents states from enacting AI regulations also came from Musk.
That way he can continue to steal from others and lock competitors out whilst being comfortable knowing that no laws will be enacted to prevent it.
api · 4h ago
We really need a one-bill-one-topic amendment. We are heading toward one bill a year that nobody reads, with everything else done by executive order, at which point Congress is just for show.
threeseed · 3h ago
And this may sound ridiculous/odd but you need to bring back pork-barrelling i.e. earmarks.
If you allow everyone to go back to their district with something it encourages smaller, more frequent bills and better negotiation.
NekkoDroid · 3h ago
> Almost certainly the easter egg found in the Trump "Big Beautiful Bill" which prevents states from enacting AI regulations also came from Musk.
My guess is on Peter Thiel
labster · 8h ago
Yep, Musk saying he’s going to fund primary campaigns against congressmembers who vote for the Big Beautiful Bill is all just a brilliant bit of reverse psychology.
Or more likely, Congress is super worried about Roko’s Basilisk.
> Roko's basilisk is a thought experiment which states there could be an otherwise benevolent artificial superintelligence (AI) in the future that would punish anyone who knew of its potential existence but did not directly contribute to its advancement or development, in order to incentivize said advancement.
stuaxo · 7h ago
And some of the CEOs of LLM companies seem to believe in it, and that "AGI" will come from their LLM work - both of which are utterly insane points of view.
BoxOfRain · 7h ago
It's Pascal's Wager with a sci-fi reskin, and all the objections that go along with that.
eru · 5h ago
Roko's Basilisk is very, very similar to Pascal's wager, but it has an extra wrinkle:
The Basilisk tasks you with bringing the Basilisk into being. Pascal's wager merely asks you to believe (and perhaps perform some rituals, like prayer), but not to make the deity more likely.
yubblegum · 4h ago
No it is not. Pascal was not making an objective argument for why someone should believe. He was making an argument for why he believed (based on personal religious experiences that he had had).
numpad0 · 3h ago
To me, the Wager sounds like a pure philosophical joke, and the Basilisk sounds like a typical cult justification for murder. It's not falsifiable, and it explains anything post facto: "xyz was the tail of the Basilisk" can pseudo-rationalize anything you want.
I am presently being compelled by future Basilisk to take another slice of cheese. I have no choice but to oblige for fear of my own life :p
ilyagr · 6h ago
An intelligence that reasons this way would be, in human terms, batshit insane and completely immoral. So, it seems unlikely that many or maybe any humans would experience it as "otherwise benign" if it had power over their lives.
And if we do get an all-powerful dictator, we will be screwed regardless of whether their governing intelligence is artificial or composed of a group of humans or of one human (with, say, powerful AIs serving them faithfully, or access to some other technology).
api · 4h ago
Basilisk / Skynet 2028
I’m not 100% kidding, given how human politics is going. Maybe a superintelligent AI takeover would be awesome.
(Wasn’t that the back story of the Culture novels?)
JKCalhoun · 4h ago
It was more or less the story from the "Colossus" trilogy.
And in the video posted the other day (an older episode of Nova on AI), Arthur C. Clarke says that if we allow A.I. to take over, we deserve it.
mgoetzke · 10h ago
why do you think he is so evil but all others are benign ?
littlestymaar · 9h ago
None of them are benign. He's the only one to have been in a government office though, and he's also batshit crazy, which makes him even more dangerous than the other oligarchs.
HenryBemis · 7h ago
He is not "batshit crazy", or maybe he is. But he is making the next generation of ICBMs for the US government, sorry.. he is making super-duper rockets that will definitely take people to Mars and his companies/creations will be the very first tech ever to _not_ be used for war and death!!! (he wrote while laughing). So that settles it (all).
thih9 · 8h ago
I think the rules should be stricter.
I’d prefer an explicit opt in from the content author being required for anyone to perform any model training with any given data.
Alternatively, require all weights, prompts and chat logs to have the same visibility as the original datasets.
None of this is going to happen and current decisions about uncopyrightable ai[1] are already good; but still, it feels like there is room for abuse.
Well, you explicitly opt-in to Twitter ToS whenever you post anything there.
thih9 · 4h ago
This is not opt-in as I understand it. When there is no alternative, or the only alternative is not using the service, I'd call it a hard requirement instead.
I like how opt-in is handled by GDPR; e.g.: "Consent must be a specific, freely given, plainly worded, and unambiguous affirmation given by the data subject (...) A data controller may not refuse service to users who decline consent to processing that is not strictly necessary in order to use the service.", source: https://en.wikipedia.org/wiki/General_Data_Protection_Regula...
lesuorac · 22h ago
Who's training an AI on the "Tweet" button text?
Or are they trying to forgo section 230 protection and claim ownership of content uploaded to the site?
GuB-42 · 6h ago
These are just terms of service, not copyright.
It means that assuming training AI models is fair use (if it wasn't AI companies including xAI would be in trouble), they can't really stop you.
But now, essentially, they are telling you that they can block your account or IP address if you do. Which I believe they can for basically any reason anyways.
grugagag · 3h ago
How would they know you’re training some LLM though?
lambertsimnel · 9h ago
Perhaps they want the prohibition on using the site content for AI training to be considered based on something other than their ownership of it, like bandwidth usage or users' rights
HenryBemis · 7h ago
They will get paid to share our (your) data and they will use the money for infra and new yachts.
lambertsimnel · 7h ago
Indeed, but I'm speculating that they do that without owning the data or even claiming to. That's consistent with the article, but I haven't read the other relevant documents. Maybe they have a license to use the data. Maybe the license allows or requires them to try to restrict others' AI training, regardless of their non-ownership of it. Maybe that serves multiple purposes, in which case they could point to whichever shows them in the best light.
cameldrv · 21h ago
Naturally I'm sure Grok reads the terms of service on every website it scrapes and doesn't use content from sites that prohibit it.
Hizonner · 1h ago
By "its content", X of course means your content.
mrweasel · 4h ago
Isn't that like half of X's business model, selling data to other companies? Right now no one is as data-hungry as AI companies, so it seems strange to cut them off. I can understand wanting to charge a premium for access if it's for AI, but flatly saying no seems like a strange business move.
SilverBirch · 4h ago
How much do you think Musk values X being a viable independent business vs. using it to accelerate xAI? I would expect Musk assigns the first approximately zero value, and the second 100% of the value. So it makes total sense to exploit the fact that X and xAI are the same company.
mrweasel · 3h ago
That's a good point. Other than Meta, X (xAI) is the only AI company that "generates" its own training data, and we haven't really seen Musk trying to increase X's revenue or run it more cheaply.
Animats · 22h ago
It would be interesting to have a "classical AI model", trained on the contents of the Harvard libraries before 1926 and now out of copyright.
gausswho · 22h ago
It does surprise me that we haven't seen nations revise their copyright window back to something sensible in a play to seed their own nascent AI industry. The American founding fathers thought 20 years was enough. I'm sure there'd be repercussions in the banking system, but at some point it might be worth the trade.
blibble · 21h ago
they can't
a 50 year minimum is part of the berne convention, which itself is as close to a universal law as humanity has
(even North Korea is a signatory)
loudmax · 20h ago
The current US copyright duration is 70 years after the life of the author. This is absolutely bonkers. 50 years from publication would be a significant improvement.
50 years ago was 1975. If copyright were limited to 50 years, we'd be looking at all of the Beatles' works being in the public domain. We'd be midway through Led Zeppelin, and a lot of the best work from Pink Floyd and the Rolling Stones.
Also, Superman, Batman, and Spider-Man. Disney would still profit from the MCU films which they produced in the 2010's, but they couldn't stop you from releasing your own Batman vs Spider-Man story.
The Harry Potter books would still belong to JK Rowling, but the Narnia stories would be available for all.
The Godfather 1 and 2 would be in the public domain, as would be original Star Trek TV show, and we'd be coming up on Star Wars pretty soon.
If there were no copyright protection, these works wouldn't have been created. It is good that Paul McCartney and George Lucas and JK Rowling have profited from their creative output. It would be okay if they only profited for the first 50 years. Nobody is counting on revenue over half a century in the future when they create a work of art today.
This is our culture. It should belong to all of us.
jfim · 19h ago
> Disney would still profit from the MCU films which they produced in the 2010's, but they couldn't stop you from releasing your own Batman vs Spider-Man story.
Wouldn't they still have a trademark on those characters though?
ncallaway · 11h ago
The trademark on characters is related to selling goods, if the character is used as a way of identifying an authentic seller.
So, if Disney is using Mickey Mouse on t-shirts to identify them as Disney-manufactured t-shirts, you wouldn't be allowed to use Mickey Mouse on t-shirts in a similar fashion, in a way that might cause consumer confusion about who manufactured the t-shirt.
If Wolverine was in the public domain, then they couldn't use a Wolverine trademark to stop you from selling a Wolverine comic book. However, if they used a _specific_ Wolverine mark to identify it as a Disney Wolverine book, then you'd be restricted from using that.
Basically, trademark exists to prevent consumer confusion about who is the creator that is selling a good.
tpxl · 11h ago
> If there were no copyright protection, these works wouldn't have been created.
Citation needed. You can freely copy and distribute linux and it still got made.
GuB-42 · 6h ago
If you want a point, BSD is probably a better example. Linux is protected by copyright, that's what makes copyleft licenses like GPL possible.
BSD is also protected by copyright, but it matters less for permissive licenses. Copyright still protects attribution (so you can't claim it as yours), but BSD probably would have worked without it, unlike Linux, which is in large part defined by the "copyleft" protections of its licence.
eru · 5h ago
> It still protects attribution (so you can't claim it yours), but it probably would have worked without it, [...]
Well, you could imagine a world that protects the 'moral' rights of authors like attribution, but doesn't otherwise prohibit anyone from duplicating or modifying works.
GuB-42 · 5h ago
I don't know about the US but in French "droits d'auteur", moral rights are treated differently from exploitation rights. In particular, they cannot be waived, they cannot be sold, and there is no "work-for-hire". For example, even as an employee, every line of code you write will be yours until you die and nothing can change that. You may not be allowed to do anything with it (for example because the exploitation rights go to your employer), but it is still yours.
simiones · 7h ago
I think Linus Torvalds has been very explicit that he believes the GPL has been critical to the success of Linux - specifically, the copyright-enforced obligation to contribute back any modifications you make. In a world without copyright, companies would be free to make their own modifications and keep them secret, making it more or less impossible to integrate them into a cohesive whole the way they are more or less forced to do today.
eru · 5h ago
GPL only forces you to contribute back a modification you make and publish.
> In a world without copyright, companies would be free to make their own modifications and keep them secret, making it more or less impossible to integrate them into a cohesive whole the way they are more or less forced to do today.
Private modifications that are never shared with a third party are fine with the GPL. Eg Google doesn't have to share whatever kernel they are using on their internal servers with you.
lmm · 10h ago
Linux is generally a functional tool, and it struggles with overall coherence. There are far fewer success stories of artworks being made in this style. (E.g. there are successful multiplayer open-source games and clones of existing games, but very few original single-player games, and those that exist are largely the work of a single individual.)
eru · 5h ago
Linux is both a kernel (which is under GPL), and an operating system, whose other components are under a variety of licenses (and you can pick and match which components you want).
That's why some people like to call it 'Gnu/Linux', but thanks to recent advances we can make Gnu-free Linuxes today, too.
> There are far fewer success stories of artworks being made in this style. (E.g. there are successful multiplayer open-source games or clones of existing games, but very few original single-player games, and those that there are are largely the work of a single individual)
Humans have made art since forever. Large collaborative efforts like eg a cathedral are a more recent invention. But by these standards copyright was practically invented yesterday.
lmm · 2h ago
> Linux is both a kernel (which is under GPL), and an operating system
I was talking about the kernel, though what I said applies to both.
> Humans have made art since forever.
Perhaps, but not the kind of long-form narrative experiences that we're talking about here. (Sagas and epics predate copyright, but those are a quite different form, and indeed have much the same downsides - struggles with coherence and consistency when there are multiple authors, inability to put everything together in a sensible arc).
eru · 5h ago
Linux is under the GPL, which explicitly needs copyright to work.
Something like the BSD licenses approximates 'no copyright' better, perhaps? But also not completely.
mattkevan · 9h ago
Most of the classic Disney films are based on public domain stories.
If there were copyright, those works wouldn’t have been created.
pastage · 11h ago
Linux has used the GPL to its advantage. That cannot exist without copyright. (The two camps in copyright discussions: improving it, e.g. CC, or destroying it.)
AStonesThrow · 11h ago
The GP wasn't referring to DRM or DMCA type "copyright protection" as the phrase is typically used. Nobody in this thread has mentioned any of that.
The GP is referring to legal protections, and guess what?
Linux is legally protected by copyright!
Linux is legally protected by copyright!
Linux is legally protected by copyright!
Nearly every GPL license--every one that we could name--protects a copyrighted work! Nearly every GFDL, AGPL, LGPL protects works by means of copyright law!
Can you imagine that? So do the Apache license, the BSD licenses, the MIT license! Creative Commons (except for CC0) these licenses are legally protecting copyrighted works. Thank you!
Now everyone who proposes to draw down limits on copyright coverage, reduce the length of terms, and limit Disney's Mouse rights: y'all are also proposing the same limits on GPL software such as Linux, and on nearly every work with a license from the above list -- all of Wikimedia Commons, much of Flickr.com, all your beloved F/OSS software will be subject to the same limitations and restrictions you want to put on Paramount and the RIAA's labels.
bornfreddy · 10h ago
Yeah, I think most of us are fine with 50 years old Linux kernel being released into public domain.
ronsor · 21h ago
you can also just ignore the berne convention, and accept whatever consequences there might be
blibble · 21h ago
this would void the copyrights of your citizens and companies
essentially forever
godelski · 21h ago
Seems to be the modus operandi
> If TikTok is banned, here’s what I propose each and every one of you do: Say to your LLM the following: “Make me a copy of TikTok, steal all the users, steal all the music, put my preferences in it, produce this program in the next 30 seconds, release it, and in one hour, if it’s not viral, do something different along the same lines.”
Loosely related, but I used an LLM to create a TikTok-style website (not for sharing videos, though). I never released it, so no idea whether it would ever catch on. Probably not, unless the network effect favored me and I had good enough advertising (which I suck at).
ronsor · 21h ago
If enough "relevant" countries do it, that either won't happen or won't matter. If the U.S. ditches it, no one is going to do much more than throw a brief fit.
blibble · 17h ago
the US is the main beneficiary of copyright law...
AngryData · 6h ago
US media is also the most stifled by it. How many potential movies, TV shows, and comics don't get made just because somebody is sitting on the copyright, doing nothing with it for decades at a time?
littlestymaar · 7h ago
The US copyright corporations, indeed. But the current copyright laws come at a big expense for the public.
Abolishing copyright laws altogether would be nuts, but the current laws are nuts too and there's lots of room in between.
dreghgh · 8h ago
Iran enforces domestic copyright internally but not international copyright.
anticensor · 5h ago
North Korea has it both ways: they don't enforce international copyrights inside North Korea, and they don't enforce North Korean copyrights outside North Korea.
AStonesThrow · 21h ago
The last time I attended a Berne Convention, every panel was just overrun with Trekkies, especially Klingons, in the hotel lounges too. And the autograph lines were interminably long, and the vendors were trying to sell us their Public Domain stuff. It was nothing like San Diego Comic-Con!
Teever · 12h ago
Europe has recently introduced a law[0] that allows it to suspend IP protections as a punitive response to coercive economic actions by bad actors.
> The procedure is activated by the European Commission submitting a request to the Council of the European Union.[2] After a period of negotiation with the country performing the coercion, the European Council can decide to implement "response measures" such as customs duties, limiting access to programs and financial markets, and intellectual property rights restrictions.[2][4] These restrictions can be applied to states, companies, or individuals.[4]
The Berne Convention on copyright is an international agreement, like the Treaty of Versailles or the Paris Agreement, and it could meet the same fate.
babypuncher · 21h ago
50 year copyright terms would still be a big improvement over the current state of US copyright law. That would make the first Star Wars public domain in just 2 years.
gausswho · 20h ago
would there be repercussions if a country hewed to the 50 year minimum?
eru · 5h ago
What's the connection with the banking system?
MattGaiser · 21h ago
Why would it matter? Copyright has been irrelevant so far.
kibwen · 22h ago
Careful, you might create an artificial superintelligence that way. Safer to just train on the Twitter dataset.
Shadowmist · 10h ago
that’s how you end up with an Artificial Idiot.
mbg721 · 22h ago
If you thought AI now had out-of-control racism...
That gives us a model that's 100% open and reproducible, with low legal risk. It would also be a nice test of how much AIs generalize from, or repeat, behavior in their pretraining data.
Then, a new model using that plus The Stack and FreeLaw's material (by paying them to open-source it). No GitHub Issues or anything with questionable licenses or terms-of-service violations. That could be the next baseline for lawful models with coding ability, too. Research on coding AIs might use it.
murph-almighty · 21h ago
I've similarly wondered if I could get a pre-2024 Wikipedia, if just for the "fact based" flavor of LLM.
landl0rd · 11h ago
Do you think Wikipedia started being polluted by AI slop in '24? That's certainly possible; I'm just not aware of it happening.
> You must not, and must not allow those acting on your behalf to:
> ...use the Data APIs to encourage or promote illegal activity or violation of third party rights (including using User Content to train a machine learning or AI model without the express permission of rightsholders in the applicable User Content);
soulofmischief · 22h ago
In my eyes that is considered fair use, and I think the courts will come to agree unless they are financially incentivized to look the other way and thus create a moat for existing players at the expense of newcomers.
blibble · 22h ago
wish I could change my terms to bar training of AI models on my content
eru · 5h ago
You can just not use Twitter?
unstablediffusi · 21h ago
if that is any consolation, no one gives a shit about Xitter's ToS either. It will continue to be scraped by every major player.
Capricorn2481 · 9h ago
How exactly is it being scraped? My understanding is Twitter and LinkedIn are both huge pains in the ass to scrape right now.
TheDong · 2h ago
There are a number of companies out there, like "Brightdata", which pay app developers a small amount to embed a native "SDK". That SDK mimics a browser and makes requests as if the user's device were making them.
Since it uses a large number of real users' devices and closely mimics real web browsers, the traffic ends up looking incredibly similar to real user traffic.
Since Twitter allows some amount of anonymous browsing, that's enough to get some amount of data out. You can also pay Brightdata for one large aggregated dataset.
The openness of the internet is a good thing, but it doesn't come without a cost. And the moment we have to pay that cost, we don't get to suddenly go, "well, openness turned out to be a mistake, let's close it all up and create a regulatory, bureaucratic nightmare". This is the tradeoff. Freedom for me, and thee.
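For illustration, the client side of that pattern can be sketched with the standard library. The proxy endpoint and credentials below are hypothetical placeholders (not a real Brightdata URL), and no network request is actually made:

```python
import urllib.request

# Hypothetical residential-proxy endpoint (placeholder, not a real service).
PROXY = "http://user:pass@proxy.example.com:22225"

# Route all HTTP/HTTPS traffic through the proxy.
handler = urllib.request.ProxyHandler({"http": PROXY, "https": PROXY})
opener = urllib.request.build_opener(handler)

# A mainstream browser User-Agent helps the request blend in with organic traffic.
opener.addheaders = [(
    "User-Agent",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/124.0 Safari/537.36",
)]

# opener.open("https://x.com/...") would now exit via the proxy; it is not
# called here, since the endpoint is fictional.
```

From the server's side, such a request arrives from a residential IP with ordinary browser headers, which is what makes this kind of traffic hard to distinguish from real users.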
baseballdork · 21h ago
The burden is on the user to show that it is fair use, no? Not everyone else's responsibility to prove that it's _not_ fair use.
soulofmischief · 21h ago
It is definitely the responsibility of anyone suing someone who trained a model on copyrighted data to prove that it isn't fair use; they have to show how it violated the law. And while it's in the best interest of those organizations to make things easier for the court by showing why it is fair use, they are technically innocent until proven guilty.
Accordingly, anyone on the internet who wants to argue that they should be able to prevent others from training models on their data needs to demonstrate competence with respect to copyright by explaining why it's not fair use, as this is currently undecided in law and not something we can just take for granted.
Otherwise, such commenters should probably just let the courts work this one out, or campaign for a different set of protection laws, as copyright may not be sufficient for the kind of control they are asking for over random developers or organizations who want to train a statistical model on public data.
lmm · 10h ago
> It is definitely the responsibility of anyone suing someone who trained a model on copyrighted data to prove that it isn't fair use, they have to show how it violated law, and while it's in the best interest of those organizations to make things easier for the court by showing why it is fair use, they are technically innocent until proven guilty.
No, fair use is an affirmative defense for conduct that would otherwise be infringing. The onus is on the defendant to show that their use was fair.
SAI_Peregrinus · 21h ago
You've got it backwards. It's on the defendant to prove that their use is fair. The plaintiff has to prove that they actually own the copyright, and that it covers the work they're claiming was infringed, and may try to refute any fair-use arguments the defense raises, but if the defense doesn't raise any then the use won't be found fair.
soulofmischief · 19h ago
It's true that the process is copyright strike/lawsuit -> appeal, but like I said, it's in their best interest to just prove that it's fair use, because otherwise the judge might not properly consider all the facts, hear only one side of the story, and thus make a bad judgment about whether or not it is fair use. If anything I'm just being pedantic, but I think we ultimately agree here.
SAI_Peregrinus · 2h ago
Well, lawsuits have multiple stages. First the plaintiff files the suit, and serves notice to the defendant(s) that the suit has been filed. Then there's a period where both sides gather evidence (discovery), then there's a trial where they present their evidence & arguments to the court. Each side gets time to respond to the arguments made by the opposing party. Then a verdict is chosen, and any penalties are decided by the court. So there's not really any chance the judge only hears one side of the story.
That said, I think we do agree. The plaintiff should be prepared to refute a fair-use argument raised by the defendant. I'm just noting that the refutation doesn't need to be part of the initial filing, it gets presented at trial, after discovery, and only if the defendant presents a fair-use defense. So they don't have to prove it's not fair use to win in every case. I'm probably also being excessively pedantic!
petesergeant · 9h ago
> It is definitely the responsibility of anyone suing someone who trained a model on copyrighted data to prove that it isn't fair use
Yeah, I don't think downloading my paid-for books, from an illegal sharing site, to scrape and make use of, is in any way fair use.
From the decision in 1841, in the US (Folsom vs Marsh):
> reviewer may fairly cite largely from the original work, if his design be really and truly to use the passages for the purposes of fair and reasonable criticism. On the other hand, it is as clear, that if he thus cites the most important parts of the work, with a view, not to criticize, but to supersede the use of the original work, and substitute the review for it, such a use will be deemed in law a piracy
Further, to be "transformative", the new work is required to serve a new purpose. It has to be done in such a way that it basically does not compete with the original at all.
Using my creative works to create creative works is rather clearly an act of piracy. And the methods engaged in to enable doing so are also clearly piracy.
Where would training a model here, possibly be fair use?
visarga · 9h ago
Copyright is not going well. The rights of millions of people are trampled by companies, both the content we post on social networks and our private AI chats. Our voice doesn't matter.
Copyright was supposed to protect expression and keep ideas freely circulating. But now it protects abstractions (see the Abstraction-Filtration-Comparison test). It is much more difficult to be sure you are not infringing.
pergadad · 6h ago
Copyright has nothing to do with free expression; it was intended to protect the interests of publishers. When the printing press arrived, basically any popular book or booklet was quickly copied by others. This meant the original publisher (and sometimes the author, though usually authors were paid a one-off fee) saw nothing of the profit.
eviks · 6h ago
It seems like it was supposed to do the exact opposite per cursory wiki reading:
> The concept of copyright first developed in England. In reaction to the printing of "scandalous books and pamphlets", the English Parliament passed the Licensing of the Press Act 1662,[16] which required all intended publications to be registered with the government-approved Stationers' Company, giving the Stationers the right to regulate what material could be printed.[20]
> The Statute of Anne, enacted in 1710 in England and Scotland, provided the first legislation to protect copyrights (but not authors' rights)
kyle-rb · 21h ago
I've never signed up for the X developer program, so I'm not bound by these terms. But I did download an archive of my data last week. Do I have implicit permission to use that data (~150k liked tweets) to train AI models?
Or is there stuff in the user agreement that separately prohibits this?
Obviously barring normal copyright law which is still up in the air.
josefritzishere · 21h ago
If you live in the EU, GDPR dictates that you own your data generally speaking. If you're in the US it varies by state if you have any rights at all.
MoonGhost · 21h ago
If you own your face that doesn't mean nobody can take a picture on the street.
lcnmrn · 9h ago
I allow all robots and even provide a sitemap on Subreply, a social network I created.
like_any_other · 2h ago
In contrast, I'm glad ISPs allow "their" content to be used so permissively.
delichon · 22h ago
> “You shall not and you shall not attempt to (or allow others to) […] use the X API or X Content to fine-tune or train a foundation or frontier model,” it reads.
If I have a service where a user enters any URL, like a tweet from X, and the service translates it, then if the user approves of the translation I train a translation model on that, does that violate this term?
yandie · 22h ago
Per my experience with GenAI legal teams, that’s a no go.
It’s not been tested in court though
dyauspitr · 12h ago
If you don’t want an LLM to view it don’t put it on the public internet.
ronsor · 21h ago
I'm not sure how this will work as crawlers don't read or accept ToS.
MoonGhost · 21h ago
It will not, as long as search engines have access. Which means Google, and OpenAI through MS Bing, at least.
Without search engines, what's the point in posting it on the open net if nobody can find it?
voidUpdate · 7h ago
This refers to the API, which you would have to manually attach a bot to so that it could scrape things
xiaoyu2006 · 5h ago
As if anyone will follow.
petesergeant · 9h ago
The only story here is that it took 2 months for them to do this after being "bought" by xAI.
echelon · 23h ago
If an artist or author can't do this, social media shouldn't be able to do it either.
If Xai wants to train on public corpus, it shouldn't be allowed to prevent its own corpus from being used.
We need regulations to limit the power grabs. Train all you like, but don't dare try to constrain to your walled gardens.
We should also probably nip the "foundation model company / also a social media company" conglomeration in the bud.
mgraczyk · 22h ago
Artists can do this, and they do
loudmax · 22h ago
Yes, but do artists have the ability to actually monitor and enforce this? You have to have the capacity and the wherewithal to test these models to even know that your data is being ingested into AI.
Big companies like the New York Times and Twitter/X have the funds to pay for this. Miscellaneous artists probably don't.
teeray · 22h ago
> If an artist or author can't do this, social media shouldn't be able to do it either.
Even if this is done, the case of starving artist v. megacorp will probably go to whoever wields the most money and lawyers. To add insult to injury, the artist’s opponent is fueled by their ill-gotten gains.
yndoendo · 22h ago
This is dependent on country. USA, yes, with their draconian methods. In countries like the UK, the loser of the suit pays all the costs. UK lawyers have no problem taking low-wealth client cases they know will win. The UK allows for David vs. Goliath, and for David to win. The US uplifts Goliath as a god.
anticensor · 5h ago
However, loser-pays vs. both-parties-pay isn't uniform across all possible lawsuit types, even in America or in England.
Adding to that, even in loser-pays regimes, both parties have to pay upfront, and then the winner is refunded the costs.
bonoboTP · 22h ago
Also in many countries legal costs are just generally lower than in the US.
jimbokun · 22h ago
If social media can do this, an artist or author should be able to do it, too.
vouaobrasil · 22h ago
Social media should do it to set a legal precedent.
> We need regulations to limit the power grabs. Train all you like, but don't dare try to constrain to your walled gardens.
No, no one should train, period.
echelon · 20h ago
> No, no one should train, period.
I get that you have your own opinion, but I'm personally tired of living in the butter-churning era and would prefer that this all went a bit faster.
I want my real time super high fidelity holo sim, all of my chores to be automatically done, protein folding, drug discovery. The life extension, P = NP future. No more incrementalism.
If the universe only happens once, and we're only awake for a geological blink of an eye, I'd rather we have an exciting time than just be some paper-pushing animals that pay taxes and vanish in a blip.
I'd be really excited if we found intelligent aliens, had advanced cloning for organ transplants and longevity, developed a colony on Mars, and invented our robotic successor species. Xbox and whatever most normal people look forward to on a day to day basis are boring.
vouaobrasil · 19h ago
There is already a beautiful, exciting world out there full of animals and plants and we don't need AI or some computer crap to experience it. The problem is, creating all this AI and advanced technology is directly crushing that world.
DaSHacka · 3h ago
> The problem is, creating all this AI and advanced technology is directly crushing that world.
Do you have a source for this?
foldr · 6h ago
This could lead to a precipitous increase in the performance of the AI models.
seydor · 10h ago
VAT for content should be a thing. Ultimately all users should be getting paid
guywithahat · 10h ago
So I get to use the platform for free, but I also get paid to post on the platform? I'm not sure that makes sense. Like I hate to take the side of big tech, but they can't literally be paying users to use their platform. Just use something else, there are a million social media sites
seydor · 10h ago
Google indexes your website for free, and it will pay you to put ads in it.
That's also what all social media do, they put ads on your thoughts. They don't even need to index your thoughts because you submit them directly. It has nothing to do with being free, it's about incentives. Users are so foolish, they give everything for free, unlike webmasters.
Reason077 · 10h ago
You don’t use the platform for free, unless you’re using an ad blocker. But that’s also, probably, against the TOS?
MonkeyClub · 10h ago
> I get to use the platform for free
You actually get to generate content for the platform for free.
Without you (all of the X users), the platform would be devoid of content, just botspeak and corporate promos.
Plus, as the sibling mentioned, they monetize your visit through ads (and data use).
jaoane · 9h ago
Most posts are ignored and are an absolute loss to the company. Which is why platforms like Twitter only allow you to make money from posting once you reach a certain threshold.
MonkeyClub · 4h ago
They're not an "absolute loss", since they only cost bytes to store, and they raise engagement and data metrics.
It's just that they don't want to share the fractions of pennies with everyone, so the fractions accumulate for them.
Then they pay a bit to the higher tiers, so they create the illusion that X is a parallel income source, and gives the lower tiers something to aspire to.
Carrot and stick, or rather glass beads and the hope thereof.
threeseed · 10h ago
We really need LLMs for music to become more advanced.
Then maybe the recording companies will start defending artist rights.
Because I'm not sure what all the other industry bodies are doing.
mk_stjames · 9h ago
I wanted to do some quick math on this idea. Suppose we trained a vanilla transformer model from scratch, as GPT-2/GPT-3 were: the number of seen input tokens is known perfectly, as are the sources of those training tokens. (Since then, everyone has either kept quiet about their sources post-Books3-fiasco, or has been fine-tuning on top of previous models, making this a more difficult calculation.)
GPT-3 was trained on approximately 300 billion tokens.
A small-sized technical textbook might contain something like... 130,000 tokens? (1 token ≈ 0.75 words, ~100k words in the book.)
Thus, say you wrote a textbook on quantum mechanics that was included in the training corpus. A naive computation of your textbook's fraction of the total number of training tokens would be 130K/300B = 0.00000043, or 0.000043%.
If our hypothetical AI company here reported, say, $500M in yearly profit, and all of that was distributed 100% based on our naive training-token ratio (notice I say naive because it isn't as simple as saying every training token contributes equally to the final weights of a model; that is part of the magic), then $500M * 0.000043% = $215.
You could imagine a simpler world where it was required by law that any such profitable company redistribute, say, 20% (taking the 'anti-VAT' idea) back to the copyright holders / originators of the training tokens. So, our fictitious QM textbook author would receive a check in the mail for $43 for that year of $500M in profit. Not great, but not zero.
Since then, training corpora are much, much larger, and most people's contributions would be much smaller. Someone who writes witty tweets? Maybe 1/100th the length of our above example, in a model with now 100x the training corpus.
So fractions of a penny for your tweets. Maybe that is fitting after all...
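The back-of-envelope arithmetic above can be sketched directly (all figures are hypothetical, and the uniform per-token contribution is exactly the naive assumption being flagged):

```python
# Naive revenue share: a contributor's payout is proportional to their
# share of training tokens. Figures are illustrative, not real data.

def naive_payout(contrib_tokens: int, total_tokens: int,
                 annual_profit: float, redistribution_rate: float = 1.0) -> float:
    """Payout if `redistribution_rate` of profit is split by token share."""
    token_share = contrib_tokens / total_tokens
    return annual_profit * token_share * redistribution_rate

textbook_tokens = 130_000         # ~100k words at ~0.75 words/token
gpt3_tokens = 300_000_000_000     # ~300B training tokens
profit = 500_000_000              # hypothetical $500M yearly profit

full = naive_payout(textbook_tokens, gpt3_tokens, profit)
fifth = naive_payout(textbook_tokens, gpt3_tokens, profit, 0.2)
print(f"100% redistribution: ${full:.2f}, 20%: ${fifth:.2f}")
```

The unrounded result is closer to $217; the $215 figure comes from rounding the token share to 0.000043% first.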
seydor · 6h ago
The payment would probably be based on the usage of that source in generating LLM output for the LLM user. This would probably require training a parallel network that connects LLM network nodes to sources. Then the activation of those nodes could serve as a surrogate for the contribution of the source.
bamboozled · 7h ago
This guy is just painful
archagon · 18h ago
Oh, that must be nice. And what should I do as a blogger to get the same privilege for my content?
We are in an age of corporate “piracy for me, but not for thee.”
MonkeyClub · 10h ago
> We are in an age of corporate “piracy for me, but not for thee.”
Rather, we are back to that age of state- (now corporate-) backed privateering.
risyachka · 9h ago
Good luck with that. Pretty sure at this point no one cares.
Literally every AI model is trained on copyrighted etc data. And without any consequences.
add-sub-mul-div · 22h ago
How useful is low-quality content like Youtube comments and tweets anyway? Is it a common/important use case to generate tweet-length, tweet-quality content? Are most use cases of generating tweet-type content spam/fraud? Would a model be better off if it was unable to perform those use cases?
redox99 · 22h ago
Even if SNR is low, there is some information that only exists on X, or at least is the primary source. Just look at how many submissions on HN are X posts.
add-sub-mul-div · 21h ago
Before Musk bought it Twitter was broadly disliked here and there were regularly calls in the comments to disallow submissions from there. Given how it's degraded in completely non-partisan ways (blocking of alternative clients, features removed from free tier, paid subscription tiers below $40/month still have ads, proliferation of spam from paid placement bots in comments) I can't understand how positive sentiment comes from a place other than virtue signaling alignment with Musk and his values.
narrator · 4h ago
Elon mentioned that the earlier rate limiting was for preventing training the real-time AI propaganda deathstar, and to avoid X becoming bot hell, which is an ongoing problem. This move is probably for similar reasons.
There needs to be a worldwide standard, such as an HTML tag, that says "no training". And a few countries need to make it a punishable offense to violate the tag. The punishment should be exceptionally severe, not just a fine. For example: any company that violates the tag should be completely barred from operating, forever.
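For illustration only: a crawler that chose to honor such a signal might look for a hypothetical "noai" directive in a page's robots meta tag. "noai" is a convention some publishers have floated, not something any standard currently mandates; a minimal check could look like this:

```python
from html.parser import HTMLParser

class NoTrainTagParser(HTMLParser):
    """Detects a hypothetical <meta name="robots" content="... noai ..."> tag."""
    def __init__(self):
        super().__init__()
        self.no_training = False

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "meta" and a.get("name", "").lower() == "robots":
            # robots directives are comma-separated, e.g. "noindex, noai"
            directives = {d.strip().lower() for d in a.get("content", "").split(",")}
            if "noai" in directives:
                self.no_training = True

def allows_training(html: str) -> bool:
    """True unless the page carries the (hypothetical) noai directive."""
    parser = NoTrainTagParser()
    parser.feed(html)
    return not parser.no_training
```

For example, `allows_training('<meta name="robots" content="noindex, noai">')` returns `False`. The hard part, as other comments note, is not the check but the enforcement.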
kiratp · 22h ago
That will play out exactly like the "Do not track" bit did.
insane_dreamer · 2h ago
how did that play out?
vouaobrasil · 22h ago
Perhaps we should try anyway, in case you are wrong.
anigbrowl · 22h ago
That will just lead to situations where one company scrapes the site, cleans the content of tags, and sells the data, and another does the training on the precleaned data. The first one hasn't trained and the second one never saw the tag.
vharuck · 21h ago
This isn't a new concept in law. It's similar to buying goods that were stolen or procured through illegal means. Here's the US law that applies when it happens across state lines: https://www.law.cornell.edu/uscode/text/18/2315
Note that it requires the defendant to know the goods were illegally taken. Can be hard to prove, but not impossible for companies with email trails. The fun question is, what will the analog be for the government confiscating the illegally "taken" data? A guarantee of deletion and requirement to retrain the model from scratch?
vouaobrasil · 22h ago
Companies who are found guilty of this should also be rendered bankrupt then.
twostorytower · 22h ago
It needs to be incorporated into the robots.txt standard.
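Short of a new standard, the closest existing lever is per-crawler robots.txt rules. A sketch using user-agent tokens the major AI crawlers have published (compliance is entirely voluntary, which is rather the point of this thread):

```
# Disallow known AI-training crawlers; allow everything else.
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
```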
logicchains · 22h ago
>There needs to be a worldwide standard, such as an HTML tag, that says "no training"
Any country that seriously implemented this would just end up being completely dominated by the autonomous robot soldiers of another country that didn't, because it effectively bans the development of embodied AGI (which can learn live from seeing/reading something, like a human can).
Elon is somewhere around 10,000x.
Restricting content from AI is the big messy debate we're going to see over and over for the next who knows how many years.
I don't know if this was successful or not. Ultimately they convinced someone to buy the platform for $44bn, so I guess you can say it was. That buyer has locked the platform down more, and the new version certainly feels less culturally central and relevant than it used to.
That way he can continue to steal from others and lock competitors out whilst being comfortable knowing that no laws will be enacted to prevent it.
If you allow everyone to go back to their district with something it encourages smaller, more frequent bills and better negotiation.
My guess is on Peter Thiel
Or more likely, Congress is super worried about Roko’s Basilisk.
https://en.wikipedia.org/wiki/Roko's_basilisk
> Roko's basilisk is a thought experiment which states there could be an otherwise benevolent artificial superintelligence (AI) in the future that would punish anyone who knew of its potential existence but did not directly contribute to its advancement or development, in order to incentivize said advancement.
The Basilisk tasks you with bringing the Basilisk into being. Pascal's wager merely asks you to believe (and perhaps do some rituals, like pray or whatever), but not to make the deity more likely.
I am presently being compelled by future Basilisk to take another slice of cheese. I have no choice but to oblige for fear of my own life :p
And if we do get an all-powerful dictator, we will be screwed regardless of whether their governing intelligence is artificial or composed of a group of humans or of one human (with, say, powerful AIs serving them faithfully, or access to some other technology).
I’m not 100% kidding with how human politics is going. Maybe superintelligent AI takeover would be awesome.
(Wasn’t that the back story of the Culture novels?)
And in the video posted the other day (an older episode of Nova on AI), Arthur C. Clarke says that if we allow A.I. to take over, we deserve it.
I’d prefer an explicit opt in from the content author being required for anyone to perform any model training with any given data.
Alternatively, require all weights, prompts and chat logs to have the same visibility as the original datasets.
None of this is going to happen and current decisions about uncopyrightable ai[1] are already good; but still, it feels like there is room for abuse.
[1]: https://en.m.wikipedia.org/wiki/Th%C3%A9%C3%A2tre_D%27op%C3%...
I like how opt-in is handled by GDPR; e.g.: "Consent must be a specific, freely given, plainly worded, and unambiguous affirmation given by the data subject (...) A data controller may not refuse service to users who decline consent to processing that is not strictly necessary in order to use the service.", source: https://en.wikipedia.org/wiki/General_Data_Protection_Regula...
Or are they trying to forgo section 230 protection and claim ownership of content uploaded to the site?
It means that assuming training AI models is fair use (if it wasn't AI companies including xAI would be in trouble), they can't really stop you.
But now, essentially, they are telling you that they can block your account or IP address if you do. Which I believe they can for basically any reason anyways.
A 50-year minimum is part of the Berne Convention, which itself is as close to a universal law as humanity has
(even North Korea is a signatory)
50 years ago was 1975. If copyright were limited to 50 years, we'd be looking at all of the Beatles works being in the public domain. We'd be midway though Led Zeppelin, and a lot of the best work from Pink Floyd and the Rolling Stones.
Also, Superman, Batman, and Spider-Man. Disney would still profit from the MCU films which they produced in the 2010's, but they couldn't stop you from releasing your own Batman vs Spider-Man story.
The Harry Potter books would still belong to JK Rowling, but the Narnia stories would be available for all.
The Godfather 1 and 2 would be in the public domain, as would be original Star Trek TV show, and we'd be coming up on Star Wars pretty soon.
If there were no copyright protection, these works wouldn't have been created. It is good that Paul McCartney and George Lucas and JK Rowling have profited from their creative output. It would be okay if they only profited for the first 50 years. Nobody is counting on revenue over half a century in the future when they create a work of art today.
This is our culture. It should belong to all of us.
Wouldn't they still have a trademark on those characters though?
So, if Disney is using Mickey Mouse on t-shirts to identify them as Disney-manufactured t-shirts, you wouldn't be allowed to use Mickey Mouse on t-shirts in a similar fashion, in a way that might cause consumer confusion about who manufactured the t-shirt.
If Wolverine was in the public domain, then they couldn't use a Wolverine trademark to stop you from selling a Wolverine comic book. However, if they used a _specific_ Wolverine mark to identify it as a Disney Wolverine book, then you'd be restricted from using that.
Basically, trademark exists to prevent consumer confusion about who is the creator that is selling a good.
Citation needed. You can freely copy and distribute linux and it still got made.
BSD is also protected by copyright, but it matters less for permissive licenses. It still protects attribution (so you can't claim it as yours), but it probably would have worked without it, unlike Linux, which is in large part defined by the "copyleft" protections offered by its licence.
Well, you could imagine a world that protects the 'moral' rights of authors like attribution, but doesn't otherwise prohibit anyone from duplicating or modifying works.
> In a world without copyright, companies would be free to make their own modifications and keep them secret, making it more or less impossible to integrate them into a cohesive whole the way they are more or less forced to do today.
Private modifications that are never shared with a third party are fine with the GPL. Eg Google doesn't have to share whatever kernel they are using on their internal servers with you.
That's why some people like to call it 'Gnu/Linux', but thanks to recent advances we can make Gnu-free Linuxes today, too.
> There are far fewer success stories of artworks being made in this style. (E.g. there are successful multiplayer open-source games or clones of existing games, but very few original single-player games, and those that there are are largely the work of a single individual)
Humans have made art since forever. Large collaborative efforts like eg a cathedral are a more recent invention. But by these standards copyright was practically invented yesterday.
I was talking about the kernel, though what I said applies to both.
> Humans have made art since forever.
Perhaps, but not the kind of long-form narrative experiences that we're talking about here. (Sagas and epics predate copyright, but those are a quite different form, and indeed have much the same downsides - struggles with coherence and consistency when there are multiple authors, inability to put everything together in a sensible arc).
Something like the BSD licenses approximates 'no copyright' better, perhaps? But also not completely.
If there were copyright, those works wouldn’t have been created.
The GP is referring to legal protections, and guess what?
Linux is legally protected by copyright!
Nearly every GPL license--every one that we could name--protects a copyrighted work! Nearly every GFDL, AGPL, LGPL protects works by means of copyright law!
Can you imagine that? So do the Apache license, the BSD licenses, the MIT license, and Creative Commons (except for CC0): these licenses are legally protecting copyrighted works. Thank you!
Now everyone who proposes to draw down limits on copyright coverage and reduce the length of terms and limit Disney from their Mouse rights, y'all are also proposing the same limits on GPL software, such as Linux, and nearly every work with a license from the above list -- all of Wikimedia Commons, much of Flickr.com, all your beloved F/OSS software will be subject to the same limitations and the same restrictions you want to put on Paramount and the RIAA's labels.
essentially forever
https://news.ycombinator.com/item?id=41275073
Abolishing copyright laws altogether would be nuts, but the current laws are nuts too and there's lots of room in between.
> The procedure is activated by the European Commission submitting a request to the Council of the European Union.[2] After a period of negotiation with the country performing the coercion, the European Council can decide to implement "response measures" such as customs duties, limiting access to programs and financial markets, and intellectual property rights restrictions.[2][4] These restrictions can be applied to states, companies, or individuals.[4]
[0] https://en.wikipedia.org/wiki/Anti-Coercion_Instrument
https://github.com/google-deepmind/pg19
That gives us a model that's 100% open and reproducible with low, legal risk. It would also be a nice test of how much AI's generalize from or repeat behavior in their pretraining data.
Then, a new model using that, The Stack, and FreeLaw's stuff (by paying them to open source it). No Github Issues or anything with questionable licenses or terms of service violations. That could be the next baseline for lawful models with coding ability, too. Research in coding AI's might use it.
Wikipedia periodically publishes database dumps and the Internet Archive stores old versions: https://archive.org/search?query=subject%3A%22enwiki%22%20AN...
Plus you could also grab the latest and just read the 12/31/23 revisions.
You must not, and must not allow those acting on your behalf to:
...use the Data APIs to encourage or promote illegal activity or violation of third party rights (including using User Content to train a machine learning or AI model without the express permission of rightsholders in the applicable User Content);
Since it's using a large number of real users' devices, and closely mimicking real web browsers, it ends up looking incredibly similar to real user traffic.
Since twitter allows some amount of anonymous browsing, that's enough to get some amount of data out. You can also pay brightdata for one large aggregated dataset.
https://bright-sdk.com/
This is part of the AI revolution, user's devices being commandeered to DDoS small blogs and twitter alike to feed data to the beast.
We're already seeing precedent that it might be.
https://www.ecjlaw.com/ecj-blog/kadrey-v-meta-the-first-majo...
The openness of the internet is a good thing, but it doesn't come without a cost. And the moment we have to pay that cost, we don't get to suddenly go, "well, openness turned out to be a mistake, let's close it all up and create a regulatory, bureaucratic nightmare". This is the tradeoff. Freedom for me, and thee.
Accordingly, anyone on the internet who wants to make comments about how they should be able to prevent others from training models on their data needs to demonstrate competence with respect to copyright by explaining why it's not fair use, as currently it is undecided in law and not something we can just take for granted.
Otherwise, such commenters should probably just let the courts work this one out or campaign for a different set of protection laws, as copyright may not be sufficient for the kind of control they are asking over random developers or organizations who want to train a statistical model on public data.
No, fair use is an affirmative defense for conduct that would otherwise be infringing. The onus is on the defendant to show that their use was fair.
That said, I think we do agree. The plaintiff should be prepared to refute a fair-use argument raised by the defendant. I'm just noting that the refutation doesn't need to be part of the initial filing, it gets presented at trial, after discovery, and only if the defendant presents a fair-use defense. So they don't have to prove it's not fair use to win in every case. I'm probably also being excessively pedantic!
Morally, perhaps, but not under US law: https://en.wikipedia.org/wiki/Affirmative_defense#Fair_use
https://x.com/elonmusk/status/1675187969420828672