> How much do language models memorize?
— https://arxiv.org/abs/2505.24832
— https://news.ycombinator.com/item?id=44171363
The paper shows that models are limited in how much they can memorise (~3.6 bits per parameter), and once that threshold is reached, the model starts to generalise instead of memorising.
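As a rough back-of-envelope sketch of what that figure implies (the ~3.6 bits/parameter is the paper's headline number; the model sizes below are purely illustrative, not values from the paper):

    # Back-of-envelope: approximate memorisation capacity implied by ~3.6 bits/parameter.
    # The per-parameter figure comes from the paper; the model sizes are illustrative
    # examples chosen here, not values taken from the paper.
    BITS_PER_PARAM = 3.6

    def capacity_bytes(n_params: float) -> float:
        """Approximate raw memorisation capacity in bytes for a model with n_params parameters."""
        return n_params * BITS_PER_PARAM / 8

    for name, n_params in [("1B params", 1e9), ("8B params", 8e9), ("70B params", 70e9)]:
        print(f"{name}: ~{capacity_bytes(n_params) / 1e9:.1f} GB of raw memorised content")
    # Typical training sets are orders of magnitude larger than these capacities,
    # which is the paper's point: past this ceiling, further training pushes the
    # model toward generalisation rather than verbatim recall.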
pixl97 · 14h ago
> However, they tend to memorize unwanted information, such as private or copyrighted content,
I mean, humans don't forget copyrighted information. We just typically adjust it enough (some of the time) to avoid a copyright strike while still modifying it in some useful way.
We don't forget 'private' information either. We might not tell other people that information, but it still influences our thoughts.
The idea of a world where we make AI minds forget vast amounts of information that humans have to deal with every day is concerning and dystopian to me.
squidbeak · 6m ago
I agree. As far as copyrighted and artistic works go, I've never fully understood what the objection is. If the work is being remixed, not copied, then surely it falls under fair use? Meanwhile, if it creates something new in an artist's style, it's only doing what talented imitators routinely do. There's the economic argument, but if that's accepted, then for fairness it would have to be extended to every other profession that stands to be wiped out by AI, which would be daft.
New works in familiar styles are something I can't wait for. The idea that the best Beethoven symphony hasn't been composed yet, or that the best Basquiat hasn't been painted yet, or that if the tech ever gets far enough, Game of Thrones might actually be done properly with the same actors, is a pretty mouthwatering prospect. Also styles we haven't discovered, that AI can anticipate. How's it to do that without a full understanding of culture? Hobbling the delight it could bring generally for the sake of protected classes will just make the tech less human and a lot less exciting.
johnjreiser · 13h ago
I'd counter with an anecdote: I had a colleague who boasted that he had memorized a classmate's SSN in college and would greet him by SSN when seeing him years later. Is the goal of AI to replicate the entirety of the human experience (including social pressures, norms, and shame), or to be a tool that complements human decision making?
While, yes, you can argue the slippery slope, it may be advantageous to flag certain training material as exempt. We as humans often make decisions without perfect knowledge, and "knowing more" doesn't guarantee better outcomes, given the types of information consumed.
lmm · 10h ago
Knowing more might not improve your accuracy but it's not going to harm it. Forcibly forgetting true parts of your knowledge seems far more likely to have unintended consequences.
Dylan16807 · 8h ago
I disagree. Actively fighting against your memory will slow you down in any context where some memorized idea is similar to what you're doing but shouldn't actually be used.
conception · 9h ago
Counterpoint: there are plenty of examples of breakthroughs from folks who were ignorant of the "right" way to go about it. A fresh take isn't always bad.
lou1306 · 5h ago
One obvious consequence: the model might still produce infringing output because it thinks its creative ideas are novel.
genewitch · 5h ago
If the copyrighted content is not in the training data, and I mean explicitly, and the AI produces a copyrighted output, I'd argue it's a clean room re-implementation, and also it ought to devalue the original work, more so if the work is more recent. Maybe.
I get that "first to publish" matters to a lot of people, but, say 5 unrelated people are writing unique screenplays about a series of events that seems important to them or culture or whatever; if they all come up with very similar plots and locations and scenes, it just means that the idea is more obvious than non-obvious.
Please, argue. I haven't fully reconciled a lot of this to myself, but off the cuff this'll do.
The logic being: if an AI without taint produces some other work, that work drew on the same information the model did and came to the same "conclusion" - which means that, with a time machine, you could wipe the LLM, go back to the period of the original work, train the LLM, and produce the work contemporaneous with the original. Hope that made sense.
lmm · 4h ago
> If the copyrighted content is not in the training data, and I mean explicitly, and the AI produces a copyrighted output, I'd argue it's a clean room re-implementation
You can't claim it's a clean room without actually doing the legwork of making a clean room. Not including the copyrighted work verbatim isn't enough; you would need to show that the AI hadn't seen anything derived from that copyrighted work, or that it had seen only non-copyrightable pieces.
lou1306 · 4h ago
> The logic being: if an AI without taint produces some other work, that work drew on the same information the model did and came to the same "conclusion" - which means that, with a time machine, you could wipe the LLM, go back to the period of the original work, train the LLM, and produce the work contemporaneous with the original. Hope that made sense.
This logic would immediately get shot down by an "Objection, speculation" in actual litigation. Besides, the technicalities of how a work was produced don't really play a role in assessing infringement. P. K. Dick wrote "The Man in the High Castle" by extensively using the I Ching, but if I used it and recreated the novel by complete accident, I would still be infringing.
By the way, I highly suggest Borges's "Pierre Menard, Author of the Quixote" as a great story on the topic of authorship :)
lynx97 · 4h ago
The goal of AI is to make money. All the moralisation is very human, but also extremely naive.
BTW, I don't really understand what "social pressure" and "shame" have to do with your story. In my book, the person with a good memory isn't to blame. They're just demonstrating a security issue, which is a good thing.
falcor84 · 4h ago
In that example, the mnemonist should be demonstrating the security issue to the government, and not to their friend. We have social taboos for this reason. As an extreme example, I wouldn't greet a person by their penis size after noticing it in the locker room - some information should still be considered private, regardless of how we came to obtain it.
Same with an LLM: when sensitive information ends up in its weights, regardless of how it got there, I think we should apply pressure/shame/deletion/censorship (whatever you want to call it) to stop it from using that information in any future interactions.
lynx97 · 3h ago
I am probably too autistic to recognize remembering a personal datum as a taboo.
However, I am totally on your side regarding LLMs learning data they shouldn't have seen in the first place. IMO, we as a society are too chicken to act on the current situation. It's plain insane that everyone and their dog knows that libgen has been used to train models, yet the companies who did this have experienced NO consequences at all. After that, we shouldn't be surprised if things go downhill from here on.