"New Coke" is one of the most notable failed product launches in the American food and beverage industry, and I feel like some of its core lessons are becoming increasingly relevant to modern AI developers.
For those <40: In the 1980s, senior executives at Coca-Cola had a problem: Pepsi was gaining ground, partly thanks to the "Pepsi Challenge" - blind sip tests in which consumers often preferred Pepsi's sweeter taste. Coke R&D developed a new, sweeter formula that beat both Pepsi and original Coke in the same single-sip taste tests, across many thousands of consumers. Based on this data, they launched "New Coke" in 1985.
The result was a legendary disaster: outrage, protests, hoarding of the original formula. The problem was that people didn't just sip Coke; they drank whole cans. They also valued the brand, the history, and the familiarity - factors the narrow taste tests completely missed. Within months, "Coca-Cola Classic" was back. New Coke production was quietly scaled back in the early '90s, though it stuck around in a few markets until the early '00s.
I think AI practitioners are starting to learn the same lesson. We're tuning our models with RLHF, DPO, and other preference methods based on similar one-shot blind taste tests: raters pick the "better" response between two options, which tends to reward immediate helpfulness, agreeableness, or perceived safety in that one isolated interaction. I think some of the more extreme recent LLM tuning may also be fueled by taste-test-style benchmarks like the LMSYS Chatbot Arena and the Artificial Analysis image leaderboard.
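For anyone who hasn't looked under the hood, here's roughly the objective doing the work - a minimal sketch of a DPO-style loss, not any particular lab's actual training code (the tensor names are illustrative). Note that the loss only ever sees one isolated "sip" comparison at a time:

    import torch
    import torch.nn.functional as F

    def dpo_loss(policy_chosen_logps, policy_rejected_logps,
                 ref_chosen_logps, ref_rejected_logps, beta=0.1):
        """DPO over a batch of rater preference pairs. Each tensor holds the
        summed log-prob a model assigns to the response the rater preferred
        ("chosen") or dispreferred ("rejected"), shape (batch,)."""
        # Implicit reward: how far the policy has drifted from the frozen
        # reference model on each response, scaled by beta.
        chosen_reward = beta * (policy_chosen_logps - ref_chosen_logps)
        rejected_reward = beta * (policy_rejected_logps - ref_rejected_logps)
        # Maximize the margin between chosen and rejected; every gradient
        # step pushes probability mass toward whatever won the single sip.
        return -F.logsigmoid(chosen_reward - rejected_reward).mean()

Nothing in that objective knows whether the rater would still like the "chosen" response after the hundredth conversation - exactly the blind spot of a single-sip test.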
Examples: ChatGPT's most recent update turned it into an overenthusiastic sycophant. Image models (Apple's Image Playground is a particularly egregious example you can try right now) are frequently preference-tuned until every generation looks like a still from a Pixar movie. Certain music models are incapable of generating anything that doesn't sound like a 2020s top-40 song.
In all cases, it might taste/sound/look good once, but eventually people get sick of it. I work on generative models, and I think (at least for our modality, music) the most enduring enjoyment comes from surprise and delight - qualities that preference tuning steadily erodes by collapsing the distribution of possible outputs.
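One crude way to see that collapse in numbers: tag a batch of generations with coarse labels (genre, style) and compare Shannon entropy before and after tuning. This is a synthetic illustration with made-up weights, not real model data:

    import math
    import random
    from collections import Counter

    def entropy_bits(labels):
        """Shannon entropy (bits) of a list of discrete labels -
        a rough proxy for output diversity."""
        n = len(labels)
        return -sum((c / n) * math.log2(c / n)
                    for c in Counter(labels).values())

    genres = ["ambient", "jazz", "noise", "folk", "edm", "pop"]
    # Hypothetical genre tags for 1,000 generations from the same prompts.
    base_model = random.choices(genres, k=1000)  # broad distribution
    tuned_model = random.choices(genres, weights=[1, 1, 1, 1, 5, 40], k=1000)

    print(f"base:  {entropy_bits(base_model):.2f} bits")   # ~2.6 bits
    print(f"tuned: {entropy_bits(tuned_model):.2f} bits")  # roughly 1 bit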
Are we optimizing away the very qualities that make these models interesting, creative, and truthful in the long run, just to win the immediate "preference" taste test and rank higher in benchmarks? IMO we're witnessing the New Coke of AI.
techpineapple · 1d ago
I think your metaphor is a bit convoluted :-) but I think your theory here is important.
There are a lot of concerns I have with AI in this area. For "facts" - e.g., who signed the Declaration of Independence - efficient search with one answer is probably fine. But for anything remotely controversial, the idea that we're going to accept the AI's one answer is really problematic, and I think it will lead to what I've been calling AI-Think (a la groupthink).
This is directly adjacent to what you're describing. Instead of reading perspectives from different bloggers or different sources, you'll be getting all of your perspectives from one identical worldview. Instead of browsing a feed with a whole bunch of different personalities, it's basically a feed of one, and, as you're describing, that feed will be definitionally middle-ground milquetoast.
Paul Graham said he was replacing most of his Google searches with ChatGPT, and I trust that Paul Graham is a smart guy aware of this risk. But it does make me wonder: if you're using AI to do all the research for your writing, how does that affect your writing when your sources become a monoculture? How does it collectively affect all art inspired by AI?
zaptrem · 1d ago
Fighting back against one set of AI opinions for everyone (which tended to be criticized as "woke") in order to better reflect the "vibes" of the user is also part of what got the most recent 4o release enthusiastically agreeing with flat-earthers. But if the model isn't allowed to state opinions at all, you get really high refusal rates, which annoys everyone.
techpineapple · 1d ago
Both of these are bad options, though. Reflecting the "vibes" of the user is worse than ChatGPT having its own crappy opinions, because at least you're exposed to something different. But it's not even about the opinion: it's not enough for ChatGPT to say "some people believe in free markets and others would like a more centrally controlled economy" - it's the diverse aesthetics, the patina, that matter.