The latest 10 hour podcast from Lex Summarized to 18 bullets (andrewarrow.dev)

A consensus of confidence and self-consistency through consensus seem fine for certain kinds of tasks, especially ones that involve recalling training. This is a bit like scaling up determinism, obedience, collectivism vs individualism... which seems fine for many math problems. Neither confidence or consensus are the best way to confirm truth or accuracy in a generalized way.

The previous self-consistency approach and this confidence pruning approach aren't really novel, but it's nice to see the numbers run. Fundamentally these approaches are about handling contradicting results, but not resolving the contradictions or increasing the quality of reasoning. What if the rare idea is the right answer? You can squeeze the training juice harder, but if you still get the wrong answer when it really really mattered, you're just left with a stress toy in your hand.

carbocation · 1h ago

One thing that is confusing about this write-up is that "DeepConf-low" is only mentioned once and in a screenshot, but it seems to outperform DeepConf-high for several tasks. I guess I'll need to read the underlying paper, but that seems troublesome.

cubefox · 1h ago

It's likely confusing because it was written by an LLM.

TurboSkyline · 1h ago

Is this article written by an LLM?

vpribish · 1h ago

sure looks like it was. If they cant bother to write it i'm for sure not going to read it.

ChrisMarshallNY · 1h ago

I'm not sure I'd see things the same way. Lot of work went into it; even if the final was LLMed. The result is quite readable.

The authors seem to be Chinese, and may not be that confident in their English. I suspect that we'll be seeing a lot more of this kind of stuff, as time goes on.

carbocation · 1h ago

I don't think disclosure is necessary, but I think it can build trust in cases like this. "Please note that we used an LLM to rewrite our initial English draft." The reason to do this is that then people don't waste cycles wondering about the answer to this question.

ChrisMarshallNY · 1h ago

I agree. Their LLMed English is much better than my Chinese.

Also, some of the very worst English I've ever read, has been technical prose, written by born-and-bred native English speakers with very high educational credentials.

Clear communication is important. The best idea on Earth, is worthless, if it can't be articulated well.

cubefox · 1h ago

> Lot of work went into it; even if the final was LLMed.

No, it was fully or almost fully LLM generated. See: https://arxiviq.substack.com/p/coming-soon

ChrisMarshallNY · 29m ago

So the LLM did all the research? From that posting, it sounds like they accepted a human-made paper, and LLMed it, themselves. The authors are not to blame at all.

If otherwise, then it looks like The Singularity has arrived.

nowittyusername · 1h ago

Correct me if I am wrong, but by the looks of things on that chart the reduction in token use and the better score are all related to the fact that this method used 512 samples.... This doesn't seem to be of any use for local running agents or anything that has severe vram restrictions such as local models that people can run at home. So this would only benefit enterprise level systems no?

Der_Einzige · 1h ago

This is inference time scaling where as it tries to generate a sample which through logprobs "looks wrong" it early cutsoff. It has a vLLM implementation which is easy to install and use. You can apply the technique to some 4bit model 7b model on your old laptop tier nvidia GPU easily.

Well, the folks on this website think installing vLLM (pip install vLLM...) is hard and that ollama - a far slower and shittier inference engine - is better. Enormous damage has been done to the hobbyist LLM ecosystem due to folks not knowing what tools work on what platform.

The one exception is for mac peasants where llama.cpp is still probably the best implementation, but if you have nvidia and you're not using sglang or vLLM, you're doing it wrong.

But this is of ENORMOUS use for folks who want to run tiny models at home. Go to bed wake up with a K=512 solution answer to your problem.

jxf · 1h ago

As someone who is operating an enterprise platform that uses vLLM in the stack, it's immensely harder than "pip install vllm" to have it working at scale and kept up to date.

vlovich123 · 1h ago

If you think getting VLLM working correctly is just a pip install vllm, you haven’t tried it in very many environments.

cubefox · 1h ago

This article, like all articles on this substack, is LLM generated.

Source: https://arxiviq.substack.com/p/coming-soon

che_shr_cat · 31m ago

I'm the author of this blog. That's correct, the texts are generated and then validated manually by me.

I also do manual reviews (https://gonzoml.substack.com/), but there are many more papers for which I don't have time to write a review. So I created a multi-agentic system to help me, and I'm constantly iterating to improve it. And I like the result. It was also validated by the paper authors a couple of times, they agree the reviews are correct. So, if you see something is definitely wrong, please let me know.

Regarding myself, I became at least x10 more productive in reading papers and understanding what's happening. Hope, it will also help some of you.

yoouareperfect · 1h ago

What's the difference with lowering the temperature?

furyofantares · 1h ago

I think ideally you want the whole path to be the most probable path, which is not likely to be the same as taking the most probable token at each step.

It's not remotely practical to select the most probable path but you can do a little bit of search a few tokens at a time.

nickandbro · 1h ago

Wonder what this means for the pelican riding on a bicycle test? Or will it just be good at strictly reasoning type problems.

evertedsphere · 18m ago

yet again i am asking for a mandatory "(LLM output)" label in the title like we do for pdf/video links

The latest 10 hour podcast from Lex Summarized to 18 bullets (andrewarrow.dev)

People stuck using ancient Windows computers (bbc.com)

Why the Internet Is "Dangerous" (1995) (spectacle.org)

Thinking Across Languages (hedgehogreview.com)

Show HN: VoNote – AI Conversation Recorder for iOS (apps.apple.com)

Show HN: I built Mix – an opensource, local agent for multimodal tasks (github.com)

Unauthorized OpenAI Equity Transactions (openai.com)

Why 'rocks as big as cars' are flying down the Dolomites (bbc.com)

The Economics of Stagflation, Part II – Paul Krugman (paulkrugman.substack.com)

Ask HN: What would you look for in a hobbyist alternative to IDA Pro?

Lichess Is Down (lichess.org)

Malvertising campaign targets 300 companies with Atomic macOS Stealer (scworld.com)

Optimizing Factorio startup performance when running on a hard disk (aa55.dev)

Secret Management on NixOS with sops-Nix (michael.stapelberg.ch)

CPAP Medical Data Breach Impacts 90k People (securityweek.com)

Consuming the Delta Lake Change Data Feed for CDC (clickhouse.com)

Escaping the chains of tethered products: the Juice Rescue project (natematias.com)

Intel Employee Data Exposed by Vulnerabilities (securityweek.com)

Mobilizon team got new funding, which changes and fixes would you like? (framacolibri.org)

Bioinspired Design of Ergonomic Handles Using 3D-Printed Cellular Metamaterials (mdpi.com)

Free Idea for AI Entrepreneur

M-x spook (Emacs mail amusement) (gnu.org)

Rivian R1S suddenly went into limp mode: 4mph on a 8-lane intersection (old.reddit.com)

Evaluation of Hashing Algorithms Ascon, SHA256, SHA512 and BLAKE3 on a Cortex M7 (aa55.dev)

Zod Codecs (colinhacks.com)

The Illustrated GPT-OSS (newsletter.languagemodels.co)

Show HN: AsyncFlow – stop guessing p95 latency, simulate your system before prod

Congress's Technology Prerogatives (hollomancer.substack.com)

Additions with unique digits: a tale of puzzling and AI (simia.net)

A collection of community tools and resources (github.com)

'Sneakflation': how tariffs are gradually raising costs for American consumers (cnn.com)

AR Glasses Need Ultrasound UI for Broad Adoption – EE Times (eetimes.com)

At the 'Edge of the World' in Portugal (nytimes.com)

A better way to think about AI (theatlantic.com)

Is 4chan the perfect Pirate Bay poster child to justify wider UK site-blocking? (torrentfreak.com)

Europe's Quantum Strategy: Urgency Beyond the Blueprint (eetimes.eu)

Marcuswu/makercad: Go library for supporting makercad node module (github.com)

Africa Is Buying a Record Number of Chinese Solar Panels (wired.com)

Which Programming Language Should I Teach First? (parentheticallyspeaking.org)

Show HN: Classic Lode Runner native game (github.com)

Exploring the Benefits and Risks of AI-Powered Innovation (knowledge.wharton.upenn.edu)

Ralph Wiggum as a "Software Engineer" (ghuntley.com)

The Day I Logged in with a Hash (gsociety.fr)

Burkina Faso rejects Bill Gates' plan to create mosquito species (africa.businessinsider.com)

The big Intel investment comes from already awarded grants (techcrunch.com)

Futarchy's Fundamental Flaw (dynomight.net)

Pebble 2 Duo is in mass production (ericmigi.com)

Virtual Boy at 30: The legacy of Nintendo's biggest console flop (polygon.com)

Unicode Misconceptions (jean.abou-samra.fr)

Solar car teams in 3,000km race across Australian outback (theguardian.com)

Deep Think with Confidence

Comments (22)