Precision-Based Sampling of LLM Judges
sunny-bak · 5/27/2025, 11:33:57 PM · sunnybak.net
Built a system that automatically determines how many LLM-as-a-judge runs you need for statistically reliable scores.
Key insight: treat each LLM evaluation as a noisy sample, then use confidence intervals to decide when to stop sampling. The math shows reliability is surprisingly cheap (going from 95% to 99% confidence costs only ~1.7x more samples), but precision is expensive (doubling the scale granularity costs 4x more samples).
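A minimal sketch of the stopping rule, assuming a normal approximation and a hypothetical `judge_score()` callable that returns one numeric score per run (not the repo's actual API):

```python
import statistics

def sample_until_precise(judge_score, precision=0.5, confidence=0.95,
                         min_samples=3, max_samples=50):
    """Draw judge scores until the CI half-width is <= the target precision."""
    z = statistics.NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # 1.96 for 95%
    scores = [judge_score() for _ in range(min_samples)]
    while True:
        n = len(scores)
        sem = statistics.stdev(scores) / n ** 0.5   # standard error of the mean
        half_width = z * sem                        # confidence interval half-width
        if half_width <= precision or n >= max_samples:
            return statistics.mean(scores), half_width, n
        scores.append(judge_score())
```

This is also where the scaling claims come from: the required n grows like (z / precision)^2, so moving from 95% to 99% confidence multiplies n by (2.576 / 1.960)^2 ≈ 1.73, while halving the CI width multiplies it by 4.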
Also implemented "mixed-expert sampling": rotating through multiple judge models (GPT-4, Claude, etc.) in the same batch for better robustness.
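The rotation could look something like this sketch, where `call_judge` and the model names are placeholders rather than the repo's real interface:

```python
import itertools

def mixed_expert_scores(prompt, call_judge,
                        models=("gpt-4", "claude-3-opus"), n_samples=10):
    """Round-robin through judge models so no single model's bias dominates."""
    rotation = itertools.cycle(models)
    return [call_judge(model=next(rotation), prompt=prompt)
            for _ in range(n_samples)]
```

Wrapped as a zero-argument closure, this rotation can serve as the `judge_score` callable in the stopping-rule sketch above, so the precision check runs over a mixture of judges.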
Analyzed how latency, cost, and reliability scale with this approach.
Typical result: need 5-20 samples instead of guessing. Especially useful for AI safety evals and model comparisons where reliability matters.
Code: https://github.com/sunnybak/precision-based-sampling
Blog: https://www.sunnybak.net/blog/precision-based-sampling