Show HN: AI at Risk, a silly LLM benchmark

1 crimsoneer 0 8/2/2025, 6:59:04 PM ai-at-play.online ↗

Hey HN! Thought I'd share this side project I've been working on over the last couple of weeks: four AI agents play the classic board game Risk, with make-believe personas (Genghis Khan is doing great, Captain Jack Sparrow not so much) and randomly selected models.

I added the new "cloaked" Horizon Alpha model last week, and it has been absolutely decimating the competition (I've also just added Horizon Beta, so we'll see how it does).

It's more of a fun toy than a robust experiment, but I've found the interactions really interesting. If you'd like more detail, you can also read my blog post here: https://andreasthinks.me/posts/ai-at-play/
