Basically it boils down to this: for most queries, google/gemini-2.5-flash is the workhorse: fast, cheap, and good enough.
Add in multimodality and a 1M-token context and it is such a Swiss Army knife.
It is cheap and performant enough to run 100k queries (a major document-classification task took a bit over a day and cost around 30 euros). Yes, in theory this could have been done with a fine-tuned BERT, or maybe even with some older methods, but it saved way too much time.
There is another factor that may explain why Flash is #1 in most categories on OpenRouter: Flash has gotten reasonably decent at less common human languages.
Most cheap models (including Flash Lite) and local models have mostly English-focused training.
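Those batch numbers imply a rough unit cost and throughput. A back-of-envelope sketch (the 26-hour figure is an assumed reading of "a bit over a day"):

```python
def per_query_cost_eur(total_cost_eur: float, n_queries: int) -> float:
    """Average cost per query for a batch job."""
    return total_cost_eur / n_queries

def implied_qps(n_queries: int, wall_hours: float) -> float:
    """Sustained queries per second over the whole run."""
    return n_queries / (wall_hours * 3600)

# ~30 EUR / 100k queries -> 0.0003 EUR (0.03 euro cents) per classification
cost = per_query_cost_eur(30.0, 100_000)

# "a bit over a day" -> call it 26 hours -> roughly one query per second
qps = implied_qps(100_000, 26)
```

At roughly one query per second sustained, this is well within typical API rate limits, which is why the job needed no special batching infrastructure.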
karmakaze · 25m ago
This was my initial assessment as well. Also note:
> Grok I forgot about until it was too late.
I was surprised by how much I prefer Grok to the others. Even its persona is how I prefer it: detailed without volunteering unwanted information or sycophancy. In general I'd use Grok-3 more than 4, as it is good enough for common uses.
I suspect that Claude would be best only if I gave it a long, complex task with enough instructions up front so it could grind away on it while I was doing something else, rather than waiting on it.
rplnt · 3h ago
> Almost all models got almost all my evaluations correct
I find this the most surprising. I have yet to cross the 50% threshold of bullshit to possible truth, in any kind of topic I use LLMs for.
simonw · 2h ago
It's useful to build up an intuition for what kind of questions LLMs can answer and what kind of questions they can't.
Once you've done that, your success rate goes way up.
prism56 · 1h ago
I'm pretty new to AI and have access to a few models in Kagi. I just never know which to pick; it kind of annoys me that I might not be using the best one.
EagnaIonat · 4h ago
> To access their best models via the API, OpenAI now requires you to complete a Know-Your-Customer process similar to opening a bank account.
While this is true, you can download the OpenAI open source model and run it in Ollama.
The thinking is a little slow, but the results have been exceptional vs. other local models.
https://ollama.com/library/gpt-oss
Which of these can I run locally on a 64GB Mac Mini Pro? And how much does quantization affect the quality?
simonw · 2h ago
I use a 64GB M2 MacBook Pro. I tend to find any model smaller than 32B works well (I can just about run a 70B, but it's not worth it as I have to quit all other apps first).
My current favorite to run on my machine is OpenAI's gpt-oss-20b because it only uses 11GB of RAM and it's designed to run at that quantization size.
I also really like playing with the Qwen 3 family at various sizes and I'm fond of Mistral Small 3.2 as a vision LLM that works well.
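As a rough sanity check on those sizes: weight memory scales with parameter count times bits per weight. A sketch (the fixed 1GB runtime overhead is an assumption, and 4.25 bits approximates a 4-bit format with shared scaling factors) lands near the ~11GB figure quoted for the 20B model:

```python
def est_model_ram_gb(n_params_billion: float, bits_per_weight: float,
                     overhead_gb: float = 1.0) -> float:
    """Rough RAM footprint: weight storage at the given quantization,
    plus an assumed fixed overhead for KV cache and runtime buffers."""
    weight_gb = n_params_billion * bits_per_weight / 8  # 1B params at 8 bits = 1 GB
    return weight_gb + overhead_gb

small = est_model_ram_gb(20, 4.25)  # ~11.6 GB: fits comfortably in 64GB
large = est_model_ram_gb(70, 4)     # ~36 GB: possible, but crowds out other apps
```

This also matches the "quit all other apps first" experience with 70B models: the weights alone eat over half of a 64GB machine before the OS and other processes get a share.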
JSR_FDED · 1h ago
Thanks. Do you get any value from those for coding?
simonw · 1h ago
Only when I'm offline (on planes for example) - I've had both Mistral Small and gpt-oss-20b be useful for Python and JavaScript stuff.
If I have an internet connection I'll use GPT-5 or Claude 4 or Gemini 2.5 instead - they're better and they don't need me to dedicate a quarter of my RAM or run down my battery.
giancarlostoro · 5h ago
The author using different ones is why I use Perplexity: I get to try different models, and honestly it's pretty darn decent. It gives me everything in an organized way, I can see all the different links, and all the files it outputs can be downloaded as a simple zip file. It has everything from GPT-5 to DeepSeek R1 and even Grok.
There are other sites similar to Perplexity that host multiple models as well. I have not tried the plethora of others, but I feel like Perplexity does the most to make sure whatever model you pick works right for you, and all its output is usefully catalogued.
faangguyindia · 4h ago
I use Gemini Flash and Pro for pretty much everything. Why? They offer them free to test.
I tried signing up for OpenAI: way too much friction. They start asking for payment before you have even used any free credits; guess what, that's one sure way to lose business.
Same for Claude: I couldn't even get Claude through Vertex, as it's available only in limited regions, and I am in Asia-Pacific right now.
sandreas · 5h ago
This is an interesting overview, thank you. Different tasks, different models, all-day usage, and pretty complete (while still opinionated, which I like).
However, checking the results, my personal overall winner, if I had to pick only ONE, would probably be
deepseek/deepseek-chat-v3-0324
which is a good compromise between fast, cheap, and good :-) Only for specific tasks (write a poem...) would I prefer a thinking model.
thorum · 3h ago
> Six of the eleven picked the same movie
This is surely the greatest weakness of current LLMs for any task needing a spark of creativity.
Timwi · 2h ago
This is definitely something very early LLMs could do that has kind of gotten beaten out of them. I used to ask ChatGPT to simulate a text adventure game, but now if you try that you always get exactly the same one.
sireat · 1h ago
Curious: what kind of prompt gives you the same text adventure game?
Surely it is a question of prompting with some context (in UI mode) or with the additional kicker of temperature (if using the API)?
At the very least, a setup prompt such as "Give me 5 scenarios for a text adventure game" would break the sameness?
There have always been theories that OpenAI and other LLM providers cache some responses; that could be one hypothesis.
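For intuition on the temperature point: sampling divides the logits by T before the softmax, so a low T concentrates probability on the top token (the same opening every time), while a higher T spreads probability over alternatives. A minimal sketch (the logit values here are made up for illustration):

```python
import math
import random

def sample_with_temperature(logits: list[float], temperature: float,
                            rng: random.Random) -> int:
    """Sample a token index from logits after temperature scaling.
    T -> 0 approaches greedy decoding; T > 1 flattens the distribution."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)                                # subtract max for stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    r = rng.random()                               # inverse-CDF sampling
    acc = 0.0
    for i, p in enumerate(probs):
        acc += p
        if r < acc:
            return i
    return len(probs) - 1
```

With near-zero temperature the top index wins essentially every draw, which would produce the "same adventure every time" effect even without any response caching; a higher temperature restores variety.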
karmakaze · 39m ago
I'm now imagining 5 hipster AIs writing those stories, each different in predictable ways.