Show HN: I was curious about spherical helix, ended up making this visualization (visualrambling.space)

For local runs, I made some GGUFs! You need around RAM + VRAM >= 250GB for good perf for dynamic 2bit (2bit MoE, 6-8bit rest) - can also do SSD offloading but it'll be slow.

./llama.cpp/llama-cli -hf unsloth/DeepSeek-V3.1-GGUF:UD-Q2_K_XL -ngl 99 --jinja -ot ".ffn_.*_exps.=CPU"

More details on running + optimal params here: https://docs.unsloth.ai/basics/deepseek-v3.1

hodgehog11 · 4h ago

For reference, here is the terminal-bench leaderboard:

https://www.tbench.ai/leaderboard

Looks like it doesn't get close to GPT-5, Claude 4, or GLM-4.5, but still does reasonably well compared to other open weight models. Benchmarks are rarely the full story though, so time will tell how good it is in practice.

segmondy · 2h ago

garbage benchmark, inconsistent mix of "agent tools" and models. if you wanted to present a meaningful benchmark, the agent tools will stay the same and then we can really compare the models.

there are plenty of other benchmarks that disagree with these, with that said. from my experience most of these benchmarks are trash. use the model yourself, apply your own set of problems and see how well it fairs.

coliveira · 3h ago

My personal experience is that it produces high quality results.

amrrs · 3h ago

Any example or prompt you use to make this statment?

imachine1980_ · 3h ago

I remember asking for quotes about the Spanish conquest of South America because I couldn't remember who said a specific thing. The GPT model started hallucinating quotes on the topic, while DeepSeek responded with, "I don't know a quote about that specific topic, but you might mean this other thing." or something like that then cited a real quote in the same topic, after acknowledging that it wasn't able to find the one I had read in an old book. i don't use it for coding, but for things that are more unique i feel is more precise.

mycall · 2h ago

I wonder if Conway's law is at all responsible for that, in the similarity it is based on; regional trained data which has concept biases which it sends back in response.

sync · 1h ago

I'm doing coreference resolution and this model (w/o thinking) performs at the Gemini 2.5-Pro level (w/ thinking_budget set to -1) at a fraction of the cost.

dr_dshiv · 25m ago

Strong claim there!

guluarte · 3h ago

tbh companies like anthopic, openai, create custom agents for specific benchmarks

amelius · 1h ago

Aren't good benchmarks supposed to be secret?

wkat4242 · 27m ago

This industry is currently burning billions a month. With that much money around I don't think any secrets can exist.

bazmattaz · 3h ago

Do you have a source for this? I’m intrigued

guluarte · 3h ago

https://www-cdn.anthropic.com/07b2a3f9902ee19fe39a36ca638e5a... "we iteratively refine prompting by analyzing failure cases and developing prompts to address them."

seunosewa · 4h ago

The DeepSeek R1 in that list is the old model that's been replaced. Update: Understood.

yorwba · 3h ago

Yes, and 31.3% is given in the announcement as the performance of the new v3.1, which would put it in sixteenth place.

No comments yet

tonyhart7 · 2h ago

Yeah but the pricing is insane, I don't care about SOTA if its not break my bank

YetAnotherNick · 3h ago

Depends on the agent. Rank 5 and 15 are claude 4 sonnet, and this stands close to 15th.

seunosewa · 4h ago

It's a hybrid reasoning model. It's good with tool calls and doesn't think too much about everything, but it regularly uses outdated tool formats randomly instead of the standard JSON format. I guess the V3 training set has a lot of those.

darrinm · 2h ago

Did you try the strict (beta) function calling? https://api-docs.deepseek.com/guides/function_calling

ivape · 3h ago

What formats? I thought the very schema of json is what allows these LLMs to enforce structured outputs at the decoder level? I guess you can do it with any format, but why stray from json?

seunosewa · 3h ago

Sometimes it will randomly generate something like this in the body of the text: ``` <tool_call>executeshell <arg_key>command</arg_key> <arg_value>echo "" >> novels/AI_Voodoo_Romance/chapter-1-a-new-dawn.txt</arg_value> </tool_call> ```

or this: ``` <｜toolcallsbegin｜><｜toolcallbegin｜>executeshell<｜toolsep｜>{"command": "pwd && ls -la"}<｜toolcallend｜><｜toolcallsend｜> ```

Prompting it to use the right format doesn't seem to work. Claude, Gemini, GPT5, and GLM 4.5, don't do that. To accomodate DeepSeek, the tiny agent that I'm building will have to support all the weird formats.

No comments yet

refulgentis · 1h ago

In the modes in APIs, the sampling code essentially "rejects and reinference" any token sampled that wouldn't create valid JSON under a grammar created from the schema. Generally, the training is doing 99% of the work, of course, it's just "strict" means "we'll check it's work to the point a GBNF grammar created from the schema will validate."

One of the funnier info scandals of 2025 has been that only Claude was even close to properly trained on JSON file edits until o3 was released, and even then it needed a bespoke format. Geminis have required using a non-formalized diff format by Aider. Wasn't until June Gemini could do diff-string-in-JSON better than 30% of the time and until GPT-5 that an OpenAI model could. (Though v4a, as OpenAI's bespoke edit format is called, is fine because it at least worked well in tool calls. Geminis was a clown show, you had to post process regular text completions to parse out any diffs)

dragonwriter · 1h ago

> In the modes in APIs, the sampling code essentially "rejects and reinference" any token sampled that wouldn't create valid JSON under a grammar created from the schema.

I thought the APIs in use generally interface with backend systems supporting logit manipulation, so there is no need to reject and reinference anything; its guaranteed right the first time because any token that would be invalid has a 0% chance of being produced.

I guess for the closed commercial systems that's speculative, but all the discussion of the internals of the open source systems I’ve seen has indicated that and I don't know why the closed systems would be less sophisticated.

esafak · 4h ago

It seems behind Qwen3 235B 2507 Reasoning (which I like) and gpt-oss-120B: https://artificialanalysis.ai/models/deepseek-v3-1-reasoning

Pricing: https://openrouter.ai/deepseek/deepseek-chat-v3.1

bigyabai · 4h ago

Those Qwen3 2507 models are the local creme-de-la-creme right now. If you've got any sort of GPU and ~32gb of RAM to play with, the A3B one is great for pair-programming tasks.

indigodaddy · 16m ago

Do we get these good qwen models when using qwen-code CLI tool and authing via qwen.ai account?

decide1000 · 3h ago

I use it on a 24gb gpu Tesla P40. Very happy with the result.

hkt · 2h ago

Out of interest, roughly how many tokens per second do you get on that?

edude03 · 2h ago

Like 4. Definitely single digit. The P40s are slow af

pdimitar · 3h ago

Do you happen to know if it can be run via an eGPU enclosure with f.ex. RTX 5090 inside, under Linux?

I'm considering buying a Linux workstation lately and I want it full AMD. But if I can just plug an NVIDIA card via an eGPU card for self-hosting LLMs then that would be amazing.

oktoberpaard · 3h ago

I’m running Ollama on 2 eGPUs over Thunderbolt. Works well for me. You’re still dealing with an NVDIA device, of course. The connection type is not going to change that hassle.

pdimitar · 3h ago

Thank you for the validation. As much as I don't like NVIDIA's shenanigans on Linux, having a local LLM is very tempting and I might put my ideological problems to rest over it.

Though I have to ask: why two eGPUs? Is the LLM software smart enough to be able to use any combination of GPUs you point it at?

arcanemachiner · 2h ago

Yes, Ollama is very plug-and-play when it comes to multi GPU.

llama.cpp probably is too, but I haven't tried it with a bigger model yet.

gunalx · 3h ago

You would still need drivers and all the stuff difficult with nvidia in linux with a egpu. (Its not nessecarily terrible just suboptimal) Rather just add the second GPU in the Workstation, or just run the llm in your AMD GPU.

pdimitar · 3h ago

Oh, we can run LLMs efficiently with AMD GPUs now? Pretty cool, I haven't been following, thank you.

DarkFuture · 2h ago

I've been running LLM models on my Radeon 7600 XT 16GB for past 2-3 months without issues (Windows 11). I've been using llama.cpp only. The only thing from AMD I installed (apart from latest Radeon drivers) is the "AMD HIP SDK" (very straight forward installer). After unzipping (the zip from GitHub releases page must contain hip-radeon in the name) all I do is this:

llama-server.exe -ngl 99 -m Qwen3-14B-Q6_K.gguf

And then connect to llamacpp via browser to localhost:8080 for the WebUI (its basic but does the job, screenshots can be found on Google). You can connect more advanced interfaces to it because llama.cpp actually has OpenAI-compatible API.

bigyabai · 3h ago

Sure, though you'll be bottlenecked by the interconnect speed if you're tiling between system memory and the dGPU memory. That shouldn't be an issue for the 30B model, but would definitely be an issue for the 480B-sized models.

tomr75 · 3h ago

With qwen code?

xmichael909 · 1h ago

Seems to hallucinate more than any model I've ever worked with in the past 6 months.

dude250711 · 1h ago

Did they "borrow" bad data this time?

dr_dshiv · 22m ago

Cheep!

$0.56 per million tokens in — and $1.68 per million tokens out.

jbellis · 1h ago

About halfway between V3 and Qwen3 Coder.

https://brokk.ai/power-ranking?version=openround-2025-08-20&...

indigodaddy · 12m ago

Is gpt-5 Mini free from any providers?

abtinf · 3h ago

Unrelated, but it would really be nice to have a chart breaking down Price Per Token Per Second for various model, prompt, and hardware combinations.

imranq · 1h ago

There is one: https://pricepertoken.com/

rapind · 57m ago

Claude's Opus pricing is nuts. I'd be surprised if anyone uses it without the top max subscription.

theuurviv467456 · 2h ago

Sweet. I wish there guys weren't bound by the idiotic "nationalist" () bans so that they could do their work unrestricted.

Only idiots who are completely drowned in US's dark propaganda would think this is about anything but keeping China down.

hopelite · 1h ago

This does not make any sense to me. “There”? “‘Nationalist’ () bans” of and by whom?

Dark propaganda opposed to what, light propaganda? The Chinese model being released is about keeping China down?

You seem very animated about this, but you would probably have more success if you tried to clarify this a bit more.

tonyhart7 · 2h ago

Every country acting in its own best interest, US is not unique in this regard

wait until you find out that China also acting the same way toward the rest of the world (surprise pikachu face)

simianparrot · 2h ago

As if the CCP needs help keeping its own people down. Please.

tehjoker · 1h ago

Incredible how "keeping their people down" means leaps in personal wealth and happiness for huge swathes of the population and internal criticism is that it is a "poverty reduction machine" that is too focused.

jamiek88 · 49m ago

Tell that to the Uyghurs if you can get into their concentration camp to have a chat.

jaggs · 46m ago

Yep, comment arrives right on time. Nicely played. :)

GPT-5 (openai.com)

Fight Chat Control (fightchatcontrol.eu)

GitHub is no longer independent at Microsoft after CEO resignation (theverge.com)

I tried every todo app and ended up with a .txt file (al3rez.com)

Claude Sonnet 4 now supports 1M tokens of context (anthropic.com)

I want everything local – Building my offline AI workspace (instavm.io)

Ultrathin business card runs a fluid simulation (github.com)

AWS CEO says using AI to replace junior staff is 'Dumbest thing I've ever heard' (theregister.com)

Streaming services are driving viewers back to piracy (theguardian.com)

Wikipedia loses challenge against Online Safety Act (bbc.com)

Anna's Archive: An Update from the Team (annas-archive.org)

FFmpeg 8.0 adds Whisper support (code.ffmpeg.org)

Emailing a one-time code is worse than passwords (blog.danielh.cc)

Debian 13 “Trixie” (debian.org)

Steve Wozniak: Life to me was never about accomplishment, but about happiness (yro.slashdot.org)

Good system design (seangoedecke.com)

Vibechart (vibechart.net)

Why LLMs can't really build software (zed.dev)

VC-backed company just killed my EU trademark for a small OSS project

Claude Code is all you need (dwyer.co.za)

Show HN: I was curious about spherical helix, ended up making this visualization (visualrambling.space)

Gemma 3 270M: Compact model for hyper-efficient AI (developers.googleblog.com)

AGENTS.md – Open format for guiding coding agents (agents.md)

Nginx introduces native support for ACME protocol (blog.nginx.org)

Copilot broke audit logs, but Microsoft won't tell customers (pistachioapp.com)

Show HN: The current sky at your approximate location, as a CSS gradient (sky.dlazaro.ca)

Claude says “You're absolutely right!” about everything (github.com)

Why are anime catgirls blocking my access to the Linux kernel? (lock.cmpxchg8b.com)

PYX: The next step in Python packaging (astral.sh)

Open hardware desktop 3D printing is dead? (josefprusa.com)

How I code with AI on a budget/free (wuu73.org)

Show HN: Building a web search engine from scratch with 3B neural embeddings (blog.wilsonl.in)

Obsidian Bases (help.obsidian.md)

How we exploited CodeRabbit: From simple PR to RCE and write access on 1M repos (research.kudelskisecurity.com)

Show HN: I built an app to block Shorts and Reels (scrollguard.app)

Mark Zuckerberg freezes AI hiring amid bubble fears (telegraph.co.uk)

Try and (ygdp.yale.edu)

This website is for humans (localghost.dev)

I accidentally became PureGym’s unofficial Apple Wallet developer (drobinin.com)

GPT-5: Key characteristics, pricing and system card (simonwillison.net)

Wikimedia Foundation Challenges UK Online Safety Act Regulations (wikimediafoundation.org)

Web apps in a single, portable, self-updating, vanilla HTML file (hyperclay.com)

OpenFreeMap survived 100k requests per second (blog.hyperknot.com)

Lazy-brush – smooth drawing with mouse or finger (lazybrush.dulnan.net)

OpenMower – An open source lawn mower (github.com)

Jim Lovell, Apollo 13 commander, has died (nasa.gov)

Search all text in New York City (alltext.nyc)

Show HN: Whispering – Open-source, local-first dictation you can trust (github.com)

Ask HN: How can ChatGPT serve 700M users when I can't run one GPT-4 locally?

What's the strongest AI model you can train on a laptop in five minutes? (seangoedecke.com)

DeepSeek-v3.1 Release

Comments (54)