https://www.tbench.ai/leaderboard

Looks like it doesn't get close to GPT-5, Claude 4, or GLM-4.5, but it still does reasonably well compared to other open-weight models. Benchmarks are rarely the full story, though, so time will tell how good it is in practice.
tonyhart7 · 2m ago
Yeah, but the pricing is insane. I don't care about SOTA as long as it doesn't break the bank.

Pricing: https://openrouter.ai/deepseek/deepseek-chat-v3.1
segmondy · 19m ago
Garbage benchmark: an inconsistent mix of "agent tools" and models. If you wanted to present a meaningful benchmark, the agent tools would stay the same, and then we could really compare the models.
There are plenty of other benchmarks that disagree with these. That said, in my experience most of these benchmarks are trash. Use the model yourself, apply your own set of problems, and see how well it fares.
coliveira · 1h ago
My personal experience is that it produces high quality results.
amrrs · 1h ago
Any example or prompt you used to make this statement?
imachine1980_ · 1h ago
I remember asking for quotes about the Spanish conquest of South America because I couldn't remember who said a specific thing. The GPT model started hallucinating quotes on the topic, while DeepSeek responded with something like, "I don't know a quote about that specific topic, but you might mean this other thing," and then cited a real quote on the same topic, after acknowledging that it couldn't find the one I had read in an old book.
I don't use it for coding, but for things that are more unique I feel it's more precise.
guluarte · 1h ago
Tbh, companies like Anthropic and OpenAI create custom agents for specific benchmarks.
The DeepSeek R1 in that list is the old model that's been replaced.
Update: Understood.
yorwba · 1h ago
Yes, and 31.3% is given in the announcement as the performance of the new v3.1, which would put it in sixteenth place.
YetAnotherNick · 1h ago
Depends on the agent. Ranks 5 and 15 are both Claude 4 Sonnet, and this stands close to the 15th.
seunosewa · 1h ago
It's a hybrid reasoning model. It's good with tool calls and doesn't overthink everything, but it randomly falls back to outdated tool-call formats instead of the standard JSON format. I guess the V3 training set has a lot of those.
ivape · 1h ago
What formats? I thought the very schema of JSON is what allows these LLMs to enforce structured outputs at the decoder level. I guess you could do it with any format, but why stray from JSON?

https://dottxt-ai.github.io/outlines/latest/
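That's roughly how libraries like outlines (linked above) do it: at each decoding step, tokens that would break the target grammar get masked out before sampling. A minimal toy sketch, where the vocab, scores, and validator are all hypothetical stand-ins:

```
import math

def constrained_decode_step(logits: list[float], vocab: list[str],
                            prefix: str, is_valid_prefix) -> int:
    """Pick the highest-scoring token whose text keeps the output a valid
    prefix of the target format (e.g. JSON matching a schema)."""
    best_id, best_score = -1, -math.inf
    for token_id, piece in enumerate(vocab):
        if not is_valid_prefix(prefix + piece):
            continue  # grammar-violating token: masked out entirely
        if logits[token_id] > best_score:
            best_id, best_score = token_id, logits[token_id]
    return best_id

# Toy usage: only digits are valid, so "x" can never be emitted.
print(constrained_decode_step([0.1, 0.5, 2.0], ["1", "2", "x"], "", str.isdigit))
```

In practice the validator is compiled from the JSON schema into a state machine over tokens, which is why enforcement can happen at the decoder level at all. A model served over a plain API that doesn't expose this can still drift into other formats, as described below.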
seunosewa · 45m ago
Sometimes it will randomly generate something like this in the body of the text:
```
<tool_call>executeshell
<arg_key>command</arg_key>
<arg_value>echo "" >> novels/AI_Voodoo_Romance/chapter-1-a-new-dawn.txt</arg_value>
</tool_call>
```
or this:
```
<|toolcallsbegin|><|toolcallbegin|>executeshell<|toolsep|>{"command": "pwd && ls -la"}<|toolcallend|><|toolcallsend|>
```
Prompting it to use the right format doesn't seem to work. Claude, Gemini, GPT-5, and GLM 4.5 don't do that. To accommodate DeepSeek, the tiny agent that I'm building will have to support all the weird formats.
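If it helps anyone building a similar agent, here's a rough sketch of normalizing those two stray formats into ordinary tool-call dicts. The regexes are keyed to the exact examples above (function and argument names included) and would need hardening for anything real:

```
import json
import re

# The XML-ish variant: <tool_call>name <arg_key>k</arg_key><arg_value>v</arg_value> ... </tool_call>
XML_STYLE = re.compile(r"<tool_call>(\w+)\s*(.*?)</tool_call>", re.DOTALL)
ARG_PAIR = re.compile(r"<arg_key>(.*?)</arg_key>\s*<arg_value>(.*?)</arg_value>", re.DOTALL)
# The special-token variant: <|toolcallbegin|>name<|toolsep|>{...}<|toolcallend|>
TOKEN_STYLE = re.compile(r"<\|toolcallbegin\|>(\w+)<\|toolsep\|>(\{.*?\})<\|toolcallend\|>", re.DOTALL)

def extract_tool_calls(text: str) -> list[dict]:
    """Return {"name": ..., "arguments": {...}} for every call found in the text."""
    calls = []
    for name, body in XML_STYLE.findall(text):
        calls.append({"name": name, "arguments": dict(ARG_PAIR.findall(body))})
    for name, raw_json in TOKEN_STYLE.findall(text):
        calls.append({"name": name, "arguments": json.loads(raw_json)})
    return calls
```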
Those Qwen3 2507 models are the local crème de la crème right now. If you've got any sort of GPU and ~32 GB of RAM to play with, the A3B one is great for pair-programming tasks.
decide1000 · 54m ago
I use it on a Tesla P40, a 24 GB GPU. Very happy with the results.
hkt · 29m ago
Out of interest, roughly how many tokens per second do you get on that?
edude03 · 21m ago
Like 4. Definitely single digit. The P40s are slow af
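For anyone who wants to sanity-check their own numbers, here's a quick sketch against a local OpenAI-compatible endpoint (the URL and model name are placeholders for whatever llama.cpp or Ollama is serving). It times the whole request, so it slightly understates pure decode speed:

```
import time
import requests

start = time.time()
r = requests.post(
    "http://localhost:8080/v1/chat/completions",  # placeholder local endpoint
    json={
        "model": "local-model",  # placeholder: whatever the server has loaded
        "messages": [{"role": "user", "content": "Write 200 words about GPUs."}],
        "max_tokens": 256,
    },
    timeout=600,
)
elapsed = time.time() - start
tokens = r.json()["usage"]["completion_tokens"]
print(f"{tokens} tokens in {elapsed:.1f}s = {tokens / elapsed:.1f} tok/s")
```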
pdimitar · 1h ago
Do you happen to know if it can be run via an eGPU enclosure with, e.g., an RTX 5090 inside, under Linux?
I've been considering buying a Linux workstation lately, and I want it to be full AMD. But if I can just plug in an NVIDIA card via an eGPU enclosure for self-hosting LLMs, that would be amazing.
oktoberpaard · 49m ago
I'm running Ollama on 2 eGPUs over Thunderbolt. Works well for me. You're still dealing with an NVIDIA device, of course; the connection type is not going to change that hassle.
pdimitar · 46m ago
Thank you for the validation. As much as I don't like NVIDIA's shenanigans on Linux, having a local LLM is very tempting and I might put my ideological problems to rest over it.
Though I have to ask: why two eGPUs? Is the LLM software smart enough to be able to use any combination of GPUs you point it at?
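For what it's worth, llama.cpp will happily split a model across whichever GPUs it can see. The flags below are real llama.cpp options; the 3:1 ratio and model path are arbitrary examples:

CUDA_VISIBLE_DEVICES=0,1 llama-server -m model.gguf -ngl 99 --split-mode layer --tensor-split 3,1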
gunalx · 1h ago
You would still need drivers and all the stuff that's difficult with NVIDIA on Linux, even with an eGPU. (It's not necessarily terrible, just suboptimal.) I'd rather just add the second GPU inside the workstation, or run the LLM on your AMD GPU.
pdimitar · 1h ago
Oh, we can run LLMs efficiently with AMD GPUs now? Pretty cool, I haven't been following, thank you.
DarkFuture · 14m ago
I've been running LLM models on my Radeon 7600 XT 16GB for the past 2-3 months without issues (Windows 11). I've been using llama.cpp only. The only thing from AMD I installed (apart from the latest Radeon drivers) is the "AMD HIP SDK" (a very straightforward installer). After unzipping llama.cpp (the zip from the GitHub releases page must contain "hip-radeon" in the name), all I do is this:
llama-server.exe -ngl 99 -m Qwen3-14B-Q6_K.gguf
And then I connect to llama.cpp in the browser at localhost:8080 for the WebUI (it's basic but does the job; screenshots can be found on Google). You can connect more advanced interfaces to it because llama.cpp actually has an OpenAI-compatible API.
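To illustrate that last point, a minimal sketch pointing the standard openai Python client at the local server (the model name is a placeholder, and llama.cpp ignores the API key):

```
from openai import OpenAI

# llama.cpp's server speaks the OpenAI wire protocol, so the official client works.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed-locally")
reply = client.chat.completions.create(
    model="Qwen3-14B",  # placeholder: the server serves whatever model it loaded
    messages=[{"role": "user", "content": "Hello from llama.cpp!"}],
)
print(reply.choices[0].message.content)
```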
bigyabai · 1h ago
Sure, though you'll be bottlenecked by the interconnect speed if you're tiling between system memory and the dGPU memory. That shouldn't be an issue for the 30B model, but would definitely be an issue for the 480B-sized models.
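A back-of-envelope sketch of why, with all numbers being rough assumptions (~4.5 bits per parameter for a Q4-ish quant, ~5 GB/s usable link bandwidth, 24 GB of VRAM, and the dense worst case of touching every spilled weight once per token; MoE offloading does considerably better than this):

```
LINK_GB_S = 5.0  # assumed usable link bandwidth
VRAM_GB = 24.0   # assumed card memory

for params_b in (30, 480):
    weights_gb = params_b * 4.5 / 8            # rough Q4 weight size in GB
    spill_gb = max(0.0, weights_gb - VRAM_GB)  # what has to live in system RAM
    print(f"{params_b}B: ~{weights_gb:.0f} GB weights, ~{spill_gb:.0f} GB spilled, "
          f"worst case ~{spill_gb / LINK_GB_S:.1f} s/token over the link")
```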
tomr75 · 59m ago
With qwen code?
abtinf · 43m ago
Unrelated, but it would really be nice to have a chart breaking down Price Per Token Per Second for various model, prompt, and hardware combinations.
theuurviv467456 · 20m ago
Sweet. I wish these guys weren't bound by the idiotic "nationalist" () bans so that they could do their work unrestricted.
Only idiots who are completely drowned in the US's dark propaganda would think this is about anything but keeping China down.
simianparrot · 19m ago
As if the CCP needs help keeping its own people down. Please.