For local runs, I made some GGUFs! You need around RAM + VRAM >= 250GB for good perf for dynamic 2bit (2bit MoE, 6-8bit rest) - can also do SSD offloading but it'll be slow.
Was that document almost exclusively written with LLMs? I looked at it last night (~8 hours ago) and it was riddled with mistakes, most egregious was that the "Run with Ollama" section had instructions for how to install Ollama, but then the shell commands were actually running llama.cpp, a mistake probably no human would make.
Do you have any plans on disclosing how much of these docs are written by humans vs not?
Regardless, thanks for the continued release of quants and weights :)
wfn · 1h ago
> but then the shell commands were actually running llama.cpp, a mistake probably no human would make.
But in the docs I see things like
cp llama.cpp/build/bin/llama-* llama.cpp
Wouldn't this explain that? (Didn't look too deep)
danielhanchen · 23h ago
Oh hey sorry the docs are still in construction! Are you referring to merging GGUFs to Ollama - it should work fine! Ie:
Ollama can only allow merged GGUFs (not splitted ones), so hence the command.
All docs are made by humans (primarily my brother and me), just sometimes there might be some typos (sorry in advance)
I'm also uploading Ollama compatible versions directly so ollama run can work (it'll take a few more hours)
pshirshov · 1d ago
By the way, I'm wondering why unsloth (a goddamn python library) tries to run apt-get with sudo (and fails on my nixos). Like how tf we are supposed to use that?
danielhanchen · 1d ago
Oh hey I'm assuming this is for conversion to GGUF after a finetune? If you need to quantize to GGUF Q4_K_M, we have to compile llama.cpp, hence apt-get and compiling llama.cpp within a Python shell.
There is a way to convert to Q8_0, BF16, F16 without compiling llama.cpp, and it's enabled if you use `FastModel` and not on `FastLanguageModel`
Essentially I try to do `sudo apt-get` if it fails then `apt-get` and if all fails, it just fails. We need `build-essential cmake curl libcurl4-openssl-dev`
It seems Unsloth is useful and popular, and you seem responsive and helpful. I'd be down to try to improve this and maybe package Unsloth for Nix as well, if you're up for reviewing and answering questions; seems fun.
Imo it's best to just depend on the required fork of llama.cpp at build time (or not) according to some configuration. Installing things at runtime is nuts (especially if it means modifying the existing install path). But if you don't want to do that, I think this would also be an improvement:
- see if llama.cpp is on the PATH and already has the requisite features
- if not, check /etc/os-release to determine distro
- if unavailable, guess distro class based on the presence of high-level package managers (apt, dnf, yum, zypper, pacman) on the PATH
- bail, explain the problem to the user, give copy/paste-friendly instructions at the end of we managed to figure out where we're running
Is either sort of change potentially agreeable enough that you'd be happy to review it?
(2) Installing via apt-get will ask user's input() for permission
(3) Added an error if failed llama.cpp and provides instructions to manual compile llama.cpp
mFixman · 1d ago
Maybe it's a personal preference, but I don't want external programs to ever touch my package manager, even with permission. Besides, this will fail loudly for systems that don't use `apt-get`.
I would just ask the user to install the package, and _maybe_ show the command line to install it (but never run it).
lucideer · 1d ago
I don't think this should be a personal preference, I think it should be a standard*.
That said, it does at least seem like these recent changes are a large step in the right direction.
---
* in terms of what the standard approach should be, we live in an imperfect world and package management has been done "wrong" in many ecosystems, but in an ideal world I think the "correct" solution here should be:
(1) If it's an end user tool it should be a self contained binary or it should be a system package installed via the package manager (which will manage any ancillary dependencies for you)
(2) If it's a dev tool (which, if you're cloning a cpp repo & building binaries, it is), it should not touch anything systemwide. Whatsoever.
This often results in a README with manual instructions to install deps, but there are many good automated ways to approach this. E.g. for CPP this is a solved problem with Conan Profiles. However that might incur significant maintenace overhead for the Unsloth guys if it's not something the ggml guys support. A dockerised build is another potential option here, though that would still require the user to have some kind of container engine installed, so still not 100% ideal.
danielhanchen · 1d ago
I would like to be in (1) but I'm not a packaging person so I'll need to investigate more :(
(2) I might make the message on installing llama.cpp maybe more informative - ie instead of re-directing people to the docs on manual compilation ie https://docs.unsloth.ai/basics/troubleshooting-and-faqs#how-..., I might actually print out a longer message in the Python cell entirely
That will be nice too, though I was more just referring to simply doing something along the lines of this in your current build:
docker run conanio/gcc11-ubuntu16.04 make clean -C llama.cpp etc etc...
(likely mounting & calling a sh file instead of passing individual commands)
---
Although I do think getting the ggml guys to support Conan (or monkey patching your own llama conanfile in before building) might be an easier route.
danielhanchen · 1d ago
Oh ok I'll take a look at conanfiles as well - sorry I'm not familiar with them!
danielhanchen · 1d ago
Hopefully the solution for now is a compromise if that works? It will show the command as well, so if not accepted, typing no will error out and tell the user on how to install the package
solarkraft · 1d ago
I like it when software does work for me.
Quietly installing stuff at runtime is shady for sure, but why not if I consent?
danielhanchen · 23h ago
Do you think it's ok for permissioning I guess? I might also add a 30 second timer and just bail out if there's no response from the user
danielhanchen · 1d ago
Thanks for the suggestions! Apologies again I'm pretty bad at packaging, so hence the current setup.
3. Agreed on bailing - I was also thinking if doing a Python input() with a 30 second waiting period for apt-get if that's ok? We tell the user we will apt-get some packages (only if apt exists) (no sudo), and after 30 seconds, it'll just error out
4. I will remove sudo immediately (ie now), and temporarily just do (3)
But more than happy to fix this asap - again sorry on me being dumb
mkl · 1d ago
It shouldn't install any packages itself. Just print out a message about the missing packages and your guess of the command to install them, then exit. That way users can run the command themselves if it's appropriate or add the packages to their container build or whatever. People set up machines in a lot of different ways, and automatically installing things is going to mess that up.
danielhanchen · 1d ago
Hmmm so I should get rid of the asking / permissions message?
mkl · 1d ago
Yes, since you won't actually need the permissions.
danielhanchen · 1d ago
Hmmm I'm worried people will really not get on how to install / compile / use the terminal hmmm hence I thought permissions were like a compromise solution
solarkraft · 1d ago
I think that it is, quite a good one, even:
- Determine the command that has to be run by the algorithm above.
This does most of the work a user would have to figure out what has to be installed on their system.
- Ask whether to run the command automatically.
This allows the “software should never install dependencies by itself” crowd to say no and figure out further steps, while allowing people who just want it to work to get on with their task as quickly as possible (who do you think there are more of?).
I think it would be fine to print out the command and force the user to run it themselves, but it would bring little material gain at the cost of some of your users’ peace (“oh no it failed, what is it this time ...”).
danielhanchen · 23h ago
Oh ok! I would say 50% of people manually install llama.cpp and the other 50% want it to be automated
segmondy · 1d ago
Don't listen to this crowd, these are "technical folks". Most of your audience will fail to figure it out. You can provide an option that llama.cpp is missing and give them an option where you auto install it or they can install it themselves and do manual configuration. I personally won't tho.
Computer0 · 23h ago
Who do you think the audience is here if not technical. We are in a discussion about a model that requires over 250gb of ram to run. I don't know a non-technical person with more than 32gb.
pxc · 18h ago
I think most of the people like this in the ML world are extreme specialists (e.g.: bioinformaticians, statisticians, linguists, data scientists) who are "technical" in some ways but aren't really "computer people". They're power users in a sense but they're also prone to strange bouts of computing insanity and/or helplessness.
danielhanchen · 1d ago
I think for a compromise solution I'll allow the permission asking to install. I'll definitely try investigating pre built binaries though
solarkraft · 1d ago
This is an edge case optimization at the cost of 95% of users.
mkl · 1d ago
95% of users probably won't be using Linux. Most of those who are will have no problem installing dependencies. There are too many distributions and ways of setting them up for automated package manager use to be the right thing to do. I have never seen a Python package even try.
Balinares · 1d ago
I'll venture that whoever is going to fine-tune their own models probably already has llama.cpp installed somewhere, or can install if required.
Please, please, never silently attempt to mutate the state of my machine, that is not a good practice at all and will break things more often than it will help because you don't know how the machine is set up in the first place.
danielhanchen · 1d ago
Oh yes so before we install llama.cpp we do an path environment check and if its not defined then it'll install.
But yes agreed there won't be any more random package installs sorry!
Balinares · 1d ago
Thanks for the reply! If I can find the time (that's a pretty big if), I'll try to send a PR to help with the packaging.
danielhanchen · 23h ago
No worries :)
elteto · 1d ago
Dude, this is NEVER ok. What in the world??? A third party LIBRARY running sudo commands? That’s just insane.
You just fail and print a nice error message telling the user exactly what they need to do, including the exact apt command or whatever that they need to run.
(2) Installing via apt-get will ask user's input() for permission
(3) Added an error if failed llama.cpp and provides instructions to manual compile llama.cpp
Again apologies on my dumbness and thanks for pointing it out!
danielhanchen · 1d ago
Yes I had that at the start, but people kept complaining they don't know how to actually run terminal commands, hence the shortcut :(
I was thinking if I can do it during the pip install or via setup.py which will do the apt-get instead.
As a fallback, I'll probably for now remove shell executions and just warn the user
rfoo · 1d ago
IMO the correct thing to do to make these people happy, while being sane, is - do not build llama.cpp on their system. Instead, bundle a portable llama.cpp binary along with unsloth, so that when they install unsloth with `pip` (or `uv`) they get it.
Some people may prefer using whatever llama.cpp in $PATH, it's okay to support that, though I'd say doing so may lead to more confused noob users spam - they may just have an outdated version lurking in $PATH.
Doing so makes unsloth wheel platform-dependent, if this is too much of a burden, then maybe you can just package llama.cpp binary and have it on PyPI, like how scipy guys maintain a https://pypi.org/project/cmake/ on PyPI (yes, you can `pip install cmake`), and then depends on it (maybe in an optional group, I see you already have a lot due to cuda shit).
danielhanchen · 1d ago
Oh yes I was working on providing binaries together with pip - currently we're relying on pyproject.toml, but once we utilize setup.py (I think), using binaries gets much simpler
I'm still working on it, but sadly I'm not a packaging person so progress has been nearly zero :(
ffsm8 · 1d ago
I think you misunderstood rfoos suggestion slightly.
From how I interpreted it, he meant you could create a new python package, this would effectively be the binary you need.
In your current package, you could depend on the new one, and through that - pull in the binary.
This would let you easily decouple your package from the binary,too - so it'd be easy to update the binary to latest even without pushing a new version of your original package
I've maintained release pipelines before and handled packaging in a previous job, but I'm not particularly into the python ecosystem, so take this with a grain of salt: an approach would be
Pip Packages :
* Unsloth: current package, prefers using unsloth-llama, and uses path llama-cpp as fallback (with error msg as final fallback if neither exist, promoting install for unsloth-llama)
* Unsloth-llama: new package which only bundles the llama cpp binary
I was trying to see if I could pre-compile some llama.cpp binaries then save them as a zip file (I'm a noob sorry) - but I definitely need to investigate further on how to do python pip binaries
Don't worry. Don't let the rednecks screaming here affect you. As for one, I'm happy that you have automated this part and sad to see it is going away. People will always complain. It might be reasonable feedback worth acting upon. Don't let their tone distract you though. Some of them are just angry all day.
danielhanchen · 1d ago
Thanks - hopefully the compromise solution ie python input asking for user permissions works ok?
rpdillon · 1d ago
As a guy that would naturally be in the camp of "installing packages is never okay", I also live in the more practical world where people want things to work. I think the compromise you're suggesting is a pretty good one. I think the highest quality implementation here would be.
Try to find prebuilt and download.
See if you can compile from source if a compiler is installed.
If no compiler: prompt to install via sudo apt and explaining why, also give option to abort and have the user install a compiler themselves.
This isn't perfect, but limits the cases where prompting is necessary.
danielhanchen · 23h ago
I'm going to see if I can make prebuilt versions work :) But thanks!
devin · 1d ago
Don't optimize for these people.
danielhanchen · 1d ago
Yep agreed - I primarily thought it was a reasonable "hack", but it's pretty bad security wise, so apologies again.
The current solution hopefully is in between - ie sudo is gone, apt-get will run only after the user agrees by pressing enter, and if it fails, it'll tell the user to read docs on installing llama.cpp
woile · 1d ago
Don't apologize, you are doing amazing work. I appreciate the effort you put.
Usually you don't make assumptions on the host OS, just try to find the things you need and if not, fail, ideally with good feedback. If you want to provide the "hack", you can still do it, but ideally behind a flag, `allow_installation` or something like that. This is, if you want your code to reach broader audiences.
shaan7 · 1d ago
Yep there's no need to apologize, you've been very courteous and took all that feedback constructively. Good stuff :)
danielhanchen · 1d ago
Thank you!
danielhanchen · 1d ago
Thank you! :)
pxc · 1d ago
How unusual is this for the ecosystem?
pshirshov · 1d ago
Unfortunately, Python ecosystem is the Wild West.
lyu07282 · 1d ago
on a meta level its kind of worrying for the ecosystem that there is nothing in PyPI that blocks & bans developers who try to run sudo on setup. I get they don't have the resources to do manual checks, but literally no checks against malicious packages?
danielhanchen · 1d ago
Sadly not - you can run anything within a python shell since there's os system, subprocess popen and exec - it's actually very common for setup.py files where installers execute commands
But I do agree maybe for better security pypi should check for commands and warn
pshirshov · 1d ago
It won't work well if you deal with non ubuntu+cuda combination. Better just fail with a reasonable message.
But I'm working on more cross platform docs as well!
pshirshov · 1d ago
My current solution is to pack llama.cpp as a custom nix formula (the one in nixpkgs has the conversion script broken) and run it myself. I wasn't able to run unsloth on ROCM nor for inference nor for conversion, sticking with peft for now but I'll attempt again to re-package it.
I'm working with the AMD folks to make the process easier, but it looks like first I have to move off from pyproject.toml to setup.py (allows building binaries)
pshirshov · 1d ago
Yes, it's trivial with the pre-built vllm docker, but I need a declarative way to configure my environment. The lack of prebuilt rocm wheels for vllm is the main hindrance for now but I was shocked to see the sudo apt-get in your code. Ideally, llama.cpp should publish their gguf python library and the conversion script to pypi with every release, so you can just add that stuff as a dependency. vllm should start publishing a rocm wheel, after that unsloth would need to start publishing two versions - a cuda one and a rocm one.
danielhanchen · 1d ago
Yes apologies again - yes rocm is still an issue
exe34 · 1d ago
hey fellow crazy person! slight tangent: one thing that helps keep me grounded with "LLMs are doing much more than regurgitation" is watching them try to get things to work on nixos - and hitting every rake on the way to hell!
nixos is such a great way to expose code doing things it shouldn't be doing.
pshirshov · 1d ago
In my experience LLMs can do Nix very well, even the models I run locally. I just instruct them to pull dependencies through flake.nix and use direnv to run stuff.
exe34 · 22h ago
oh yes they do nix very well, but I asked cursor to set up a firecracker vm with networking for exposing a port on the host, and use conda inside to install a certain version of python with some libraries. I asked for a firecracker-vm.nix, a build.sh, a run.sh and a close.sh. it kept trying to run code inside its own fhs-env, which would run, and then when I tried it outside of the fhs, it would fail. I'd paste in the errors and it would without fail say oh let's try the proper nix version of python - which I explicitly didn't want, because I wanted to run conda versions on other machines. I tried to guide it through conda-shell but didn't get very far. in the end I ended up using docker instead, which it did set up without fail.
but when it was failing on my original idea, it kept trying dumb things that weren't really even nix after a while.
danielhanchen · 1d ago
I'm glad someone commented and tried it out - appreciate it immensely - I learnt a lot today :) I'm definitely gonna give nixos a spin as well!
zargon · 1d ago
Thanks for your great work with quants. I would really appreciate UD GGUFs for V3.1-Base (and even more so, GLM-4.5-Base + Air-Base).
danielhanchen · 1d ago
Thanks! Oh base models? Interesting since I normally do only Instruct models - I can take a look though!
azinman2 · 23h ago
It’d also be great if you guys could do a fine tune to run on an 8x80G A/H100. These H200/B200 configs are harder to come by (and much more expensive).
danielhanchen · 23h ago
Unsloth should work on any GPU setup all the way until the old Tesla T4s and the newer B200s :) We're working on a faster and better multi GPU version, but using accelerate / torchrun manually + Unsloth should work out of the box!
azinman2 · 23h ago
I guess I was hoping for you guys to put up these weights. I think they’d be popular for these very large models.
You guys already do a lot for the local LLM community and I appreciate it.
efilife · 1d ago
>250GB, how do you guys run this stuff?
danielhanchen · 1d ago
I'm working on sub 165GB ones!
165GB will need a 24GB GPU + 141GB of RAM for reasonably fast inference or a Mac
tw1984 · 1d ago
for such dynamic 2bit, is there any benchmark results showing how many performance I would give up compared to the original model? thanks.
danielhanchen · 1d ago
Currently no, but I'm running them! Some people on the aider discord are running some benchmarks!
cowpig · 22h ago
@danielhanchen do you publish the benchmarks you run anywhere?
segmondy · 1d ago
if you are running a 2bit quant, you are not giving up performance but gaining 100% performance since the alternative is usually 0%. Smaller quants are for folks who won't be able to run anything at all, so you run the largest you can run relative to your hardware. I for instance often ran Q3_K_L, I don't think of how much performance I'm giving up, but rather how without Q3, I won't be able to run it at all. With that said, for R1, I did some tests against 2 public interfaces and my local Q3 crushed them. The problem with a lot of model providers is we can never be sure what they are serving up and could take shortcuts to maximize profit.
danielhanchen · 23h ago
Oh Q3_K_L as in upcasted embed_tokens + lm_head to Q8_0? I normally do Q4 embed Q6 lm_head - would a Q8_0 be interesting?
linuxftw · 1d ago
That's true only in a vacuum. For example, should I run gpt-oss-20b unquantized or gpt-oss-120b quantaized? Some models have a 70b/30b spread, and that's only across a single base model, where many different models exist at different quants could be compared for different tasks.
jkingsman · 1d ago
Definitely. As a hobbyist, I have yet to put together a good heuristic for better-quant-lower-params vs. smaller-quant-high-params. I've mentally been drawing the line at around q4, but now with IQ quants and improvements in the space I'm not so sure anymore.
linuxftw · 1d ago
Yeah, I've kinda quickly thrown in the towel trying to figure out what's 'best' for smaller memory systems. As things are just moving so quickly, whatever time I invest into that is likely to be for nil.
danielhanchen · 23h ago
For GPT OSS in particular, OpenAI only released the MoEs in MXFP4 (4bit), so the "unquantized" version is 4bit MoE + 16bit attention - I uploaded "16bit" versions to https://huggingface.co/unsloth/gpt-oss-120b-GGUF, and they use 65.6GB whilst MXFP4 uses 63GB, so it's not that much difference - same with GPT OSS 20B
llama.cpp also unfortunately cannot quantize matrices that are not a multiple of 256 (2880)
hodgehog11 · 1d ago
For reference, here is the terminal-bench leaderboard:
Looks like it doesn't get close to GPT-5, Claude 4, or GLM-4.5, but still does reasonably well compared to other open weight models. Benchmarks are rarely the full story though, so time will tell how good it is in practice.
segmondy · 1d ago
garbage benchmark, inconsistent mix of "agent tools" and models. if you wanted to present a meaningful benchmark, the agent tools will stay the same and then we can really compare the models.
there are plenty of other benchmarks that disagree with these, with that said. from my experience most of these benchmarks are trash. use the model yourself, apply your own set of problems and see how well it fairs.
paradite · 1d ago
Hey. I like your roast on benchmarks.
I also publish my own evals on new models (using coding tasks that I curated myself, without tools, rated by human with rubrics). Would love you to check out and give your thoughts:
This industry is currently burning billions a month. With that much money around I don't think any secrets can exist.
noodletheworld · 1d ago
How can a benchmark be secret if you post it to an API to test a model on it?
"We totally promise that when we run your benchmark against our API we won't take the data from it and use to be better at your benchmark next time"
:P
If you want to do it properly you have to avoid any 3rd party hosted model when you test your benchmark, which means you can't have GPT5, claude, etc. on it; and none of the benchmarks want to be 'that guy' who doesn't have all the best models on it.
So no.
They're not secret.
dmos62 · 1d ago
How do you propose that would work? A pipeline that goes through query-response pairs to deduce response quality and then uses the low-quality responses for further training? Wouldn't you need a model that's already smart enough to tell that previous model's responses weren't smart enough? Sounds like a chicken and egg problem.
irthomasthomas · 1d ago
It just means that once you send your test questions to a model API, that company now has your test. So 'private' benchmarks take it on faith that the companies won't look at those requests and tune their models or prompts to beat them.
dmos62 · 1d ago
Sounds a bit presumptious to me. Sure, they have your needle, but they also need a cost-efficient way to find it in their hay stack.
lucianbr · 1d ago
They have quite large amounts of money. I don't think they need to be very cost-efficient. And they also have very smart people, so likely they can figure out a somewhat cost-efficient way. The stakes are high, for them.
noodletheworld · 1d ago
Security through obscurity is not security.
Your api key is linked to your credit card, which is linked to your identity.
…but hey, youre right.
Lets just trust them not to be cheating. Cool.
merelysounds · 1d ago
Would the model owners be able to identify the benchmarking session among many other similar requests?
irthomasthomas · 1d ago
Depends. Something like arc-agi might be easy as it follows a defined format. I would also guess that the usage pattern for someone running a benchmark will be quite distinct from that of a normal user, unless they take specific measures to try to blend in.
coliveira · 1d ago
My personal experience is that it produces high quality results.
amrrs · 1d ago
Any example or prompt you use to make this statment?
imachine1980_ · 1d ago
I remember asking for quotes about the Spanish conquest of South America because I couldn't remember who said a specific thing. The GPT model started hallucinating quotes on the topic, while DeepSeek responded with, "I don't know a quote about that specific topic, but you might mean this other thing." or something like that then cited a real quote in the same topic, after acknowledging that it wasn't able to find the one I had read in an old book.
i don't use it for coding, but for things that are more unique i feel is more precise.
mycall · 1d ago
I wonder if Conway's law is at all responsible for that, in the similarity it is based on; regional trained data which has concept biases which it sends back in response.
valtism · 1d ago
Was that true for GPT-5? They claim it is much better at not hallucinating
sync · 1d ago
I'm doing coreference resolution and this model (w/o thinking) performs at the Gemini 2.5-Pro level (w/ thinking_budget set to -1) at a fraction of the cost.
antman · 1d ago
Nice point. How did you test for coreference resolution? Specific prompt or dataset?
dr_dshiv · 1d ago
Strong claim there!
SV_BubbleTime · 1d ago
Vine is about the only benchmark I think is real.
We made objective systems turn out subjective answers… why the shit would anyone think objective tests would be able to grade them?
seunosewa · 1d ago
The DeepSeek R1 in that list is the old model that's been replaced.
Update: Understood.
yorwba · 1d ago
Yes, and 31.3% is given in the announcement as the performance of the new v3.1, which would put it in sixteenth place.
No comments yet
YetAnotherNick · 1d ago
Depends on the agent. Rank 5 and 15 are claude 4 sonnet, and this stands close to 15th.
tonyhart7 · 1d ago
Yeah but the pricing is insane, I don't care about SOTA if its not break my bank
rsanek · 1d ago
Looks to be the ~same intelligence as gpt-oss-120B, but about 10x slower and 3x more expensive?
With all these things, it depends on your own eval suite. gpt-oss-120b works as well as o4-mini over my evals, which means I can run it via OpenRouter on Cerebras where it's SO DAMN FAST and like 1/5th the price of o4-mini.
indigodaddy · 1d ago
How would you compare gpt-oss-120b to (for coding):
Qwen3-Coder-480B-A35B-Instruct
GLM4.5 Air
Kimi K2
DeepSeek V3 0324 / R1 0528
GPT-5 Mini
Thanks for any feedback!
petesergeant · 1d ago
I’m afraid I don’t use any of those for coding
bigyabai · 23h ago
You're missing out. GLM 4.5 Air and Qwen3 A3B both blow OSS 120B out of the water in my experience.
indigodaddy · 23h ago
Ah good to hear! How about Qwen3-Coder-480B-A35B-Instruct? I believe that is the free Qwen3-coder model on openrouter
mdp2021 · 1d ago
> same intelligence as gpt-oss-120B
Let's hope not, because gpt-oss-120B can be dramatically moronical. I am guessing the MoE contains some very dumb subnets.
Benchmarks can be a starting point, but you really have to see how the results work for you.
okasaki · 1d ago
My experience is that gpt-oss doesn't know much about obscure topics, so if you're using it for anything except puzzles or coding in popular languages, it won't do well as the bigger models.
It's knowledge seems to be lacking even compared to gpt3.
No idea how you'd benchmark this though.
xadhominemx · 1d ago
> My experience is that gpt-oss doesn't know much about obscure topics
That is the point of these small models. Remove the bloat of obscure information (address that with RAG), leaving behind a core “reasoning” skeleton.
okasaki · 1d ago
Yeah I guess. Just wanted to say the size difference might be accounted for by the model knowing more.
Seems more user-friendly to bake it in.
easygenes · 1d ago
Something I was doing informally that seems very effective is asking for details about smaller cities and towns and lesser points of interest around the world. Bigger models tend to have a much better understanding and knowledge base for the more obscure places.
scotty79 · 1d ago
I would really love if they figured out how to train a model that doesn't have any such knowledge baked it, but knows where to look for it. Maybe even has a clever database for that. Knowing this kind of trivia like this consistently of the top of your head is a sign of deranged mind, artificial or not.
bigmadshoe · 1d ago
The problem is that these models can't reason about what they do and do not know, so right now you basically need to tune it to:
1) always look up all trivia, or
2) occasionally look up trivia when it "seems complex" enough.
okasaki · 1d ago
Would that work as well? If I ask a big model to write like Shakespeare it just knows intuitively how to do that. If it didn't and had to look up how to do that, I'm not sure it would do a good job.
petesergeant · 1d ago
I don't think you're necessarily wrong, but your source is currently only showing a single provider. Comparing:
Those Qwen3 2507 models are the local creme-de-la-creme right now. If you've got any sort of GPU and ~32gb of RAM to play with, the A3B one is great for pair-programming tasks.
indigodaddy · 1d ago
Do we get these good qwen models when using qwen-code CLI tool and authing via qwen.ai account?
bigyabai · 23h ago
I'm not sure, probably?
esafak · 21h ago
You do not need qwen-code or qwen.ai to use them; openrouter + opencode suffice.
indigodaddy · 21h ago
Right, I'm aware, was just wondering about that specific scenario.
Do you happen to know if it can be run via an eGPU enclosure with f.ex. RTX 5090 inside, under Linux?
I'm considering buying a Linux workstation lately and I want it full AMD. But if I can just plug an NVIDIA card via an eGPU card for self-hosting LLMs then that would be amazing.
oktoberpaard · 1d ago
I’m running Ollama on 2 eGPUs over Thunderbolt. Works well for me. You’re still dealing with an NVDIA device, of course. The connection type is not going to change that hassle.
pdimitar · 1d ago
Thank you for the validation. As much as I don't like NVIDIA's shenanigans on Linux, having a local LLM is very tempting and I might put my ideological problems to rest over it.
Though I have to ask: why two eGPUs? Is the LLM software smart enough to be able to use any combination of GPUs you point it at?
arcanemachiner · 1d ago
Yes, Ollama is very plug-and-play when it comes to multi GPU.
llama.cpp probably is too, but I haven't tried it with a bigger model yet.
SV_BubbleTime · 1d ago
Even today some progress was released on parallelizing WAN video generation over multiple GPUs. LLMs are way easier to split up.
gunalx · 1d ago
You would still need drivers and all the stuff difficult with nvidia in linux with a egpu. (Its not nessecarily terrible just suboptimal) Rather just add the second GPU in the Workstation, or just run the llm in your AMD GPU.
pdimitar · 1d ago
Oh, we can run LLMs efficiently with AMD GPUs now? Pretty cool, I haven't been following, thank you.
DarkFuture · 1d ago
I've been running LLM models on my Radeon 7600 XT 16GB for past 2-3 months without issues (Windows 11). I've been using llama.cpp only. The only thing from AMD I installed (apart from latest Radeon drivers) is the "AMD HIP SDK" (very straight forward installer). After unzipping (the zip from GitHub releases page must contain hip-radeon in the name) all I do is this:
llama-server.exe -ngl 99 -m Qwen3-14B-Q6_K.gguf
And then connect to llamacpp via browser to localhost:8080 for the WebUI (its basic but does the job, screenshots can be found on Google). You can connect more advanced interfaces to it because llama.cpp actually has OpenAI-compatible API.
Plasmoid2000ad · 1d ago
Yes - I'm running a LM Studio on windows on a 6800xt, and everything works more-or-less out of the box using always using Vulkan llama.cpp on the gpu I believe.
There's also ROCm. That's not working for me in LM Studio at the moment. I used that early last year to get some LLMs and stable diffusion running. As far as I can tell, it was faster before, but Vulkan implementations have caught up or something - so much the mucking about isn't often worth it. I believe ROCm is hit or miss for a lot of people, especially on windows.
bavell · 1d ago
IDK about "efficiently" but we've been able to run llms locally with AMD for 1.5-2 years now
green7ea · 22h ago
llama.cpp and lmstudio have a Vulkan backend which is pretty fast. I'm using it to run models on a Strix Halo laptop and it works pretty well.
bigyabai · 1d ago
Sure, though you'll be bottlenecked by the interconnect speed if you're tiling between system memory and the dGPU memory. That shouldn't be an issue for the 30B model, but would definitely be an issue for the 480B-sized models.
decide1000 · 1d ago
I use it on a 24gb gpu Tesla P40. Very happy with the result.
hkt · 1d ago
Out of interest, roughly how many tokens per second do you get on that?
edude03 · 1d ago
Like 4. Definitely single digit. The P40s are slow af
coolspot · 1d ago
P40 has memory bandwidth of 346GB/s which means it should be able to do around 14+ t/s running a 24 GB model+context.
tomr75 · 1d ago
With qwen code?
epolanski · 1d ago
I too like Qwen a lot, it's one of the best models for programming, I generally use it via the chat.
seunosewa · 1d ago
It's a hybrid reasoning model. It's good with tool calls and doesn't think too much about everything, but it regularly uses outdated tool formats randomly instead of the standard JSON format. I guess the V3 training set has a lot of those.
What formats? I thought the very schema of json is what allows these LLMs to enforce structured outputs at the decoder level? I guess you can do it with any format, but why stray from json?
seunosewa · 1d ago
Sometimes it will randomly generate something like this in the body of the text:
```
<tool_call>executeshell
<arg_key>command</arg_key>
<arg_value>echo "" >> novels/AI_Voodoo_Romance/chapter-1-a-new-dawn.txt</arg_value>
</tool_call>
```
or this:
```
<|toolcallsbegin|><|toolcallbegin|>executeshell<|toolsep|>{"command": "pwd && ls -la"}<|toolcallend|><|toolcallsend|>
```
Prompting it to use the right format doesn't seem to work. Claude, Gemini, GPT5, and GLM 4.5, don't do that. To accomodate DeepSeek, the tiny agent that I'm building will have to support all the weird formats.
irthomasthomas · 1d ago
Can't you use logit bias to help with this? Might depend how they are tokenized.
ilaksh · 1d ago
Maybe you have your temperature turned up too high.
refulgentis · 1d ago
In the modes in APIs, the sampling code essentially "rejects and reinference" any token sampled that wouldn't create valid JSON under a grammar created from the schema. Generally, the training is doing 99% of the work, of course, it's just "strict" means "we'll check it's work to the point a GBNF grammar created from the schema will validate."
One of the funnier info scandals of 2025 has been that only Claude was even close to properly trained on JSON file edits until o3 was released, and even then it needed a bespoke format. Geminis have required using a non-formalized diff format by Aider. Wasn't until June Gemini could do diff-string-in-JSON better than 30% of the time and until GPT-5 that an OpenAI model could. (Though v4a, as OpenAI's bespoke edit format is called, is fine because it at least worked well in tool calls. Geminis was a clown show, you had to post process regular text completions to parse out any diffs)
dragonwriter · 1d ago
> In the modes in APIs, the sampling code essentially "rejects and reinference" any token sampled that wouldn't create valid JSON under a grammar created from the schema.
I thought the APIs in use generally interface with backend systems supporting logit manipulation, so there is no need to reject and reinference anything; its guaranteed right the first time because any token that would be invalid has a 0% chance of being produced.
I guess for the closed commercial systems that's speculative, but all the discussion of the internals of the open source systems I’ve seen has indicated that and I don't know why the closed systems would be less sophisticated.
refulgentis · 1d ago
I maintain a cross-platform llama.cpp client - you're right to point out that generally we expect nuking logits can take care of it.
There is a substantial performance cost to nuking, the open source internals discussion may have glossed over that for clarity (see github.com/llama.cpp/... below). The cost is very high, default in API* is not artificially lower other logits, and only do that if the first inference attempt yields a token invalid in the compiled grammar.
Similarly, I was hoping to be on target w/r/t to what strict mode is in an API, and am sort of describing the "outer loop" of sampling
* blissfully, you do not have to implement it manually anymore - it is a parameter in the sampling params member of the inference params
* "the grammar constraints applied on the full vocabulary can be very taxing. To improve performance, the grammar can be applied only to the sampled token..and nd only if the token doesn't fit the grammar, the grammar constraints are applied to the full vocabulary and the token is resampled." https://github.com/ggml-org/llama.cpp/blob/54a241f505d515d62...
7thpower · 1d ago
This is a basic question but maybe you can help: what is a good resource to use to understand how to take advantage of logits?
For OpenAI, you can just pass in the json_schema to activate it, no library needed. For direct LLM interfacing you will need to host your own LLM or use a cloud provider that allows you too hook in, but someone else may need to correct me on this.
If anyone is using anything other than Outlines, please let us know.
7thpower · 1h ago
Thank you!
dragonwriter · 1d ago
Thanks for the explanation!
dsign · 1d ago
Some of it is in Kagi already. Impressive from both DeepSeek and Kagi.
not sure if its just chat.deepseek.com but one strange thing I've noticed is that now it replies to like 90% of your questions with "Of course.", even when it doesnt fit the prompt at all. maybe it's the backend injecting it to be more obedient? but you can tell it `don't begin the reply to this with "of" ending "course"` and it will listen. it's very strange
Some people on reddit (very reliable source I know) are saying it was trained on a lot of Gemini and I can see that. for example it does that annoying thing gemini does now where when you use slang or really any informal terms it puts them in quotes in its reply
edg5000 · 1d ago
> for example it does that annoying thing gemini does now where when you use slang or really any informal terms it puts them in quotes in its reply
Haven´t used Gemini much, but the time I used it, it felt very academic and theoretical compared to Opus 4. So that seems to fit. But I'll have to do more evaluation of the non-Claude models to get a better idea of the differences.
pradn · 1d ago
All this points to "personality" being a big -- and sticky -- selling point for consumer-facing chat bots. People really did like the chatty, emoji-filled persona of the previous ChatGPT models. So OpenAI was ~forced to adjust GPT-5 to be closer to that style.
It raises a funny "innovator's dilemma" that might happen. Where an incumbent has to serve chatty consumers, and therefore gets little technical/professional training data. And a more sober workplace chatbot provider is able to advance past the incumbent because they have better training data. Or maybe in a more subtle way, chatbot personas give you access to varying market segments, and varying data flywheels.
xmichael909 · 1d ago
Seems to hallucinate more than any model I've ever worked with in the past 6 months.
Leynos · 1d ago
DeepSeek is bad for hallucinations in my experience. I wouldn't trust its output for anything serious without heavy grounding. It's great for fantastical fiction though. It also excels at giving characters "agency".
bgilroy26 · 1d ago
Where would you go to find people posting their AI generated fiction? I haven't been able to find it on Reddit
1gn15 · 1d ago
AO3 has several tags for it.
bgilroy26 · 1d ago
I should have said, I am looking for posted chat logs where the prompts are shared as well. I really enjoy the process of making stories with AI and I am curious to see how others do the same thing.
Leynos · 23h ago
Look at NovelCrafter.
They have a great Discord community where people share their prompts and workflows.
There is also the WritingWithAI subreddit.
bgilroy26 · 23h ago
Thank you!
CamperBob2 · 1d ago
Amazon, Barnes & Noble, Powell's, the usual places.
energy123 · 1d ago
What context length did you use?
dude250711 · 1d ago
Did they "borrow" bad data this time?
d4rkp4ttern · 1d ago
It’s a very smart move for DeepSeek to put out an Anthropic-compatible API, similar to Kimi-k2, GLM4.5 (Puzzled as to why Qwen didn’t do this). You can set up a simple function in your .zhsrc to run Claude-Code with these models:
Wow thanks! I just ran into my claude code session limit like an hour ago and tried the method you linked and added 10 CNY to a deepseek api account and an hour later i've got 7.77 CNY left and have used 3.3 million tokens.
I'm not confident enough to say it's as good as claude opus or even sonnet, but it seems not bad!
I did run into an api error when my context exceeded deepseek's 128k window and had to manually compact the context.
jodleif · 1d ago
Qwen have their own competitor to Claude Code.
vitaflo · 1d ago
Sad to see the off peak discount go. I was able to crank tokens like crazy and not have it cost anything. That said the pricing is still very very good so I can't complain too much.
abtinf · 1d ago
Unrelated, but it would really be nice to have a chart breaking down Price Per Token Per Second for various model, prompt, and hardware combinations.
Claude's Opus pricing is nuts. I'd be surprised if anyone uses it without the top max subscription.
tmoravec · 1d ago
FWIW I have the €20 Pro plan and exchange maybe 20 messages with Opus (with thinking) every day, including one weeks-long conversation. Plus a few dozen Sonnet tasks and occasionally light weight CC.
I'm not a programmer, though - engineering manager.
jjani · 1d ago
Sure I do, but not as part of any tools, just for one-off conversations where I know it's going to be the best out there. For tasks where reasoning helps little to none, it's often still number one.
memothon · 1d ago
Some people have startup credits
guerrilla · 1d ago
So, is the output price there why most models are extremely verbose? Is it just a ploy to make extra cash? It's super annoying that I have to constantly tell it to be more and more concise.
diggan · 1d ago
> It's super annoying that I have to constantly tell it to be more and more concise.
While system promting is the easy way of limiting the output in a somewhat predictable manner, have you tried setting `max_tokens` when doing inference? For me that works very well for constraining the output, if you set it to 100 you get very short answers while if you set it to 10,000 you can very long responses.
fariszr · 1d ago
Is it foot at tool use?
For me tool use is table stakes, if a model can't use tools then its almost useless.
snippai · 1d ago
Looks quite competitive among open-weight models, but I guess still behind GPT-5 or Claude a lot.
caycep · 21h ago
this might be OT and covered somewhere else but what's the latest/greatest on these models and their effect on the linguistics field, vs. what does the latest and greatness in linguistics feel about these models?
CuriouslyC · 1d ago
Cries in 128k context. Probably will be a good orchestrator though, can always delegate to Gemini.
donbreo · 1d ago
It still cant name all the states in India
Leynos · 1d ago
That's interesting. I am curious about the extent of the training data in these models.
I asked Kimi K2 for an account of growing up in my home town in Scotland, and it was ridiculously accurate. I then asked it to do the same for a similarly sized town in Kerala. ChatGPT suggested that while it was a good approximation, K2 got some of the specifics wrong.
dr_dshiv · 1d ago
Cheep!
$0.56 per million tokens in — and $1.68 per million tokens out.
NiekvdMaas · 1d ago
That's actually a big bump from the previous pricing: $0.27/$1.10
kenmacd · 1d ago
And unfortunately no more half price 8-hours a day either :(
manishsharan · 1d ago
The next cheapest and capable model is GLM 4.5 at $0.6 per million tokens in and $2.2 per million tokens out. Glad to see DeepSeek is still be the value king.
But I am sti disappointed with the price increase.
just saw this on Chinese internet - deepseek officially mentioned that v3.1 is trained using UE8M0 FP8 as that is the FP8 to be supported by the next gen Chinese AI chip. so basically -
some Chinese next gen AI chips is coming, deepseek is working with them to get its flagship model trained using such domestic chips.
interesting time ahead! just imagine what it could do to NVIDIA share price when deepseek releases a SOTA new model trained without using NVIDIA chips.
Alifatisk · 1d ago
Time to short Nvidia?
hangonhn · 23h ago
No because people never really talk about the quantity of the alternatives -- i.e. Huawei Ascent. Even if Huawei can match the quality, their yields are still abysmal. The numbers I've heard are in the hundreds of thousands vs. millions by Nvidia. In the near future, Nvidia's dominance is pretty secure. The only thing that can threaten it is if this whole AI thing isn't worth what some people imagined it is worth and people start to realize this.
ychan268 · 23h ago
No evidence v3.1 is trained on Chinese chips(they said very ambiguously, only said they adapted the model for Chinese chips, could be training, could be inference)
Anyway, from my experience, if China really has advanced AI chips for SOTA model, I am sure propaganda machine will go all out, look how they boasted Huawei CPU that’s two generations behind Qualcomm and TSMC
Narciss · 1d ago
V interesting, thanks for sharing
aussieguy1234 · 1d ago
They say the SWE bench verified score is 66%. Claude Sonnet 4 is 67%. Not sure if the 1% difference here is statistically significant or not.
I'll have to see how things go with this model after a week, once the hype has died down.
loog5566 · 1d ago
I'm doing this model
hereme888 · 1d ago
Reminder DeepSeek is a Chinese company whose headstart is attributed to stealing IP from American companies. Without the huge theft, they'd be nowhere.
pdntspa · 1d ago
As if those american companies played fair with training their AIs
It's theft all the way down, son
andreashaerter · 1d ago
I can't say whether those claims are true. But even if they were, it feels selective. Every major AI company trained on oceans of data they didn't create or own. The whole field was built on "borrowing" IP, open-source code, academic papers, datasets, art, text, you name it.
Drawing the line only now... saying this is where copying stops being okay doesn't seem very fair. No AI company is really in a position to whine about it from my POV (ignoring any lawyer POV). Cue the world's smallest violin
bobro · 1d ago
Can you contrast this with Western companies? What are the Chinese companies stealing that Western companies aren’t? Do you mean tech or content?
hereme888 · 1d ago
Ethics of Chinese vs. Western companies? Everything. I'm sure you're aware of how many hundreds of $billions of American IP are stolen by Chinese companies.
bobro · 1d ago
I’m not asking broadly about difference in ethics. I’m asking specifically about IP theft in the AI space.
hereme888 · 1d ago
I'm not aware of any proven IP theft by American companies in the AI space. Many pending legal challenges. None yet proven.
bobro · 22h ago
Alright. So there’s proven IP theft by the Chinese companies with completed legal proceedings?
hereme888 · 17h ago
With completed legal proceedings, at least two cases: Xiaolang Zhang was sentenced in 2024 for stealing Apple's AI autonomous vehicle tech for Chinese AI company XPeng. Xiang Haitao was sentenced in 2022 for stealing Monsanto's AI predictive algorithm for a Chinese research institute.
nancyminusone · 1d ago
Most ironic comment I've yet laid eyes on.
computerex · 1d ago
I find it hilarious you felt the need to make this comment in defense of American LLMs. You know that American LLMs aren’t trained ethically either, right? Many people’s data was used for training without their permission.
BTW DeepSeek has contributed a lot, with actual white papers describing in detail their optimizations. How are the rest of the American AI labs doing in contributing research and helping one another advance the field?
replete · 1d ago
Reminder that OpenAI is an American company whose headstart is attributed to stealing copyrighted material from everyone else. Without the huge theft, they'd be nowhere.
hereme888 · 1d ago
Last I checked, as it concerns the training of their models, all legal challenges are pending. No theft has yet been proven, as they used publicly available data.
torginus · 1d ago
In contrast to your legally watertight accusations.
skinnymuch · 19h ago
Do you think the whole world follows America's legal system? Western and American exceptionalists...
dghlsakjg · 1d ago
Legal != ethical
pphysch · 1d ago
If an American company did this, it would be "innovative bootstrapping". Yawn.
./llama.cpp/llama-cli -hf unsloth/DeepSeek-V3.1-GGUF:UD-Q2_K_XL -ngl 99 --jinja -ot ".ffn_.*_exps.=CPU"
More details on running + optimal params here: https://docs.unsloth.ai/basics/deepseek-v3.1
Was that document almost exclusively written with LLMs? I looked at it last night (~8 hours ago) and it was riddled with mistakes, most egregious was that the "Run with Ollama" section had instructions for how to install Ollama, but then the shell commands were actually running llama.cpp, a mistake probably no human would make.
Do you have any plans on disclosing how much of these docs are written by humans vs not?
Regardless, thanks for the continued release of quants and weights :)
But in the docs I see things like
Wouldn't this explain that? (Didn't look too deep)``` ./llama.cpp/llama-gguf-split --merge \ DeepSeek-V3.1-GGUF/DeepSeek-V3.1-UD-Q2_K_XL/DeepSeek-V3.1-UD-Q2_K_XL-00001-of-00006.gguf \ merged_file.gguf ```
Ollama can only allow merged GGUFs (not splitted ones), so hence the command.
All docs are made by humans (primarily my brother and me), just sometimes there might be some typos (sorry in advance)
I'm also uploading Ollama compatible versions directly so ollama run can work (it'll take a few more hours)
There is a way to convert to Q8_0, BF16, F16 without compiling llama.cpp, and it's enabled if you use `FastModel` and not on `FastLanguageModel`
Essentially I try to do `sudo apt-get` if it fails then `apt-get` and if all fails, it just fails. We need `build-essential cmake curl libcurl4-openssl-dev`
See https://github.com/unslothai/unsloth-zoo/blob/main/unsloth_z...
Imo it's best to just depend on the required fork of llama.cpp at build time (or not) according to some configuration. Installing things at runtime is nuts (especially if it means modifying the existing install path). But if you don't want to do that, I think this would also be an improvement:
Is either sort of change potentially agreeable enough that you'd be happy to review it?(1) Removed and disabled sudo
(2) Installing via apt-get will ask user's input() for permission
(3) Added an error if failed llama.cpp and provides instructions to manual compile llama.cpp
I would just ask the user to install the package, and _maybe_ show the command line to install it (but never run it).
That said, it does at least seem like these recent changes are a large step in the right direction.
---
* in terms of what the standard approach should be, we live in an imperfect world and package management has been done "wrong" in many ecosystems, but in an ideal world I think the "correct" solution here should be:
(1) If it's an end user tool it should be a self contained binary or it should be a system package installed via the package manager (which will manage any ancillary dependencies for you)
(2) If it's a dev tool (which, if you're cloning a cpp repo & building binaries, it is), it should not touch anything systemwide. Whatsoever.
This often results in a README with manual instructions to install deps, but there are many good automated ways to approach this. E.g. for CPP this is a solved problem with Conan Profiles. However that might incur significant maintenace overhead for the Unsloth guys if it's not something the ggml guys support. A dockerised build is another potential option here, though that would still require the user to have some kind of container engine installed, so still not 100% ideal.
(2) I might make the message on installing llama.cpp maybe more informative - ie instead of re-directing people to the docs on manual compilation ie https://docs.unsloth.ai/basics/troubleshooting-and-faqs#how-..., I might actually print out a longer message in the Python cell entirely
Yes we're working on Docker! https://hub.docker.com/r/unsloth/unsloth
That will be nice too, though I was more just referring to simply doing something along the lines of this in your current build:
(likely mounting & calling a sh file instead of passing individual commands)---
Although I do think getting the ggml guys to support Conan (or monkey patching your own llama conanfile in before building) might be an easier route.
Quietly installing stuff at runtime is shady for sure, but why not if I consent?
1. So I added a `check_llama_cpp` which checks if llama.cpp does exist and it'll use the prebuilt one https://github.com/unslothai/unsloth-zoo/blob/main/unsloth_z...
2. Yes I like the idea of determining distro
3. Agreed on bailing - I was also thinking if doing a Python input() with a 30 second waiting period for apt-get if that's ok? We tell the user we will apt-get some packages (only if apt exists) (no sudo), and after 30 seconds, it'll just error out
4. I will remove sudo immediately (ie now), and temporarily just do (3)
But more than happy to fix this asap - again sorry on me being dumb
- Determine the command that has to be run by the algorithm above.
This does most of the work a user would have to figure out what has to be installed on their system.
- Ask whether to run the command automatically.
This allows the “software should never install dependencies by itself” crowd to say no and figure out further steps, while allowing people who just want it to work to get on with their task as quickly as possible (who do you think there are more of?).
I think it would be fine to print out the command and force the user to run it themselves, but it would bring little material gain at the cost of some of your users’ peace (“oh no it failed, what is it this time ...”).
Please, please, never silently attempt to mutate the state of my machine, that is not a good practice at all and will break things more often than it will help because you don't know how the machine is set up in the first place.
But yes agreed there won't be any more random package installs sorry!
You just fail and print a nice error message telling the user exactly what they need to do, including the exact apt command or whatever that they need to run.
(1) Removed and disabled sudo
(2) Installing via apt-get will ask user's input() for permission
(3) Added an error if failed llama.cpp and provides instructions to manual compile llama.cpp
Again apologies on my dumbness and thanks for pointing it out!
I was thinking if I can do it during the pip install or via setup.py which will do the apt-get instead.
As a fallback, I'll probably for now remove shell executions and just warn the user
Some people may prefer using whatever llama.cpp in $PATH, it's okay to support that, though I'd say doing so may lead to more confused noob users spam - they may just have an outdated version lurking in $PATH.
Doing so makes unsloth wheel platform-dependent, if this is too much of a burden, then maybe you can just package llama.cpp binary and have it on PyPI, like how scipy guys maintain a https://pypi.org/project/cmake/ on PyPI (yes, you can `pip install cmake`), and then depends on it (maybe in an optional group, I see you already have a lot due to cuda shit).
I'm still working on it, but sadly I'm not a packaging person so progress has been nearly zero :(
From how I interpreted it, he meant you could create a new python package, this would effectively be the binary you need.
In your current package, you could depend on the new one, and through that - pull in the binary.
This would let you easily decouple your package from the binary,too - so it'd be easy to update the binary to latest even without pushing a new version of your original package
I've maintained release pipelines before and handled packaging in a previous job, but I'm not particularly into the python ecosystem, so take this with a grain of salt: an approach would be
Pip Packages :
I was trying to see if I could pre-compile some llama.cpp binaries then save them as a zip file (I'm a noob sorry) - but I definitely need to investigate further on how to do python pip binaries
Try to find prebuilt and download.
See if you can compile from source if a compiler is installed.
If no compiler: prompt to install via sudo apt and explaining why, also give option to abort and have the user install a compiler themselves.
This isn't perfect, but limits the cases where prompting is necessary.
The current solution hopefully is in between - ie sudo is gone, apt-get will run only after the user agrees by pressing enter, and if it fails, it'll tell the user to read docs on installing llama.cpp
Usually you don't make assumptions on the host OS, just try to find the things you need and if not, fail, ideally with good feedback. If you want to provide the "hack", you can still do it, but ideally behind a flag, `allow_installation` or something like that. This is, if you want your code to reach broader audiences.
But I do agree maybe for better security pypi should check for commands and warn
But I'm working on more cross platform docs as well!
I'm working with the AMD folks to make the process easier, but it looks like first I have to move off from pyproject.toml to setup.py (allows building binaries)
nixos is such a great way to expose code doing things it shouldn't be doing.
but when it was failing on my original idea, it kept trying dumb things that weren't really even nix after a while.
You guys already do a lot for the local LLM community and I appreciate it.
165GB will need a 24GB GPU + 141GB of RAM for reasonably fast inference or a Mac
llama.cpp also unfortunately cannot quantize matrices that are not a multiple of 256 (2880)
https://www.tbench.ai/leaderboard
Looks like it doesn't get close to GPT-5, Claude 4, or GLM-4.5, but still does reasonably well compared to other open weight models. Benchmarks are rarely the full story though, so time will tell how good it is in practice.
there are plenty of other benchmarks that disagree with these, with that said. from my experience most of these benchmarks are trash. use the model yourself, apply your own set of problems and see how well it fairs.
I also publish my own evals on new models (using coding tasks that I curated myself, without tools, rated by human with rubrics). Would love you to check out and give your thoughts:
Example recent one on GPT-5:
https://eval.16x.engineer/blog/gpt-5-coding-evaluation-under...
All results:
https://eval.16x.engineer/evals/coding
I don't consider myself super special. I think it should be doable to create a benchmark that beats me having to test every single new model.
"We totally promise that when we run your benchmark against our API we won't take the data from it and use to be better at your benchmark next time"
:P
If you want to do it properly you have to avoid any 3rd party hosted model when you test your benchmark, which means you can't have GPT5, claude, etc. on it; and none of the benchmarks want to be 'that guy' who doesn't have all the best models on it.
So no.
They're not secret.
Your api key is linked to your credit card, which is linked to your identity.
…but hey, youre right.
Lets just trust them not to be cheating. Cool.
We made objective systems turn out subjective answers… why the shit would anyone think objective tests would be able to grade them?
No comments yet
https://artificialanalysis.ai/models/deepseek-v3-1-reasoning
Qwen3-Coder-480B-A35B-Instruct
GLM4.5 Air
Kimi K2
DeepSeek V3 0324 / R1 0528
GPT-5 Mini
Thanks for any feedback!
Let's hope not, because gpt-oss-120B can be dramatically moronical. I am guessing the MoE contains some very dumb subnets.
Benchmarks can be a starting point, but you really have to see how the results work for you.
It's knowledge seems to be lacking even compared to gpt3.
No idea how you'd benchmark this though.
That is the point of these small models. Remove the bloat of obscure information (address that with RAG), leaving behind a core “reasoning” skeleton.
Seems more user-friendly to bake it in.
https://openrouter.ai/openai/gpt-oss-120b and https://openrouter.ai/deepseek/deepseek-chat-v3.1 for the same providers is probably better, although gpt-oss-120b has been around long enough to have more providers, and presumably for hosters to get comfortable with it / optimize hosting of it.
Pricing: https://openrouter.ai/deepseek/deepseek-chat-v3.1
I'm considering buying a Linux workstation lately and I want it full AMD. But if I can just plug an NVIDIA card via an eGPU card for self-hosting LLMs then that would be amazing.
Though I have to ask: why two eGPUs? Is the LLM software smart enough to be able to use any combination of GPUs you point it at?
llama.cpp probably is too, but I haven't tried it with a bigger model yet.
llama-server.exe -ngl 99 -m Qwen3-14B-Q6_K.gguf
And then connect to llamacpp via browser to localhost:8080 for the WebUI (its basic but does the job, screenshots can be found on Google). You can connect more advanced interfaces to it because llama.cpp actually has OpenAI-compatible API.
There's also ROCm. That's not working for me in LM Studio at the moment. I used that early last year to get some LLMs and stable diffusion running. As far as I can tell, it was faster before, but Vulkan implementations have caught up or something - so much the mucking about isn't often worth it. I believe ROCm is hit or miss for a lot of people, especially on windows.
or this: ``` <|toolcallsbegin|><|toolcallbegin|>executeshell<|toolsep|>{"command": "pwd && ls -la"}<|toolcallend|><|toolcallsend|> ```
Prompting it to use the right format doesn't seem to work. Claude, Gemini, GPT5, and GLM 4.5, don't do that. To accomodate DeepSeek, the tiny agent that I'm building will have to support all the weird formats.
One of the funnier info scandals of 2025 has been that only Claude was even close to properly trained on JSON file edits until o3 was released, and even then it needed a bespoke format. Geminis have required using a non-formalized diff format by Aider. Wasn't until June Gemini could do diff-string-in-JSON better than 30% of the time and until GPT-5 that an OpenAI model could. (Though v4a, as OpenAI's bespoke edit format is called, is fine because it at least worked well in tool calls. Geminis was a clown show, you had to post process regular text completions to parse out any diffs)
I thought the APIs in use generally interface with backend systems supporting logit manipulation, so there is no need to reject and reinference anything; its guaranteed right the first time because any token that would be invalid has a 0% chance of being produced.
I guess for the closed commercial systems that's speculative, but all the discussion of the internals of the open source systems I’ve seen has indicated that and I don't know why the closed systems would be less sophisticated.
There is a substantial performance cost to nuking, the open source internals discussion may have glossed over that for clarity (see github.com/llama.cpp/... below). The cost is very high, default in API* is not artificially lower other logits, and only do that if the first inference attempt yields a token invalid in the compiled grammar.
Similarly, I was hoping to be on target w/r/t to what strict mode is in an API, and am sort of describing the "outer loop" of sampling
* blissfully, you do not have to implement it manually anymore - it is a parameter in the sampling params member of the inference params
* "the grammar constraints applied on the full vocabulary can be very taxing. To improve performance, the grammar can be applied only to the sampled token..and nd only if the token doesn't fit the grammar, the grammar constraints are applied to the full vocabulary and the token is resampled." https://github.com/ggml-org/llama.cpp/blob/54a241f505d515d62...
For OpenAI, you can just pass in the json_schema to activate it, no library needed. For direct LLM interfacing you will need to host your own LLM or use a cloud provider that allows you too hook in, but someone else may need to correct me on this.
If anyone is using anything other than Outlines, please let us know.
Some people on reddit (very reliable source I know) are saying it was trained on a lot of Gemini and I can see that. for example it does that annoying thing gemini does now where when you use slang or really any informal terms it puts them in quotes in its reply
Haven´t used Gemini much, but the time I used it, it felt very academic and theoretical compared to Opus 4. So that seems to fit. But I'll have to do more evaluation of the non-Claude models to get a better idea of the differences.
It raises a funny "innovator's dilemma" that might happen. Where an incumbent has to serve chatty consumers, and therefore gets little technical/professional training data. And a more sober workplace chatbot provider is able to advance past the incumbent because they have better training data. Or maybe in a more subtle way, chatbot personas give you access to varying market segments, and varying data flywheels.
They have a great Discord community where people share their prompts and workflows.
There is also the WritingWithAI subreddit.
https://github.com/pchalasani/claude-code-tools/tree/main?ta...
I'm not confident enough to say it's as good as claude opus or even sonnet, but it seems not bad!
I did run into an api error when my context exceeded deepseek's 128k window and had to manually compact the context.
I'm not a programmer, though - engineering manager.
While system promting is the easy way of limiting the output in a somewhat predictable manner, have you tried setting `max_tokens` when doing inference? For me that works very well for constraining the output, if you set it to 100 you get very short answers while if you set it to 10,000 you can very long responses.
I asked Kimi K2 for an account of growing up in my home town in Scotland, and it was ridiculously accurate. I then asked it to do the same for a similarly sized town in Kerala. ChatGPT suggested that while it was a good approximation, K2 got some of the specifics wrong.
$0.56 per million tokens in — and $1.68 per million tokens out.
But I am sti disappointed with the price increase.
*pricing: MODEL deepseek-chat deepseek-reasoner 1M INPUT TOKENS (CACHE HIT) $0.07 1M INPUT TOKENS (CACHE MISS) $0.56 1M OUTPUT TOKENS $1.68
https://brokk.ai/power-ranking?version=openround-2025-08-20&...
some Chinese next gen AI chips is coming, deepseek is working with them to get its flagship model trained using such domestic chips.
interesting time ahead! just imagine what it could do to NVIDIA share price when deepseek releases a SOTA new model trained without using NVIDIA chips.
Anyway, from my experience, if China really has advanced AI chips for SOTA model, I am sure propaganda machine will go all out, look how they boasted Huawei CPU that’s two generations behind Qualcomm and TSMC
I'll have to see how things go with this model after a week, once the hype has died down.
It's theft all the way down, son
Drawing the line only now... saying this is where copying stops being okay doesn't seem very fair. No AI company is really in a position to whine about it from my POV (ignoring any lawyer POV). Cue the world's smallest violin
BTW DeepSeek has contributed a lot, with actual white papers describing in detail their optimizations. How are the rest of the American AI labs doing in contributing research and helping one another advance the field?