Ollama and gguf

73 points by indigodaddy | 31 comments | 8/11/2025, 5:54:08 PM | github.com

Comments (31)

tarruda · 2h ago
I recently discovered that ollama no longer uses llama.cpp as a library; instead they link to the low-level library (ggml), which requires them to reinvent a lot of wheels for absolutely no benefit (if there's some benefit I'm missing, please let me know).

Even using llama.cpp as a library seems like overkill for most use cases. Ollama could make its life much easier by spawning llama-server as a subprocess listening on a unix socket and forwarding requests to it.
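
A minimal Go sketch of that approach, for illustration only: it spawns a stock llama-server on a loopback port (the -m/--host/--port flags are standard; unix-socket support would depend on the build, so a TCP port stands in here) and reverse-proxies requests to it. The model path, ports, and sleep are made up.

```go
package main

import (
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
	"os/exec"
	"time"
)

func main() {
	// Spawn llama-server as a child process bound to loopback.
	cmd := exec.Command("llama-server", "-m", "model.gguf",
		"--host", "127.0.0.1", "--port", "8089")
	if err := cmd.Start(); err != nil {
		log.Fatal(err)
	}
	defer cmd.Process.Kill()

	// Crude readiness wait; a real manager would poll the server instead.
	time.Sleep(2 * time.Second)

	// Forward every incoming request to the child llama-server.
	backend, err := url.Parse("http://127.0.0.1:8089")
	if err != nil {
		log.Fatal(err)
	}
	proxy := httputil.NewSingleHostReverseProxy(backend)
	log.Fatal(http.ListenAndServe(":11434", proxy))
}
```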

One thing I'm curious about: Does ollama support strict structured output or strict tool calls adhering to a json schema? Because it would be insane to rely on a server for agentic use unless your server can guarantee the model will only produce valid json. AFAIK this feature is implemented by llama.cpp, which they no longer use.

hodgehog11 · 1h ago
I got to speak with some of the leads at Ollama and asked more or less this same question. The reason they abandoned llama.cpp is because it does not align with their goals.

llama.cpp is designed to rapidly adopt research-level optimisations and features, but the downside is that reported speeds change all the time (sometimes faster, sometimes slower) and things break really often. You can't hope to establish contracts with simultaneous releases if there is no guarantee the model will even function.

By reimplementing this layer, Ollama gets to enjoy a kind of LTS status that their partners rely on. It won't be as feature-rich, and definitely won't be as fast, but that's not their goal.

jychang · 1h ago
That's a dumb answer from them.

What's wrong with using an older well-tested build of llama.cpp, instead of reinventing the wheel? Like every Linux distro that's ever run into this issue?

Red Hat doesn't ship the latest build of the linux kernel to production. And Red Hat didn't reinvent the linux kernel for shits and giggles.

hodgehog11 · 1h ago
The Linux kernel does not break userspace.

> What's wrong with using an older well-tested build of llama.cpp, instead of reinventing the wheel?

Yeah, they tried this; that was the old setup, as I understand it. But every time they needed support for a new model and had to update llama.cpp, an old model would break and one of their partners would go ape on them. They said it happened more than once, but one particular case (wish I could remember what it was) was so bad they felt they had no choice but to reimplement. It's the lowest risk strategy.

tarruda · 34m ago
> every time they needed support for a new model and had to update llama.cpp, an old model would break and one of their partners would go ape on them. They said it happened more than once, but one particular case (wish I could remember what it was) was so bad they felt they had no choice but to reimplement. It's the lowest risk strategy.

A much lower risk strategy would be using multiple versions of llama-server to keep supporting old models that would break on newer llama.cpp versions.
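
Roughly, the idea could look like the sketch below (hypothetical binary paths, not actual Ollama code): each model family stays pinned to the llama-server build it is known to work with.

```go
package main

import (
	"fmt"
	"os/exec"
)

// Hypothetical mapping from model family to a pinned llama-server build.
var serverForModel = map[string]string{
	"llama3":  "/opt/llama.cpp/builds/old/llama-server",
	"gpt-oss": "/opt/llama.cpp/builds/new/llama-server",
}

// startServer launches the pinned build for the given model.
func startServer(model, ggufPath string, port int) (*exec.Cmd, error) {
	bin, ok := serverForModel[model]
	if !ok {
		return nil, fmt.Errorf("no pinned llama-server build for %q", model)
	}
	cmd := exec.Command(bin, "-m", ggufPath, "--port", fmt.Sprint(port))
	return cmd, cmd.Start()
}

func main() {
	if _, err := startServer("gpt-oss", "gpt-oss-20b.gguf", 8089); err != nil {
		fmt.Println(err)
	}
}
```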

MarkSweep · 14m ago
The Ollama distribution size is already pretty big (at least on Windows) due to all the GPU support libraries and whatnot. Having to multiply that by the number of llama.cpp versions supported would not be great.

refulgentis · 10m ago
This is a good hand-wavy answer for them, but the truth is they've always been allergic to ever mentioning llama.cpp, even when legally required. They made a political decision instead of an engineering one, and now justify it to themselves and to you by hand-waving about it somehow being less stable than its own core, which they still depend on.

A lot of things had to happen to get to the point where they're being called out aggressively in public, on their own repo, by nice people, and I hope people don't misread a weak excuse made in conversation, based on innuendo, as a solid rationale. llama.cpp has been just fine for me.

ozim · 34m ago
Feels like BS. I guess wrapping 2 or even more versions should not be that much of a problem.

There was drama that ollama doesn’t credit llama.cpp and most likely crediting it was „not aligning with their goals”.

A4ET8a8uTh0_v2 · 1h ago
Thank you. This is genuinely a valid reason even from a simple consistency perspective.

(edit: I think -- after I read some of the links -- I understand why Ollama comes across as less of a hero. Still, I am giving them some benefit of the doubt, since they made local models very accessible to plebs like me; and maybe I can graduate to no Ollama.)

hodgehog11 · 57m ago
I think this is the thing: if you can use llama.cpp, you probably shouldn't use Ollama. It's designed for the beginner.
arcanemachiner · 1h ago
> I recently discovered that ollama no longer uses llama.cpp as a library; instead they link to the low-level library (ggml), which requires them to reinvent a lot of wheels for absolutely no benefit (if there's some benefit I'm missing, please let me know).

Here is some relevant drama on the subject:

https://github.com/ollama/ollama/issues/11714#issuecomment-3...

cdoern · 1h ago
> Ollama could make its life much easier by spawning llama-server as a subprocess listening on a unix socket and forwarding requests to it

I'd recommend taking a look at https://github.com/containers/ramalama. It's more similar to what you're describing in the way it uses llama-server, and it is container-native by default, which is nice for portability.

wubrr · 1h ago
> Does ollama support strict structured output or strict tool calls adhering to a json schema?

As far as I understand, this is generally not possible at the model level. The best you can do is wrap the call in a (non-LLM) JSON schema validator and emit an error JSON in case the LLM output does not match the schema, which is what some APIs do for you, but it's not very complicated to do yourself.
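
For illustration, a minimal Go sketch of that wrapper: parse the model output against an expected shape and hand back an error JSON on mismatch so the caller can retry. A hand-rolled struct check stands in for a real JSON Schema validator, and every name here is invented.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

// Expected shape of the model's reply (a stand-in for a real schema).
type ToolCall struct {
	Name string          `json:"name"`
	Args json.RawMessage `json:"args"`
}

// validateOrError returns the parsed call, or an error payload the caller
// can feed back to the model for a retry.
func validateOrError(raw []byte) (*ToolCall, []byte) {
	dec := json.NewDecoder(bytes.NewReader(raw))
	dec.DisallowUnknownFields()
	var call ToolCall
	if err := dec.Decode(&call); err != nil || call.Name == "" {
		msg, _ := json.Marshal(map[string]string{
			"error": "output did not match the expected schema",
		})
		return nil, msg
	}
	return &call, nil
}

func main() {
	out := []byte(`{"name":"search","args":{"query":"weather"}}`)
	if call, errJSON := validateOrError(out); errJSON == nil {
		fmt.Println("valid tool call:", call.Name)
	}
}
```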

Someone correct me if I'm wrong

mangoman · 1h ago
No, that's incorrect: llama.cpp has support for providing a context-free grammar while sampling, and it only samples tokens that conform to the grammar rather than tokens that would violate it.

tarruda · 1h ago
The inference engine (llama.cpp) has full control over the possible tokens during inference. It can "force" the LLM to output only valid tokens so that it produces valid JSON.

kristjansson · 55m ago
and in fact leverages that control to constrain outputs to those matching user-specified BNFs

https://github.com/ggml-org/llama.cpp/tree/master/grammars
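
To make the mechanism concrete, here is a toy Go sketch of constrained decoding: at each step the grammar determines which token IDs are still legal, everything else is masked out, and sampling (greedy here) happens only over the allowed set. The allowed-token map stands in for what llama.cpp's GBNF machinery actually computes; all names are invented for the sketch.

```go
package main

import (
	"fmt"
	"math"
)

// sampleAllowed picks the highest-logit token among those the grammar permits.
func sampleAllowed(logits []float64, allowed map[int]bool) int {
	best, bestLogit := -1, math.Inf(-1)
	for tok, l := range logits {
		if !allowed[tok] {
			continue // this token would violate the grammar, so it can never be emitted
		}
		if l > bestLogit {
			best, bestLogit = tok, l
		}
	}
	return best
}

func main() {
	logits := []float64{2.5, 0.1, 3.9, 1.0}
	allowed := map[int]bool{0: true, 3: true} // say the grammar permits only tokens 0 and 3
	// Prints 0: token 2 has the highest raw logit but is masked out.
	fmt.Println("sampled token:", sampleAllowed(logits, allowed))
}
```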

halyconWays · 1h ago
>(if there's some benefit I'm missing, please let me know).

Makes their VCs think they're doing more, and have more ownership, rather than being a do-nothing wrapper with some analytics and S3 buckets that rehost models from HF.

indigodaddy · 5h ago
scosman · 32m ago
This is the comment people should read. GG is amazing.

Ollama forked to get it working for day 1 compatibility. They need to get their system back in line with mainline because of that choice. That's kinda how open source works.

The uproar over this (mostly on reddit and x) seems unwarranted. New models regularly have compatibility issues for much longer than this.

ekianjo · 18m ago
GG clearly mentioned they did not contribute anything to upstream.
polotics · 2h ago
ggerganov is my hero, and... it's a good thing this got posted, because I saw in the comments that --flash-attn --cache-reuse 256 could help with my setup (M3 36GB + RPC to an M1 16GB). Figuring out which params to set, and at what values, is a lot of trial and error; Gemini does help a bit to clarify what params like top-k are going to do in practice. Still, the whole load-balancing with RPC is something I think I'm going to have to read the llama.cpp source to really understand (oops, I almost wrote grok, damn you Elon). Anyway, ollama is still not doing distributed load, and yeah, I guess using it is a stepping stone...
magicalhippo · 5h ago
I noticed it the other way around: llama.cpp failed to load the Ollama-downloaded gpt-oss 20b model. Thought it was odd, given all the others I tried worked fine.

Figured it had to be Ollama doing Ollama things, seems that was indeed the case.

LeoPanthera · 2h ago
The named anchor in this URL doesn't work in Safari. Safari correctly scrolls down to the comment in question, but then some Javascript on the page throws you back up to the top again.
llmthrowaway · 2h ago
Confusing title: I thought this was about Ollama finally supporting sharded GGUF (i.e. the Hugging Face default for large GGUFs over 48 GB).

https://github.com/ollama/ollama/issues/5245

Sadly it is not, and the issue still remains open after over a year, meaning Ollama cannot run the latest SOTA open-source models unless they convert them to their proprietary format, which they do not consistently do.

No surprise, I guess, given they've taken VC money, refuse to properly attribute their use of things like llama.cpp and ggml, have their own model format for... reasons? and have over 1800 open issues...

Llama-server, ramalama, or whatever model switcher ggerganov is working on (he showed previews recently) feel like the way forward.

dcreater · 4h ago
I think the title buries the lede? It's specific to GPT-OSS and exposes the shady stuff Ollama is doing to acquiesce to / curry favor with / partner with / get paid by corporate interests.
freedomben · 3h ago
I think "shady" is a little too harsh - sounds like they forked an important upstream project, made incompatible changes that they didn't push upstream or even communicate with upstream about, and now have to deal with the consequences of that. If that's "shady" (despite being all out in the open) then nearly every company I've worked for has been "shady."
wsgeorge · 2h ago
There's a reddit thread from a few months ago that sort of explains what people don't like about ollama, the "shadiness" the parent references:

https://www.reddit.com/r/LocalLLaMA/comments/1jzocoo/finally...

om8 · 22m ago
llama.cpp is a mess and ollama is right to move on from it
12345hn6789 · 2h ago
Just days ago, Ollama devs claimed [0] that Ollama no longer relies on ggml / llama.cpp. Here is their pull request (+165,966 −47,980) to reimplement (copy) llama.cpp code in their repository.

https://github.com/ollama/ollama/pull/11823

[0] https://news.ycombinator.com/item?id=44802414#44805396

flakiness · 1h ago
Not against the overall sentiment here, but to be fair, quoting the counterpoint from the linked HN comment:

> Ollama does not use llama.cpp anymore; we do still keep it and occasionally update it to remain compatible for older models for when we used it.

The linked PR is the "occasionally update it" part, I guess? Note that "vendored" in the PR title often means taking a snapshot to pin a specific version.

ekianjo · 14m ago
gpt-oss is not an "older model"