Show HN: Cactus – Ollama for Smartphones
Ollama enables deploying LLMs locally on laptops and edge servers; Cactus enables deploying them on phones. Deploying directly on phones facilitates building AI apps and agents capable of phone use without breaking privacy, and supports real-time inference without network latency. We have also seen personalised RAG pipelines for users, and more.
Apple and Google both moved into local AI models recently, with the launch of Apple's Foundation Models framework and Google AI Edge respectively. However, both are platform-specific and only support specific models from the company. To this end, Cactus:
- Is available in Flutter, React Native & Kotlin Multiplatform for cross-platform developers, since most apps are built with these today.
- Supports any GGUF model you can find on Huggingface: Qwen, Gemma, Llama, DeepSeek, Phi, Mistral, SmolLM, SmolVLM, InternVLM, Jan Nano etc.
- Accommodates anything from FP32 down to 2-bit quantized models, for better efficiency and less device strain.
- Supports MCP tool calls to make models performant and truly helpful (set reminders, search the gallery, reply to messages) and more.
- Falls back to big cloud models for complex, constrained or large-context tasks, ensuring robustness and high availability.
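The fallback in the last bullet can be pictured as a simple routing step. What follows is a minimal Python sketch, not Cactus's actual API: `Router`, `local_fn`, `cloud_fn` and the token threshold are hypothetical names and values picked for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class Router:
    """Route a prompt to an on-device model, falling back to a cloud model.

    Everything here is hypothetical: the callables stand in for real
    inference backends and the context limit is an arbitrary example value.
    """
    local_fn: Callable[[str], str]
    cloud_fn: Callable[[str], str]
    max_local_tokens: int = 2048  # assumed context budget for the local model

    def estimate_tokens(self, prompt: str) -> int:
        # Crude whitespace-based estimate; a real app would use the tokenizer.
        return len(prompt.split())

    def complete(self, prompt: str) -> Tuple[str, str]:
        """Return (backend_name, completion)."""
        if self.estimate_tokens(prompt) > self.max_local_tokens:
            # Prompt is too large for the local model: go straight to cloud.
            return "cloud", self.cloud_fn(prompt)
        try:
            return "local", self.local_fn(prompt)
        except RuntimeError:
            # Local inference failed (e.g. out of memory): fall back.
            return "cloud", self.cloud_fn(prompt)
```

Keeping the routing decision in one place makes it easy to log which backend answered, which matters if privacy is the reason for running locally in the first place.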
It's completely open source. Would love to have more people try it out and tell us how to make it great!
Is this really true? Where are these stats coming from?
We are a dev toolkit to run LLMs cross-platform locally in any app you like.
With respect to the inference SDK, yes you'll need to install the (react native/flutter) framework inside each app you're building.
The SDK is very lightweight (our own iOS app is <30MB which includes the inference SDK and a ton of other stuff)
It would be great if the local LLM had access to local tools you can enable/disable as needed (e.g. via customizable profiles). Simple tools such as fetch url, file access, messaging, calendar, etc would be very useful, though I'm not sure if the input token limit is large enough to allow this. Even better if it can somehow do web search but I understand it would be hard to do for free.
Also, how cool it would be if you can expose openai compatible api that can be accessed from other devices in your local network? Imagine turning your old phones into local llm servers. That would be very cool.
By the way, I can't figure out how to clear previous chat data. Is it hidden somewhere?
to your previous point - Cactus fully supports tool calling (for models that have been instruction-trained accordingly, e.g. Qwen 1.7B)
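For anyone wondering what tool calling with enable/disable profiles might look like in practice, here is a rough Python sketch. It is not Cactus's API: `TOOLS`, `PROFILES`, `dispatch` and the JSON shape of the tool call are all assumptions made for illustration, and real models and frameworks differ in the exact format.

```python
import json
from typing import Callable, Dict, Set

# Hypothetical tool registry: tool name -> plain Python function.
TOOLS: Dict[str, Callable[..., str]] = {
    "set_reminder": lambda text, when: f"reminder '{text}' set for {when}",
    "fetch_url": lambda url: f"(would fetch {url})",
}

# Hypothetical profiles: each one lists the tools the user has enabled.
PROFILES: Dict[str, Set[str]] = {
    "offline": {"set_reminder"},
    "online": {"set_reminder", "fetch_url"},
}


def dispatch(model_output: str, profile: str) -> str:
    """Run the tool call a model emitted as JSON, if the profile allows it.

    Assumes the model emits {"name": ..., "arguments": {...}}.
    """
    call = json.loads(model_output)
    name = call["name"]
    if name not in PROFILES[profile]:
        return f"tool '{name}' is disabled in profile '{profile}'"
    return TOOLS[name](**call.get("arguments", {}))
```

The check against the profile happens before the tool runs, so a model can never reach a tool the user has switched off, whatever it emits.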
for "turning your old phones into local llm servers", Cactus is likely not the best tool. We'd recommend something like actual Ollama or Exo
Thank you especially for the phone model vs tok/s breakdown. Do you have such tables for more models? For models even leaner than Gemma3 1B. How low can you go? Say I wanted to eke out 45 tok/s on an iPhone 13?
P.S: Also, I'm assuming the speeds stay consistent with react-native vs. flutter etc?
A Qwen 2.5 500M will get you to ≈45 tok/sec on an iPhone 13. Inference speed is roughly inversely proportional to model size.
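As a back-of-the-envelope illustration of that inverse relationship, one measured data point can be scaled by the ratio of parameter counts. The sketch below uses the 500M / 45 tok/sec figure quoted above as its reference; the function name is made up, and the result is an extrapolation rather than a benchmark.

```python
def estimate_tok_per_sec(params: float, ref_params: float = 500e6,
                         ref_tok_per_sec: float = 45.0) -> float:
    """Rough estimate assuming speed is inversely proportional to model size.

    The reference point (500M parameters at ~45 tok/sec on an iPhone 13) is
    the figure quoted above; everything else is extrapolation and ignores
    quantization level, memory bandwidth and thermal throttling.
    """
    if params <= 0:
        raise ValueError("params must be positive")
    return ref_tok_per_sec * ref_params / params
```

Under this assumption a 1B-parameter model on the same phone would land around half the reference speed, and a 4B model around an eighth of it.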
Yes, speeds are consistent across frameworks, although (and don't quote me on this), I believe React Native is slightly slower because it interfaces with the C++ engine through a set of bridges.
When I was working with RAG llama.cpp through RN early last year I had pretty acceptable tok/sec results up through 7-8b quantized models (on phones like the S24+ and iPhone 15pro). MLC was definitely higher tok/sec but it is really tough to beat the community support and availability in the gguf ecosystem.
Most of the standard mobile CPU benchmarks (GeekBench, AnTuTu, et al) show a 20-40% performance gain over S23/S24 Ultra. Also, this bucks the trend where most other devices are ranked appropriately (i.e. newer devices perform better).
Thanks for sharing your project.
S25 is an outlier that surprised us too.
I got $10 on S25 climbing back up to the top of the rankings as more data comes in :)
Both the model and the app only have access to the tools or data that you choose to give it. If you choose to give the model access to web search - sure, it'll have (read-only) access to internet data.
1. The lack of a dark mode is an accessibility issue for me. I have a genetic condition that causes severe light sensitivity and special difficulty with contrast. The only way for me to achieve sufficient contrast without uncomfortable and blinding brightness is dark mode, so at present I can only use your app by disabling dark mode and inverting colors across my phone. This is of course not ideal because it ruins photos in other apps, and I end up with some unavoidable very bright white "hotspots" on my phone that I don't normally have when I can just use dark mode. Relatedly, the contrast for some of the text in the app is low to the point of being practically unreadable for me (getting enough contrast with it similarly requires cranking up the brightness). :(
2. I tried downloading a few other models, namely Jan Nano and SmolLM3, using the GGUF link download functionality, but every time I select them, the app just immediately crashes.
I understand that the chat app on the Play Store is basically just a demo for the framework, and if I were really using it I would be in charge of my own theming and downloading the required models and so on, but these still seem worth fixing to me.
Would be great to have a few larger models to choose from too, Qwen 3 4b, 8b etc
Adding shortly!
We are working on agentic browser (also launched today https://news.ycombinator.com/item?id=44523409 :))
Right now we have a desktop version with ollama support, but we want to build a mobile chromium fork with local LLM support. Will check out cactus!
DM me on BF - let's talk!
can you tell us more about the use cases that you have in mind? I saw that you're able to run 1-4B models (which is impressive!)
We're currently working with a few projects in the space.
For a demo of a familiar chat interface, download https://apps.apple.com/gb/app/cactus-chat/id6744444212 or https://play.google.com/store/apps/details?id=com.rshemetsub...
For other applications, join the discord and stay tuned! :)
The performance is quite good, even on CPU.
However I'm now trying it on a pixel, and it's not using GPU if I enable it.
I do like this idea as I've been running models in termux until now.
Is the plan to make this app something similar to lmstudio for phones?
Some Android models won't support GPU hardware; we'll be addressing that as we move to our own kernels.
The app itself is just a demonstration of Cactus performance. The underlying framework gives you the tools to build any local mobile AI experience you'd like.
I believe there are some frameworks pioneering model encryption, but i think we're a few steps away from wide adoption.
This isn't really anything novel to LLMs or AI models. Part of the reason many previously desktop applications moved to the cloud or require cloud access is to keep their sensitive IP off the end users' devices.
It is fantastic. Compared to another program I had installed a year ago, the speed of processing and answering is really good and accurate. Was able to ask mathematical questions, basic translation between different languages and even trivia about movies released almost 30 years ago.
Things to improve: 1) sometimes the question would get stuck on the last phrase and keep repeating it without end. 2) The chat does not scroll the window to follow the answer and we have to scroll manually.
In either case, excellent start. It is without doubt the fastest offline LLM that I've seen working on this phone.
re: "question would get stuck on the last phrase and keep repeating it without end." - that's a limitation of the model i'm afraid. Smaller models tend to do that sometimes.
Most projects typically start with llama.cpp and then move away to proprietary kernels
we support cloud fallback as an add-on feature. This lets us support vision and audio in addition to text.
https://github.com/cactus-compute/cactus/tree/main/react#emb...
(Flutter works the same way)
What are you building?
You’d want an API for downloading OR pulling from a cache. Return an identifier from that and plug it into the inference API.
We're restructuring the model initialization API to point to a local file & exposing a separate abstracted download function that takes in a URL.
wrt downloading post-install: based on our feedback, this is indeed a preferred pattern (as opposed to bundling in large files).
We'll update the download API, thanks again.
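The split discussed here (a download-or-cache step that returns an identifier, and an init step that only takes a local file) could look roughly like the Python sketch below. It is an illustration, not the Cactus API: `ensure_model`, `init_model` and the injected `fetch` callable are hypothetical names.

```python
import hashlib
from pathlib import Path
from typing import Callable


def ensure_model(url: str, cache_dir: Path,
                 fetch: Callable[[str], bytes]) -> Path:
    """Return a local path for `url`, downloading only on a cache miss.

    `fetch` is injected so the caller decides how bytes are retrieved
    (HTTP client, platform download manager, ...).
    """
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Derive a stable file name from the URL so repeat calls hit the cache.
    key = hashlib.sha256(url.encode("utf-8")).hexdigest()[:16]
    path = cache_dir / f"{key}.gguf"
    if not path.exists():
        tmp = path.with_suffix(".part")
        tmp.write_bytes(fetch(url))
        tmp.replace(path)  # rename into place so partial files are never used
    return path


def init_model(model_path: Path) -> dict:
    """Stand-in for an inference API that only accepts a local file."""
    if not model_path.is_file():
        raise FileNotFoundError(model_path)
    return {"model_path": str(model_path)}
```

With this shape, an app that bundles its model skips `ensure_model` entirely and passes the bundled path straight to the init call.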
https://play.google.com/store/apps/details?id=com.rshemetsub...
Otherwise, it's easy to build any of the example apps from the repo:
cd react/example && yarn && npx expo run:android
or
cd flutter/example && flutter pub get && flutter run
> Why lie?
Whoa—that's way too aggressive for this forum and definitely against the site guidelines. Could you please review them (https://news.ycombinator.com/newsguidelines.html) and take the spirit of this site more to heart? We'd appreciate it. You can always make your substantive points while doing that.
Note this one: "Please respond to the strongest plausible interpretation of what someone says, not a weaker one that's easier to criticize. Assume good faith."
The core distinction is in the ecosystem: Google AI Edge runs tflite models, whereas Cactus is built for GGUF. This is a critical difference for developers who want to use the latest open-source models.
One major outcome of this is model availability. New open source models are released in GGUF format almost immediately. Finding or reliably converting them to tflite is often a pain. With Cactus, you can run new GGUF models on the day they drop on Huggingface.
Quantization level also plays a role. GGUF has mature support for quantization far below 8-bit. This is effectively essential for mobile. Sub-8-bit support in TFLite is still highly experimental and not broadly applicable.
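To put rough numbers on why sub-8-bit matters on a phone: the weights alone take about (parameters × bits per weight ÷ 8) bytes. A small Python sketch of that arithmetic follows; the function name is made up, and it deliberately ignores quantization scales, metadata, the KV cache and runtime buffers.

```python
def weight_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in gigabytes (10**9 bytes).

    Ignores per-block scales, metadata, the KV cache and runtime buffers,
    so real GGUF files and memory use are somewhat larger.
    """
    return n_params * bits_per_weight / 8 / 1e9


# A 1B-parameter model at different precisions:
for bits in (32, 8, 4, 2):
    print(f"{bits:>2}-bit: ~{weight_size_gb(1e9, bits):.2f} GB")
```

For a 1B-parameter model that is roughly 4 GB at FP32, 1 GB at 8-bit, 0.5 GB at 4-bit and 0.25 GB at 2-bit, before any overhead.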
Last, Cactus excels at CPU inference. While tflite is great, its peak performance often relies on specific hardware accelerators (GPUs, DSPs). GGUF is designed for exceptional performance on standard CPUs, offering a more consistent baseline across the wide variety of devices that app developers have to support.
GGUF is more suitable for the latest open-source models, i agree there. Quant2/Q4 will probably be critical as well, if we don't see a jump in ram. But then again I wonder when/If mediapipe will support GGUF as well.
PS, I see you are in the latest YC batch? (below you mentioned BF). Good luck and have fun!
I have not looked at OP's work yet, but if it makes the task easier, I would opt for that instead of Google's "MediaPipe" API.
1) The commit history goes back to April.
2) The llama.cpp licence is included in the repo where necessary, as with Ollama, until it is deprecated.
3) Flutter isolates behave like servers, and Cactus's code uses that.
Phones are resource-constrained. We saw significant battery overhead with in-process HTTP listeners, so we stuck with simple stateful isolates in Flutter and are exploring a standalone server app that others can talk to for React.
For model sharing with the current setup:
iOS - We are working towards writing the model into an App Group container, tricky but working around it.
Android - We are working towards prompting the user once for a SAF directory (e.g., /Download/llm_models), saving the model there, then publishing a ContentProvider URI for zero-copy reads.
We are already writing more mobile-friendly kernels and tensors. GGML/GGUF is widely supported, so porting it was an easy way to get started and collect feedback, but we will move away from it completely in < 2 months.
Anything else you would like to know?
How does writing a model into a shared directory on Android enable a local LLM server that 3rd party apps can make calls to?[^2]
How does writing your own kernels get you off GGUF in 2 months? GGUF is a storage format. You use kernels to do things with the numbers you get from it.
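To make the storage-format point concrete, this is roughly what reading the start of a GGUF file looks like. The Python sketch below assumes the header layout of GGUF version 2 and later (little-endian: magic, version, tensor count, metadata key/value count); the function name is my own and it stops at the fixed-size header.

```python
import struct

GGUF_MAGIC = b"GGUF"


def read_gguf_header(data: bytes) -> dict:
    """Parse the fixed-size header at the start of a GGUF file.

    Assumed layout (little-endian, GGUF v2 and later): 4-byte magic,
    uint32 version, uint64 tensor count, uint64 metadata key/value count.
    """
    if len(data) < 24 or data[:4] != GGUF_MAGIC:
        raise ValueError("not a GGUF file")
    version, n_tensors, n_kv = struct.unpack_from("<IQQ", data, 4)
    return {
        "version": version,
        "tensor_count": n_tensors,
        "metadata_kv_count": n_kv,
    }
```

Nothing in there computes anything: the file only describes and stores tensors, and whatever kernels you run over those numbers are a separate concern.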
I thought GGUF was an advantage? Now it's something you're basically done using?
I don't think you should continue this conversation. As easy as it is to get your work out there, it's just as easy to build a record of stretching the truth over and over again.
Best of luck, and I mean it. Just, memento mori: be honest and humble along the way. This is something you will look back on in a year and grimace.
[^1] App group containers only work between apps signed from the same Apple developer account. Additionally, that is shared storage, not a way to provide APIs to other apps.
[^2] SAF = Storage Access Framework, that is shared storage, not a way to provide APIs to other apps.
Failing to stay professional and just answer the questions, and instead doing "aight im outta here" when it gets a little bit harder, is not a good look; it seems like you can't defend your own project.
Just FYI.
- "You are, undoubtedly, the worst pirate i have ever heard of" - "Ah, but you have heard of me"
Yes, we are indeed a young project. Not two weeks, but a couple of months. Welcome to AI, most projects are young :)
Yes, we are wrapping llama.cpp. For now. Ollama too began by wrapping llama.cpp. That is the mission of open-source software - to enable the community to build on each other's progress.
We're enabling the first cross-platform in-app inference experience for GGUF models and we're soon shipping our own inference kernels fully optimized for mobile to speed up the performance. Stay tuned.
PS - we're up to good (source: trust us)