Gemma 3n preview: Mobile-first AI

138 points | meetpateltech | 63 comments | 5/20/2025, 6:03:32 PM | developers.googleblog.com

Comments (63)

nolist_policy · 2h ago
You can try it on Android right now:

Download the Edge Gallery APK from GitHub: https://github.com/google-ai-edge/gallery/releases/tag/1.0.0

Download one of the .task files from huggingface: https://huggingface.co/collections/google/gemma-3n-preview-6...

Import the .task file in Edge Gallery with the + button at the bottom right.

You can take pictures right from the app. The model is indeed pretty fast.
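
If you'd rather script the Hugging Face step, a rough sketch using the huggingface_hub Python package (untested here; assumes you've accepted the Gemma license on the model page and have an access token; the repo id is the E4B preview from the collection above):

    # Downloads the preview files (including the .task bundle) to a local directory.
    # Requires `pip install huggingface_hub`; the token must have access to the gated repo.
    from huggingface_hub import snapshot_download

    local_dir = snapshot_download(
        repo_id="google/gemma-3n-E4B-it-litert-preview",
        token="hf_...",   # or log in once with `huggingface-cli login` and drop this arg
    )
    print(local_dir)      # copy the .task file from here to the phone, then import it in Edge Gallery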

nolist_policy · 1h ago
Okay, from some first tries with story writing, gemma-3n-E4B-it seems to perform somewhere between plain Gemma 3 4B and 12B. It definitely retains the strong instruction following, which is good.

Hint: You have to set Max tokens to 32000 for longer conversations. The slider makes it look like it's limited to 1024; just enter the value manually.

lousken · 57m ago
waiting for approval, is there a magnet?
onlyrealcuzzo · 3h ago
Probably a better link: https://developers.googleblog.com/en/introducing-gemma-3n/

Gemma 3n is a model utilizing Per-Layer Embeddings to achieve an on-device memory footprint of a 2-4B parameter model.

At the same time, it performs nearly as well as Claude 3.7 Sonnet in Chatbot Arena.

Deathmax · 3h ago
It's not a 4B parameter model. The E4B variant is 7B parameters, with 4B loaded into memory when per-layer embeddings are cached to fast storage and vision and audio support is excluded.
zamadatix · 2h ago
The link says E2B and E4B have 4B and 8B raw parameters, where do you see 7B?
ai-christianson · 3h ago
That seems way too good to be true.

What's the catch?

Vuizur · 3h ago
It is not very good at hard tasks; its ranking is much worse there.
refulgentis · 2h ago
I used to defend LMSys/Chatbot Arena a lot but threw in the towel after events of the past three months.

I can give more details if you (or anyone else!) are interested.

TL;DR: it is scoring only for "How authoritative did the answer look? How much flattery & how many emojis?"

Jowsey · 2h ago
Is this not what Style Control (which IIRC they're making default soon) aims to mitigate?
refulgentis · 1h ago
I'm not 100% sure what their rationale is for it. The launch version of style control was a statistical model that penalized a few (4?) markdown shibboleths (lists, headers, ?).

Not sure if they've shared more since.

IMVHO it won't help, at all, even if they trained a perfect model that could accurately penalize it*

The main problem is that it's one-off responses, A/B tested. There's no way to connect it to all the stuff we're using to do work these days (i.e. tools / MCP servers), so at this point it's sort of skipping the hard problems we'd want to see graded.

(This situation is an example: what's more likely, that style control is a small idea for an intractable problem, or that Google has now released multiple free models better than Sonnet, including the latest at only 4B params?

To my frustration, I have to go and bench these things myself because I build an AI-agnostic app, but I can confirm it is not the case that Gemma 3-not-n is better than Sonnet. 12B can half-consistently make file edits, which is a major step forward for local, tbh)

* I'm not sure how; "correctness" is a confounding metric here: we're probably much more likely to describe a formatted answer in negative terms if the answer is incorrect.

In this case I'm also setting aside how that could be done; I'm just saying it as an illustration that, no matter what, it's the wrong platform for a "how intelligent is this model?" signal at this point, post-Eliza, post-Turing, a couple of years out from ChatGPT 1.0.

esafak · 3h ago
Imagine a model smarter than most humans that fits on your phone.

edit: I seem to be the only one excited by the possibilities of such small yet powerful models. This is an iPhone moment: a computer that fits in your pocket, except this time it's smart.

rhdjsjebshjffn · 2h ago
I can't speak for anyone else, but these models only seem about as smart as google search, with enormous variability. I can't say I've ever had an interaction with a chatbot that's anything redolent of interaction with intelligence.

Now would I take AI as a trivia partner? Absolutely. But that's not really the same as what I look for in "smart" humans.

sureglymop · 23m ago
The image description capabilities are pretty insane, crazy to think it's all happening on my phone. I can only imagine how interesting this is accessibility wise, e.g. for vision impaired people. I believe there are many more possible applications for these on a smartphone than just chatting with them.
hmapple · 42m ago
Have you tried any SOTA models like o3?

If not, I strongly encourage you to discuss your area of expertise with it and rate based on that

It is incredibly competent

koakuma-chan · 22m ago
I have a girlfriend. She is a model. A large language model.
codr7 · 3h ago
intelligence != memory
esafak · 3h ago
ML is not memorization. Besides, how much memory do you think this model has?
codr7 · 2h ago
I know, it's worse.
TeMPOraL · 2h ago
It's understanding.
croes · 12m ago
LLMs neither understand nor reason; that has been shown multiple times.
rhdjsjebshjffn · 2h ago
Sure, if you still think the word has meaning.
TeMPOraL · 2h ago
Yes, I do. Any way you slice this term, it looks close to what ML models are learning through training.

I'd go as far as saying LLMs are meaning made incarnate - that huge tensor of floats represents a stupidly high-dimensional latent space, which encodes semantic similarity of every token, and combinations of tokens (up to a limit). That's as close to reifying the meaning of "meaning" itself as we'll ever come.

(It's funny that we got there through brute force instead of developing philosophy, and it's also nice that we get a computational artifact out of it that we can poke and study, instead of incomprehensible and mostly bogus theories.)

rhdjsjebshjffn · 2h ago
Eh, it's not so surprising that our neuroticism produced further neuroticism. I rather expect to watch a grand ranking of poetry next week...

To anyone who questions why we might produce such a machine, I ask them to kill themselves out of pity for myself so that I am not obligated to perform such a task.

rhdjsjebshjffn · 2h ago
ML is a kind of memorization, though.
onlyrealcuzzo · 2h ago
Anything can be a kind of something since that's subjective...
croes · 11m ago
But it’s more a kind of memorization than understanding and reasoning.
goatlover · 2h ago
Why are we imagining? That leads to technologies being overhyped.
IceWreck · 2h ago
According to the readme here - https://huggingface.co/google/gemma-3n-E4B-it-litert-preview

E4B has a score of 44.4 on the Aider polyglot benchmark, which means it's on par with gemini-2.5-flash (not the latest preview but the version used for the bench on Aider's website), gpt4o and gpt4.5.

That sounds very good - imagine what a coding-focused version of this could do if this is a "generic" embedded-only model.

On the other hand - this does have a much lower score for livecodebench.

nolist_policy · 1h ago
Hmm, the Aider polyglot benchmark has been removed from the huggingface readme.

Also:

> These models were evaluated at full precision (float32)

For 4B effective parameters, that's 16 GB of RAM.
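
Back-of-envelope, assuming the weights dominate memory use:

    4e9 params × 4 bytes/param (float32) ≈ 16 GB
    4e9 params × 0.5 bytes/param (int4)  ≈  2 GB

(The downloadable .task builds are presumably quantized well below float32, which is how the E4B variant fits on a phone.)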

krackers · 3h ago
What is "Per Layer Embeddings"? The only hit I can find for that term is the announcement blogpost.

And for that matter, what is

> mix’n’match capability in Gemma 3n to dynamically create submodels

It seems like mixture-of-experts taken to the extreme, where you actually create an entire submodel instead of routing per token?

onlyrealcuzzo · 3h ago
https://ai.google.dev/gemma/docs/gemma-3n#parameters

> Gemma 3n models are listed with parameter counts, such as E2B and E4B, that are lower than the total number of parameters contained in the models. The E prefix indicates these models can operate with a reduced set of Effective parameters. This reduced parameter operation can be achieved using the flexible parameter technology built into Gemma 3n models to help them run efficiently on lower resource devices.

> The parameters in Gemma 3n models are divided into 4 main groups: text, visual, audio, and per-layer embedding (PLE) parameters. With standard execution of the E2B model, over 5 billion parameters are loaded when executing the model. However, using parameter skipping and PLE caching techniques, this model can be operated with an effective memory load of just under 2 billion (1.91B) parameters, as illustrated in Figure 1.

krackers · 3h ago
Thank you, that helped a bit, although it's still not clear what exactly those parameters _are_. "Per-Layer Embedding (PLE) parameters that are used during model execution to create data that enhances the performance of each model layer." is too vague, and I can't find any other reference to "per-layer embedding parameters" in literature.
kcorbitt · 1h ago
I wonder if they've trained the model to operate with a shallower stack; e.g. the full model may be composed of 24 transformer blocks, but they've also trained it to accept embeddings at layer 8, so it can be operated with just 16 transformer blocks on lower-resourced devices.

Experimenters in the open source tinkering community have done the opposite (copy/pasting layers in existing models to make them deeper) and it seems to work... fine, with minimal post-training on the new, deeper model required to exceed the performance of the original model. So it's not a crazy idea.
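
A toy sketch of that idea (purely hypothetical, not Gemma 3n's documented architecture): one set of trained blocks, with an entry point that can feed the input embeddings directly into a later block so that fewer blocks run on low-resource devices.

    # Toy illustration of operating the same trained stack at two depths.
    # The block count and the entry point are made up, not Gemma 3n internals.
    import torch
    import torch.nn as nn

    class ElasticStack(nn.Module):
        def __init__(self, d_model=512, n_heads=8, n_blocks=24):
            super().__init__()
            self.blocks = nn.ModuleList(
                nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
                for _ in range(n_blocks)
            )

        def forward(self, x, start=0):
            # start=0 runs all 24 blocks; start=8 feeds the embeddings straight into
            # block 8 and runs only the last 16 blocks (the hypothetical "shallow" mode).
            for block in self.blocks[start:]:
                x = block(x)
            return x

    model = ElasticStack()
    x = torch.randn(1, 32, 512)       # [batch, seq, d_model]
    full_out = model(x)               # full-depth path
    shallow_out = model(x, start=8)   # shallower path for lower-resource devices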

liuliu · 2h ago
Thanks. It is a bit vague to me too. If you need to load 5B parameters for each token generation anyway, how is that different from a selective offloading technique where some MLP weights are offloaded to fast storage and loaded during each token generation?
onlyrealcuzzo · 3h ago
A layer is a transformer block / layer (basically the building block of the modern LLM architectures) - maybe Gemini can help you:

https://gemini.google.com/share/cc58a7c6089e

krackers · 2h ago
I am perfectly aware of that. I don't believe other LLMs have such embeddings per layer, only the usual weights, so these per-layer embeddings seem to be distinguished from weights in some way. AFAIK playing the same "cache in fast storage and load on demand" trick with layer weights wouldn't work, since you'd end up with too much back-and-forth (you'd touch every cached byte on each token, assuming no MoE), so I'm guessing these embeddings are structured in a way that's broken up by concept.
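
(As a rough illustration of that back-and-forth cost, with made-up but plausible numbers:

    ~3 GB of offloaded dense weights ÷ ~1 GB/s sustained flash read ≈ 3 s of I/O per token
    one token's per-layer embedding rows: ~30 layers × 256 dims × 2 bytes ≈ 15 KB

so streaming dense layer weights per token would crater decode speed, while per-token embedding lookups barely touch storage.)
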
stygiansonic · 2h ago
From the article it appears to be something they invented:

> Gemma 3n leverages a Google DeepMind innovation called Per-Layer Embeddings (PLE) that delivers a significant reduction in RAM usage.

Like you I’m also interested in the architectural details. We can speculate but we’ll probably need to wait for some sort of paper to get the details.

ankit219 · 2h ago
You can read this for a comprehensive deep dive. https://arxiv.org/pdf/2502.01637

At a very high level, instead of having embeddings only at the input layer, this method keeps embeddings at the layer level. That is, every transformer layer has its own set of learnable embedding vectors that are used to modify the processed hidden states flowing through the network. Mostly, the embeddings are precomputed and stored separately; they are queried at inference time with very low latency, so you can get comparable performance with half the RAM. (I'm not exactly sure how 3n is doing it; I'm speaking in a general sense.)
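
A minimal sketch of that general pattern (illustrative only, not Gemma 3n's actual implementation; the shapes and the mixing step are invented): the per-layer tables live in a memory-mapped file on fast storage, and each layer looks up only the rows for the current tokens, so the tables never need to be fully resident in RAM.

    # Minimal sketch of "embeddings stored off-accelerator, queried per layer at inference".
    # Shapes are made up; a real export would write the tables once and reopen them read-only.
    import numpy as np

    n_layers, vocab, ple_dim = 4, 32_000, 64   # illustrative sizes only

    # mode="w+" creates a zero-filled demo file; a real precomputed store would use mode="r".
    ple_store = np.memmap("ple_tables.bin", dtype=np.float16, mode="w+",
                          shape=(n_layers, vocab, ple_dim))

    def per_layer_embedding(layer_idx, token_ids):
        # Fetch only the rows needed for the current tokens; the OS pages them in on demand.
        return np.asarray(ple_store[layer_idx, token_ids])   # [seq_len, ple_dim]

    token_ids = np.array([17, 4242, 31999])
    for layer in range(n_layers):
        ple = per_layer_embedding(layer, token_ids)
        # ...mix `ple` into this layer's hidden states; how Gemma 3n actually combines
        # them isn't public yet.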

yorwba · 1h ago
The paper you link to is about a different way to create embeddings at the input layer. In no way does it match your claimed description.
ankit219 · 1h ago
I simplified what I wrote. There is off-accelerator memory where the embeddings are stored and queried at inference time; I did not want to get into details. That is how you reduce in-memory RAM usage. There are definitely more things going on in the paper, as it builds upon the concept I described. The central idea remains the same: you have input embedding layers which map text to continuous vectors. Instead of loading all these layers at runtime, you can break it up per layer at training time and then fetch the required ones from a separate store during inference, so they would not be in RAM. Per-layer is not mentioned in the paper, but surely it's not a great leap from the paper itself?
andy12_ · 2h ago
I think that it's a poorly named reference to this paper [1] that they mention in the blogpost. If I had to give it another more descriptive name, I would probably name it "Per-Layer Embedding Dimensionality"

[1] https://arxiv.org/pdf/2310.07707

yorwba · 2h ago
The MatFormer is clearly called out as a different aspect of the model design.

PLE is much more likely to be a reference to the Per-Layer Embeddings paper that will be published in the future once it doesn't give away any secret sauce anymore.

andy12_ · 1h ago
I thought the same, but Per-Layer Embeddings as a name doesn't make sense in any context, and MatFormer does exactly what the blogpost says PLE does. I just think it's more probable that the blogpost was written by several authors and that no one bothered to check the final result.
HarHarVeryFunny · 2h ago
Per layer LoRA adapters, perhaps? - same as Apple is using for on-device AI.
ljosifov · 2h ago
On Hugging Face I see 4B and 2B versions now -

https://huggingface.co/collections/google/gemma-3n-preview-6...

Gemma 3n Preview

google/gemma-3n-E4B-it-litert-preview

google/gemma-3n-E2B-it-litert-preview

Interesting, hope it comes to LMStudio as MLX or GGUF. Sparse and/or MoE models make a difference when running on localhost. MoE Qwen3-30B-A3B is the most recent game changer for me. Activating only 3B weights on the GPU cores of sparse Qwen3-30B-A3B, rather than the comparable ~30B of dense models (Qwen3-32B, Gemma3-27b, GLM-{4,Z1}-32B, older QwQ-32B), is a huge speedup: MoE A3B achieves 20-60 tps on my oldish M2 in LMStudio, versus only 4-5 tps for the dense models.
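
Rough arithmetic on why the sparse activation helps that much, assuming decode is memory-bandwidth bound and roughly 4-bit quantized weights (numbers are illustrative):

    dense ~32B params × ~0.5 bytes/param ≈ 16 GB of weights read per token
    MoE   ~3B active  × ~0.5 bytes/param ≈ 1.5 GB read per token  →  ~10x less traffic

which lines up with the ~4-5 tps vs 20-60 tps difference above.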

Looking forward to trying gemma-3n. Kudos to Google for open sourcing their Gemmas. Would not have predicted that the lab with "open" in the name has yet to release even a v1 (atm at 0; disregarding GPT-2), while other, more commercial labs are already at versions 3, 4, etc.

impure · 1h ago
Interesting that they reduced the memory usage by half. This would address what is IMO the biggest problem with local LLMs: the limited number of parameters resulting in answers that are not very good.

Also it's funny that they are saying that Llama 4 Maverick performs about the same as GPT-4.1 Nano.

quaintdev · 1h ago
> Gemma 3n enables you to start building on this foundation that will come to major platforms such as Android and Chrome.

Seems like we will not be able to run this with Llama and friends.

https://developers.googleblog.com/en/introducing-gemma-3n/

viraptor · 1h ago
What makes you say that? The files can be downloaded, so it will be done. (Maybe the licence will be an issue)
barnas2 · 2h ago
Is anyone able to test it via AiStudio? I pay for Google's AI subscription, but any attempt to use this model results in a message telling me I've hit my rate limit.
sureglymop · 21m ago
Tested it on my Android phone with Google Edge Gallery. No sign-up is required for the app, although a Hugging Face login is needed to download the models in order to import them.
lxgr · 2h ago
Same here.

I've also seemingly hit a rate limit on Gemini Pro 2.5 (on an account not subscribed to Gemini Advanced) yesterday, even though my last query was weeks ago.

Possibly there's a capacity shortage (I'd presume it all runs on the same Google hardware in the end), and they are prioritizing paid inference?

DonHopkins · 2h ago
If you're paying enough per month you can upgrade your keys to a higher tier:

https://aistudio.google.com/app/apikey

lxgr · 3h ago
On one hand, it's pretty impressive what's possible with these small models (I've been using them on my phone and computer for a while now).

On the other hand, I'm really not looking forward to app sizes ballooning even more – there's no reasonable way to share them across apps at least on iOS, and I can absolutely imagine random corporate apps to start including LLMs, just because it's possible.

onlyrealcuzzo · 3h ago
That sounds like a problem iOS will eventually deal with, as many apps are going to want this technology, and since Apple distributes apps - they aren't interested in the average app being 10x larger when they could solve the problem easily.

Though, I won't be surprised if they try to force devs to use their models for "privacy" (and not monopolistic reasons, of course).

lxgr · 2h ago
Given Apple's track record in dealing with the problem of ballooning app sizes, I'm not holding my breath. The incentives are just not aligned – Apple earns $$$ on each GB of extra storage users have to buy.
elpakal · 19m ago
I don't know how true your comment is about them earning money on each GB, but if you're interested in app size analysis on iOS, I made this for that reason: https://dotipa.app.

I occasionally post decompositions of public .ipa's on the App Store, and I'm looking forward to seeing how these change over the next year.

bilbo0s · 1h ago
I was thinking that the entire time I read HN User onlyrealcuzzo's comment.

Why on Earth would Apple ever want to solve the problem of apps taking up more space? That's just not good business. It's way better business right now to put R&D into increased memory access speeds.

Apple would need an entirely different business model to have a business case for fixing this. They may fix it because they just want to help out the AI guys? Maybe in the future they'll be getting money from the AI guys or something? Then fixing it starts to make a lot of sense.

But all other things being equal, the money for Apple is in this not being fixed.

adityakusupati · 1h ago
MatFormer enables Pareto-optimal elasticity at inference time -- so free models between E2B and E4B as and when we need them!
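
A toy sketch of the Matryoshka idea behind MatFormer (illustrative only; which dimensions Gemma 3n actually nests isn't spelled out in the post): smaller submodels use a prefix slice of the full FFN weights, so one set of weights serves several sizes, and "mix'n'match" picks a slice per layer.

    # Toy nested-FFN illustration: the small model's FFN is a prefix slice of the full one.
    import torch
    import torch.nn as nn

    d_model, d_ff_full = 512, 2048
    w_in, w_out = nn.Linear(d_model, d_ff_full), nn.Linear(d_ff_full, d_model)

    def ffn(x, d_ff):
        # d_ff selects the nested submodel, e.g. 512, 1024, or the full 2048.
        h = torch.relu(x @ w_in.weight[:d_ff].T + w_in.bias[:d_ff])
        return h @ w_out.weight[:, :d_ff].T + w_out.bias

    x = torch.randn(1, 32, d_model)
    small_out = ffn(x, 512)          # "smaller" submodel, same weights
    full_out  = ffn(x, d_ff_full)    # full model
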
cmcconomy · 2h ago
I'd love to see this deployable to edge devices that have a Google Coral TPU.
nharada · 2h ago
Has Google continued releasing new versions of Coral? Seems like a new version with the latest TPU and enough memory specifically to support this model would be awesome for devs
mattlondon · 2h ago
I looked into this recently. Looks like it's a "no".

However, there are now alternatives like the official RPi AI HAT, which has roughly 3x to 6x the TOPS (4 for Coral vs 13/26 for RPi, depending on the model), so there is that. 20 TOPS on an RPi 5 - complete with a nicely vertically integrated camera etc. - is quite interesting.

turnsout · 2h ago
Is this model & architecture compatible with llama.cpp and friends?