Hey all,
I created this model with a top-notch team. I answered many questions last week when this hit the front page (https://news.ycombinator.com/item?id=44902148), and I'm happy to answer more here as well.
Personally I'm excited that you all have access to this model now and hope you all get value out of using it.
riedel · 2h ago
Very stupid question: why does the tflite model output only '[multimodal][multimodal]' when executed on the GPU in the AI Edge Gallery app, while it works fully on the CPU?
WithinReason · 5h ago
I would like to know your thoughts on using 2/3 of such a small model's size for embeddings. What would be different if you used a byte-level vocabulary and spent the parameter budget on transformer parameters instead? I think you would lose performance (tok/s) but might gain accuracy.
canyon289 · 5h ago
At this small scale the embeddings were indeed a big focus. Consider this thought process.
The tokens themselves are a form of compression. Let's say we have the word "WaffleHouse": at the character level this would be 11 tokens, but with a subword tokenizer it would be perhaps 2 or 3 tokens (I didn't actually run it through the tokenizer, but we could verify precisely). This matters a lot for on-device processing especially.
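A quick way to check that claim, as a sketch; the `google/gemma-3-270m` checkpoint id is an assumption (any tokenizer id would do), and the repo may require accepting the license:

```python
# Count how many tokens "WaffleHouse" becomes vs. its 11 characters.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("google/gemma-3-270m")  # assumed checkpoint id
ids = tok.encode("WaffleHouse", add_special_tokens=False)
print(len(ids), tok.convert_ids_to_tokens(ids))  # expecting a handful of subword tokens
```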
So while we could get more intelligence out of the model by bumping up the "knowledge" parameters, the device would need to process more input and output tokens.
Another advantage on small devices is that the embeddings are just a lookup table, which requires little to no computation. It's the rest of the parameters that have the expensive matrix multiplications, so if we increased those we'd also be increasing the number of FLOPs needed for a forward pass. This blog post explains it well: https://www.adamcasson.com/posts/transformer-flops
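To make the lookup-vs-matmul point concrete, a toy comparison (the sizes below are illustrative, not the actual Gemma config):

```python
import torch
import torch.nn as nn

vocab, d_model = 256_000, 640            # illustrative sizes only
emb = nn.Embedding(vocab, d_model)       # ~164M params, but a forward pass is just a row gather
proj = nn.Linear(d_model, 4 * d_model)   # ~1.6M params, but every token pays a matmul

tokens = torch.randint(0, vocab, (1, 16))
x = emb(tokens)   # essentially no FLOPs, just memory lookups
y = proj(x)       # ~2 * seq_len * d_model * (4 * d_model) FLOPs
```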
All this to say: there are definite tradeoffs between model size, performance on evals, and compute cost. We ran many internal experiments with different choices to see what could work well, and then picked what we believed would work best for the open community.
Scene_Cast2 · 4h ago
How would this matrix get trained with PyTorch? I currently have a toy Transformer network - I ended up marking the matrix as sparse and using SparseAdam - gives a bit of a performance boost, but at the same time I can't use torch.compile() on the fetch from this matrix.
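In case it helps frame the question, a minimal sketch of the split-optimizer pattern being described (toy sizes, not tied to any particular model, and it doesn't address the torch.compile limitation):

```python
import torch
import torch.nn as nn

emb = nn.Embedding(50_000, 256, sparse=True)  # gradients come back as sparse tensors
head = nn.Linear(256, 50_000)

opt_sparse = torch.optim.SparseAdam(emb.parameters())  # handles sparse grads
opt_dense = torch.optim.AdamW(head.parameters())       # everything else stays dense

tokens = torch.randint(0, 50_000, (8, 32))
loss = head(emb(tokens)).float().logsumexp(-1).mean()  # dummy loss just to get gradients
loss.backward()
opt_sparse.step(); opt_dense.step()
opt_sparse.zero_grad(); opt_dense.zero_grad()
```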
WithinReason · 3h ago
Makes sense, thank you.
tarruda · 5h ago
Thanks for your work, it is really an amazing small LM.
Can you share what kind of hardware is necessary to train it, and how long it took?
The Gemma3 technical report contains many details on the training setup: https://arxiv.org/pdf/2503.19786
This was released with the initial batch of Gemma3 models, so it doesn't contain the 270m details; nonetheless you'll get a good idea of what it takes to build these models.
GaggiX · 6h ago
I imagine you and your team have finetuned the model on different tasks, can you share some results? (I have only seen the alien NPC finetuning)
Does it have function calls? Can we use it with MCP?
canyon289 · 3h ago
It can possibly perform basic prompted FC, but I wouldn't get your hopes up. It should be able to be a solid FC model if trained on specific tools and formats. I would not expect great MCP performance because the context window is 32k, and most MCP servers I've seen implicitly assume massive context windows.
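For anyone curious what "basic prompted FC" can look like, a rough sketch; the prompt format and the extract_call helper are hypothetical, not an official Gemma tool-calling format:

```python
import json, re

# Hypothetical prompt-based tool calling: describe one tool in the prompt and
# ask the model to reply with a single JSON object.
TOOL_PROMPT = """You can call one tool.
Tool: get_weather(city: str)
Reply ONLY with JSON like {"tool": "get_weather", "args": {"city": "..."}}.

User: What's the weather in Lisbon?"""

def extract_call(model_output: str):
    # Grab the first {...} block and parse it; return None if the model rambled.
    match = re.search(r"\{.*\}", model_output, re.DOTALL)
    if not match:
        return None
    try:
        return json.loads(match.group(0))
    except json.JSONDecodeError:
        return None

# Feed TOOL_PROMPT to the model, then pass its text output through extract_call().
```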
shekhar101 · 4h ago
Can someone (or OP) point me to a recipe to fine-tune a model like this for natural language tasks like complicated NER or similar workflows? I tried finetuning Gemma3 270M when it came out last week without any success. A lot of tutorials are geared towards chat applications and role playing, but I feel this model could be great for use cases like mine, where I am trying to clean up and extract data from PDFs with entity identification and such.
lgessler · 33m ago
If you're really just doing traditional NER (identifying non-overlapping spans of tokens which refer to named entities) then you're probably better off using encoder-only (e.g. https://huggingface.co/dslim/bert-large-NER) or encoder-decoder (e.g. https://huggingface.co/dbmdz/t5-base-conll03-english) models. These models aren't making headlines anymore because they're not decoder-only, but for established NLP tasks like this which don't involve generation, I think there's still a place for them, and I'd assume that at equal parameter counts they quite significantly outperform decoder-only models at NER, depending on the nature of the dataset.
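A minimal sketch of the encoder-only route, using the dslim/bert-large-NER checkpoint linked above with the standard transformers pipeline:

```python
# Off-the-shelf NER with an encoder-only model.
from transformers import pipeline

ner = pipeline("ner", model="dslim/bert-large-NER", aggregation_strategy="simple")
print(ner("Gemma was released by Google DeepMind in London."))
# -> list of dicts with entity_group, score, word, start, end
```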
Just wondering if it’s worth testing and what it would be most useful for.
kace91 · 3h ago
This might be a very basic question, but as a dev whose only interaction with models is using the main commercial ones (Sonnet, ChatGPT and the like), what are some use cases for these smaller local models?
What usage is it reasonable to expect from them? Are they useful out of the box, or does one have to go through some custom post-training to get useful behavior?
I feel like there is a huge gap between understanding models as a user of commercial tools and the kind of discussions happening in these threads, but I’m not sure what are the in-between steps.
ModelForge · 1h ago
I'd say the common ones (besides educational) are
- private, on-device models (possibly with lower latency than models via web API); also edge devices
- algorithm research (faster and cheaper to prototype new ideas)
- cheap tasks, like classification/categorization; sure, you don't need a decoder-style LLM for that, but it has the advantage of being more free-form, which is useful in many scenarios; or maybe a sanity checker for grammar; or even a router to other models (GPT-5 style); a minimal classification sketch follows below
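As a concrete example of the "cheap classification" case, a sketch of prompting a small local instruct model as a zero-shot classifier; the `google/gemma-3-270m-it` id is an assumption and needs a recent transformers version:

```python
# Sketch: small instruction-tuned model as a zero-shot ticket classifier.
from transformers import pipeline

generator = pipeline("text-generation", model="google/gemma-3-270m-it")  # assumed model id

prompt = (
    "Classify the support ticket into exactly one label: billing, bug, feature_request.\n"
    "Ticket: I was charged twice this month.\n"
    "Label:"
)
out = generator(prompt, max_new_tokens=5, do_sample=False)
print(out[0]["generated_text"][len(prompt):].strip())  # continuation only
```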
canyon289 · 3h ago
It's a crucial question. I wrote up a long answer here: https://news.ycombinator.com/item?id=44913558. Let me know if it helps.
Summarization, very basic tool use, without needing to go across the internet and back, and zero cost because of edge compute.
_giorgio_ · 3h ago
Maybe also secrecy and privacy.
keeeba · 4h ago
What use-cases do you see for the 270M’s embeddings, and should we be sticking to token embeddings or can we meaningfully pool for sentence/document embeddings?
Do we need to fine-tune for the embeddings to be meaningful at the sentence/document level?
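For context, a common baseline is masked mean pooling over the final hidden states; whether that's meaningful without fine-tuning is exactly the open question here. A sketch, assuming the `google/gemma-3-270m` checkpoint id and a recent transformers version:

```python
# Sentence embeddings by mean-pooling the final hidden states.
# Untuned decoder embeddings may still need fine-tuning (or a contrastive head) to be useful.
import torch
from transformers import AutoModel, AutoTokenizer

name = "google/gemma-3-270m"  # assumed checkpoint id
tok = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name)

batch = tok(["Small models are handy.", "Edge inference is cheap."],
            padding=True, return_tensors="pt")
with torch.no_grad():
    hidden = model(**batch).last_hidden_state   # (batch, seq, dim)
mask = batch["attention_mask"].unsqueeze(-1)    # zero out padding positions
emb = (hidden * mask).sum(1) / mask.sum(1)      # masked mean pooling
print(emb.shape)
```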
lsb · 5h ago
That’s wild that with a KV cache and compilation on the Mac CPU you are faster than on an A100 GPU.
ModelForge · 59m ago
Could be an artifact of the small size not fully taking advantage of the GPU. For example, for the slightly larger Qwen3 0.6B model the A100 is faster (you can see it when scrolling to the bottom here: https://github.com/rasbt/LLMs-from-scratch/tree/main/ch05/11...)
ladberg · 1h ago
Given that the compiled version is slower than the eager version on A100, there's definitely something suboptimal happening there
ModelForge · 55m ago
No, the compiled version is actually faster.
From that table, the A100 tok/sec (larger is faster) numbers are:
- Eager: 28
- Compiled: 128
And
- KV cache eager: 26
- KV cache compiled: 99
The reason the KV-cache version is slower here is likely that it's not GPU-optimized code. On CPU the KV cache is faster. To make it faster on GPU, you would, for example, pre-allocate the cache tensors on the device instead of `torch.cat`ting them on the fly.
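A sketch of the difference being described, for one layer's key cache (shapes simplified, device handling omitted):

```python
import torch

max_len, n_heads, head_dim = 1024, 4, 64
new_k = torch.randn(1, n_heads, 1, head_dim)  # keys for the newest token

# Pattern A: grow the cache with torch.cat each step.
# Simple, but re-allocates and copies the whole cache per token, which hurts on GPU.
k_cache = torch.empty(1, n_heads, 0, head_dim)
k_cache = torch.cat([k_cache, new_k], dim=2)

# Pattern B: pre-allocate once (on the device) and write into a slice each step.
k_buf = torch.zeros(1, n_heads, max_len, head_dim)  # .to("cuda") in real use
pos = 0
k_buf[:, :, pos:pos + 1, :] = new_k
pos += 1
```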
punnerud · 4h ago
Because on Mac the CPU and GPU share memory, but the A100 needs to transfer to RAM/CPU for the parts that aren't supported by the GPU?
(My first guess)
Weryj · 5h ago
This would be because the GPU can't fill its wavefronts and hide memory latency, no? I'm curious about the reason why.
eachro · 4h ago
If you wanted to train it from scratch, how long would it take on a reasonable GPU setup?
rck · 2h ago
For the sake of comparison, you can train a 124M model on a 3090 (see nanoGPT). In that case, each batch ends up having about 500,000 tokens and takes roughly 10 seconds to run forward and backward. The 6 trillion tokens this model was trained on would then take about 4 years. Or just "too long" for a shorter answer.
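The back-of-the-envelope arithmetic behind that estimate, spelled out:

```python
tokens_total = 6e12        # training tokens quoted above
tokens_per_step = 5e5      # ~500,000 tokens per batch
seconds_per_step = 10      # rough 3090 forward+backward time from nanoGPT

steps = tokens_total / tokens_per_step    # 12 million steps
seconds = steps * seconds_per_step        # 1.2e8 seconds
print(seconds / (3600 * 24 * 365))        # ~3.8 years
```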
canyon289 · 3h ago
The word "reasonable" is vague, but assuming you mean something that could be run in a residential unit, it would take a very long time if training from scratch.
This is part of the rationale for releasing this model. Now you don't have to start from scratch, and finetuning is reasonable on a wide variety of hardware, including reasonable GPU setups (and smaller).
quesne · 2h ago
Thought it was a new 3270 interface, bummed.
n0vella · 6h ago
Do you think these very small models have some utility in the real world? Apart from learning and academic purposes of course.
canyon289 · 5h ago
Yes! To me the primary value is not just as a teaching or toy model. I see a lot of value in repeatable tasks if we think about enterprise use, and as a fast local developer model for individual usage.
Here are some examples that are inspired by previous roles I had outside of Google, where a business I was working in needed real-time text processing.
These tutorials were made with Gemma versions from a year ago, but could now be recreated with Gemma 270m:
https://developers.googleblog.com/en/gemma-for-streaming-ml-... https://www.youtube.com/watch?v=YxhzozLH1Dk
If you LoRA them you can make them VERY VERY good at a small, narrow set of tasks (a minimal sketch of the setup follows further below), e.g.:
- reply in a specific way, like a specific JSON schema, or in the voice of a character
- be very good at classifying text (e.g. emails, or spam)
- be a great summarizer for large amounts of text, e.g. turn emails into short titles or url slugs
- adding tags/categories per your pre-defined rules (e.g. for communities, tagging content, marketing)
- for detecting spam, or duplicates, or flagging things
You won't be able to write code or prose with these, but they're great for a huge array of very narrow use cases
What's neat about "stupid" models like this is that they're less likely to go off and dream up a bunch of irrelevant content, because they don't know much about the world / won't have too much context to pull from
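A minimal shape of what that LoRA setup can look like with the Hugging Face peft library; the model id, rank, and target_modules below are illustrative guesses, not a tuned recipe:

```python
# Sketch: attach LoRA adapters to a small causal LM for a narrow task.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("google/gemma-3-270m")  # assumed checkpoint id
config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # guess; inspect the model to confirm module names
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, config)
model.print_trainable_parameters()  # only the adapter weights are trainable
# ...then train on your narrow task (classification labels, JSON replies, tags, etc.)
```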
numpad0 · 4h ago
It seems to be more often correct than wrong for multilingual translation tasks (source text from [1][2]; [2]: https://vnexpress.net/lap-dien-mat-troi-mai-nha-tu-dung-co-t...). Rough, but probably useful as a traveler's phrase book.
You
Translate to English: Рошешуа́р (фр. Rochechouart) — місто, муніципалітет у Франції, у регіоні Нова Аквітанія, департамент Верхня В'єнна. Населення — 3637 осіб (2022)[1]. Муніципалітет розташований на відстані[2] близько 360 км на південь від Парижа, 34 км на захід від Ліможа.
gemma-3-270m-it-qat
Here's the translation:
"Rochechouart is a city in France, a municipality located in the region of New Aeterna, department within the Upper Vienne Province. The population is 3637 people (2022). The city is situated at a distance of approximately 360 kilometers from Paris, 34 km from Lyon.
You
Translate to English: Bộ Công Thương đề xuất hỗ trợ hộ gia đình lắp điện mặt trời mái nhà, có hệ thống lưu trữ tối đa 2-3 triệu đồng, hỗ trợ lãi suất vay trong tối đa ba năm. Nội dung được Bộ Công Thương đưa ra tại dự thảo lần 3 Quyết định của Thủ tướng về chính sách hỗ trợ cho hộ gia đình lắp đặt điện mặt trời mái nhà tự sản, tự tiêu và hệ thống lưu trữ điện.
gemma-3-270m-it-qat
The Ministry of Industry and Trade proposes supporting households with solar panels in their homes, with a maximum savings of 2-3 million đồng, providing loan assistance for up to 3 years. The content of the Ministry's decision at the Third Review of the Government's Policy on Support for Households Installing Solar Panels in Homes, approved by the Prime Minister about the policy support for households installing solar panels themselves, and the system storage capacity is maximum 2-3 million đồng.
For comparison, here's what I got from the 27B variant:
gemma3:27b-it-qat
Rochechouart (French: Rochechouart) is a town and commune in France, in the Nouvelle-Aquitaine region, Department of Haute-Vienne. The population is 3,637 (2022)[1]. The commune is located approximately 360 km south of Paris, 34 km west of Limoges.
gemma3:27b-it-qat
The Ministry of Industry and Trade proposes supporting households installing rooftop solar power systems, with a maximum support of 2-3 million VND for systems including energy storage. This support would also include interest rate subsidies on loans for a maximum of three years. This content was presented by the Ministry of Industry and Trade in the third draft of a Decision by the Prime Minister regarding support policies for households installing self-generated, self-consumed rooftop solar power systems and energy storage systems.
colechristensen · 5h ago
Sure, interacting with natural language without expectation that the model contains knowledge. Good for things like tool use and embeddings where the information is all retrieved.
throw310822 · 5h ago
Are these small models trained to privilege "raw intelligence" over factual knowledge? Is there any indication of how much of the current model is dedicated to the knowledge of multiple languages and tons of facts rather than pure understanding and reasoning?
canyon289 · 5h ago
The evaluations provide this indication. You'll see MMLU, GPQA, Big Bench etc in reports for many models. Those numbers provide the indication you're looking for.
To answer a question you didn't ask: with small models especially, we need to make choices about what to focus on. For this model we focused on text summarization and instruction following, with the idea that users would finetune to gain performance on the task set that is relevant to them.