Hey all,
I created this model with a top notch team. I answered many questions last week when this hit the front page (https://news.ycombinator.com/item?id=44902148), and I'm happy to answer more here as well.
Personally I'm excited that you all have access to this model now and hope you all get value out of using it.
WithinReason · 1h ago
I would like to know your thoughts on using 2/3 of such a small model's size for embeddings. What would be different if you used a byte-level vocabulary and spent the parameter budget on transformer parameters instead? I think you would lose performance (tok/s) but might gain accuracy.
canyon289 · 1h ago
At this small scale the embeddings indeed were a big focus. Consider this thought process.
The tokens themselves are a form of compression. Let's say we have the word "WaffleHouse": at the character level this would be 11 tokens, but with the subword tokenizer it would be perhaps 2 or 3 tokens (I didn't actually run it through the tokenizer, but we could verify precisely). This matters a lot for on-device processing especially.
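For what it's worth, here is a quick sketch of how one could verify that count, assuming the Hugging Face tokenizer for the 270M checkpoint (the exact number depends on the tokenizer actually used):

    # Compare the character-level length with the subword tokenization.
    from transformers import AutoTokenizer

    tok = AutoTokenizer.from_pretrained("google/gemma-3-270m")  # assumed checkpoint id
    print(len("WaffleHouse"))           # 11 characters -> 11 tokens at character level
    print(tok.tokenize("WaffleHouse"))  # expected to be only a few subword pieces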
So while we could get more intelligence out of the model by bumping up the "knowledge" parameters, the device would need to process more input and output tokens.
Another advantage on small devices is that the embeddings are just a lookup table, which requires little to no computation. It's the rest of the parameters that have the expensive matrix multiplications, so if we increased those we'd also be increasing the number of FLOPs needed for a forward pass. This blog post explains it well: https://www.adamcasson.com/posts/transformer-flops
So all this to say: there are definite tradeoffs between model size, performance on evals, and compute cost. We ran many internal experiments with different choices to see what could work well, and then picked what we believed would work best for the open community.
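As a rough illustration of that tradeoff, here is a back-of-the-envelope sketch. It assumes the commonly cited split of roughly 170M embedding parameters and 100M transformer-block parameters for this model, plus the usual rule of thumb of ~2 FLOPs per matmul weight per token; the linked post has the fuller accounting.

    # Input embedding lookup: read one row of the table per token, essentially free.
    embedding_params = 170e6
    # Transformer blocks: every weight participates in a matrix multiply per token.
    transformer_params = 100e6

    flops_per_token = 2 * transformer_params  # ~2 FLOPs per matmul weight per token
    print(f"~{flops_per_token / 1e6:.0f} MFLOPs per token in the transformer blocks")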
Scene_Cast2 · 18m ago
How would this embedding matrix get trained with PyTorch? I currently have a toy Transformer network; I ended up marking the matrix as sparse and using SparseAdam, which gives a bit of a performance boost, but at the same time I can't use torch.compile() on the fetch from this matrix.
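For context, a minimal sketch of that setup (the shapes and the toy loss here are made up): an nn.Embedding created with sparse=True produces sparse gradients that SparseAdam can step, while the dense parameters keep a regular optimizer.

    import torch
    import torch.nn as nn

    vocab_size, d_model = 32_000, 256
    embed = nn.Embedding(vocab_size, d_model, sparse=True)  # sparse grads on the lookup
    proj = nn.Linear(d_model, vocab_size)                   # stand-in for the dense layers

    sparse_opt = torch.optim.SparseAdam(embed.parameters(), lr=1e-3)
    dense_opt = torch.optim.Adam(proj.parameters(), lr=1e-3)

    tokens = torch.randint(0, vocab_size, (8, 128))  # dummy batch of token ids
    loss = proj(embed(tokens)).pow(2).mean()         # placeholder loss for illustration

    sparse_opt.zero_grad(); dense_opt.zero_grad()
    loss.backward()                                  # embed.weight.grad is a sparse tensor
    sparse_opt.step(); dense_opt.step()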
tarruda · 1h ago
Thanks for your work, it is really an amazing small LM.
Can you share what kind of hardware is necessary to train it, and how long it took?
GaggiX · 2h ago
I imagine you and your team have finetuned the model on different tasks, can you share some results? (I have only seen the alien NPC finetuning)
Can someone (or OP) point me to a recipe to fine-tune a model like this for natural language tasks like complicated NER or similar workflows? I tried finetuning Gemma 3 270M when it came out last week without any success. A lot of tutorials are geared towards chat applications and role playing, but I feel this model could be great for use cases like mine, where I am trying to clean up and extract data from PDFs with entity identification and such.
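One possible shape for such a recipe, as a hedged sketch rather than an official one: the checkpoint id, the tiny inline dataset, and the hyperparameters are placeholder assumptions, and extraction is framed as plain causal-LM fine-tuning on prompt -> JSON pairs.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "google/gemma-3-270m-it"  # assumed checkpoint id; check the model card
    tok = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)

    # Toy extraction pairs: raw text in, structured entities out.
    examples = [
        ("Extract entities as JSON: Invoice 1234 from Acme Corp, due 2024-05-01.",
         '{"invoice_id": "1234", "vendor": "Acme Corp", "due_date": "2024-05-01"}'),
    ]

    optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
    model.train()
    for epoch in range(3):
        for prompt, target in examples:
            batch = tok(prompt + "\n" + target, return_tensors="pt")
            # Plain causal-LM objective; a real recipe would mask the prompt tokens
            # in the labels and pad/batch properly.
            out = model(**batch, labels=batch["input_ids"])
            out.loss.backward()
            optimizer.step()
            optimizer.zero_grad()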
keeeba · 24m ago
What use-cases do you see for the 270M’s embeddings, and should we be sticking to token embeddings or can we meaningfully pool for sentence/document embeddings?
Do we need to fine-tune for the embeddings to be meaningful at the sentence/document level?
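For what it's worth, the usual starting point is to mean-pool the last hidden states into a single vector; whether that is meaningful for this model without fine-tuning is exactly the open question here. A minimal sketch, with the checkpoint id assumed:

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "google/gemma-3-270m"  # assumed checkpoint id
    tok = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)

    batch = tok(["Gemma 270M is a small language model."], return_tensors="pt")
    with torch.no_grad():
        out = model(**batch, output_hidden_states=True)
    hidden = out.hidden_states[-1]                        # (batch, seq_len, d_model)
    mask = batch["attention_mask"].unsqueeze(-1).float()  # zero out padding positions
    sentence_vec = (hidden * mask).sum(1) / mask.sum(1)   # mean pool -> (batch, d_model)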
eachro · 18m ago
If you wanted to train it from scratch, how long would it take on a reasonable GPU setup?
lsb · 1h ago
That’s wild that with a KV cache and compilation on the Mac CPU you are faster than on an A100 GPU.
punnerud · 41m ago
Because on a Mac the CPU and GPU share memory, but the A100 needs to transfer to RAM/CPU for the parts that aren't supported by the GPU?
(My first guess)
Weryj · 1h ago
This would be because the GPU can't fill its wavefronts and hide memory latency, no? I'm curious what the reason actually is.
n0vella · 2h ago
Do you think these very small models have some utility in the real world? Apart from learning and academic purposes of course.
canyon289 · 1h ago
Yes! To me the primary value is not just as a teaching or toy model. I see a lot of value in repeatable tasks if we think about enterprise use, and as a fast local developer model for individual usage.
Here are some examples inspired by previous roles I had outside of Google, where a business I was working in needed real-time text processing.
These tutorials were made with Gemma versions from a year ago, but could now be recreated with Gemma 270M:
https://developers.googleblog.com/en/gemma-for-streaming-ml-...
https://www.youtube.com/watch?v=YxhzozLH1Dk
It seems to be more often correct than wrong for multilingual translation tasks (source text from [1][2]; [2]: https://vnexpress.net/lap-dien-mat-troi-mai-nha-tu-dung-co-t...). Rough, but probably useful as a traveler's phrase book.
You
Translate to English: Рошешуа́р (фр. Rochechouart) — місто, муніципалітет у Франції, у регіоні Нова Аквітанія, департамент Верхня В'єнна. Населення — 3637 осіб (2022)[1]. Муніципалітет розташований на відстані[2] близько 360 км на південь від Парижа, 34 км на захід від Ліможа.
gemma-3-270m-it-qat
Here's the translation:
"Rochechouart is a city in France, a municipality located in the region of New Aeterna, department within the Upper Vienne Province. The population is 3637 people (2022). The city is situated at a distance of approximately 360 kilometers from Paris, 34 km from Lyon.
You
Translate to English: Bộ Công Thương đề xuất hỗ trợ hộ gia đình lắp điện mặt trời mái nhà, có hệ thống lưu trữ tối đa 2-3 triệu đồng, hỗ trợ lãi suất vay trong tối đa ba năm. Nội dung được Bộ Công Thương đưa ra tại dự thảo lần 3 Quyết định của Thủ tướng về chính sách hỗ trợ cho hộ gia đình lắp đặt điện mặt trời mái nhà tự sản, tự tiêu và hệ thống lưu trữ điện.
gemma-3-270m-it-qat
The Ministry of Industry and Trade proposes supporting households with solar panels in their homes, with a maximum savings of 2-3 million đồng, providing loan assistance for up to 3 years. The content of the Ministry's decision at the Third Review of the Government's Policy on Support for Households Installing Solar Panels in Homes, approved by the Prime Minister about the policy support for households installing solar panels themselves, and the system storage capacity is maximum 2-3 million đồng.
Sure: interacting with natural language without the expectation that the model contains knowledge. Good for things like tool use and embeddings, where the information is all retrieved.
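A tiny sketch of that pattern, where every fact the model needs is already in the prompt and it only has to restructure it (the checkpoint id and prompt format are assumptions):

    from transformers import pipeline

    generate = pipeline("text-generation", model="google/gemma-3-270m-it")  # assumed id
    prompt = (
        "Order note: 'Ship 3 waffle irons to 12 Main St by Friday.'\n"
        "Return JSON with keys item, quantity, address, deadline:\n"
    )
    print(generate(prompt, max_new_tokens=64)[0]["generated_text"])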
throw310822 · 1h ago
Are these small models trained to privilege "raw intelligence" over factual knowledge? Is there any indication of how much of the current model is dedicated to knowledge of multiple languages and tons of facts, rather than to pure understanding and reasoning?
canyon289 · 1h ago
The evaluations provide this indication. You'll see MMLU, GPQA, BIG-bench, etc. in the reports for many models; those numbers give you the signal you're looking for.
To answer a question you didn't ask: with small models especially, we need to make choices about what to focus on. For this model we focused on text summarization and instruction following, with the idea that users would finetune to gain performance on the task set that is relevant to them.