Gemma 3n Architectural Innovations – Speculation and poking around in the model

14 points by nolist_policy | 5 comments | 5/25/2025, 7:08:34 PM | old.reddit.com

Comments (5)

impossiblefork · 14h ago
I think this is very interesting. Especially the per-layer embedding things.

Having more than one embedding is something I've tried myself, but not separate ones for each layer.

I'm guessing it's something like h_{l+1} = MultiHeadSelfAttentionWithPositionEncodingBakedIn(MLP(h_l) + embed_l(token_ids)). So it's probably really easy to implement on toy problems to see if it works.
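
A minimal PyTorch sketch of that guessed update rule (the block structure and names here are purely illustrative, not Gemma 3n's actual design; position encoding is assumed to be handled inside attention and is omitted):

```python
import torch
import torch.nn as nn

class PerLayerEmbedBlock(nn.Module):
    """Toy block implementing h_{l+1} = SelfAttn(MLP(h_l) + embed_l(token_ids))."""
    def __init__(self, vocab_size, d_model, n_head):
        super().__init__()
        self.embed_l = nn.Embedding(vocab_size, d_model)  # this layer's own embedding table
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )
        # position encoding "baked in" (e.g. rotary) is left out of this sketch
        self.attn = nn.MultiheadAttention(d_model, n_head, batch_first=True)

    def forward(self, h, token_ids):
        x = self.mlp(h) + self.embed_l(token_ids)
        out, _ = self.attn(x, x, x, need_weights=False)
        return out

# toy usage: batch of 2 sequences of length 8
block = PerLayerEmbedBlock(vocab_size=100, d_model=32, n_head=4)
ids = torch.randint(0, 100, (2, 8))
h = block(torch.zeros(2, 8, 32), ids)  # -> shape (2, 8, 32)
```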

3abiton · 13h ago
Any resources or suggestions to learn about this? The field is moving too fast, my poor brain can't keep up.

impossiblefork · 4h ago
Basically you'd familiarize yourself with transformers by implementing different variants of them, and changing them around according to your own ideas on different toy datasets.

Then you'd figure out a set of toy tasks that you like and think are important.

In this particular case you'd take something like nanoGPT, go to model.py, go to class GPT, go to __init__, and modify the self.transformer ModuleDict by changing the single nn.Embedding (wte) into a ModuleList of nn.Embedding, one per layer. Then you'd change the for loop at line 180 to loop over a range and modify forward by adding x = x + self.transformer.wte[i](idx) inside that loop, something like that I think (rough sketch below).

I haven't tried yet though (I've got a terrible cold, so I am on social media instead of doing anything sensible).
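
A self-contained toy version of that nanoGPT edit might look like this (the nanoGPT-style names wte, wpe, drop, h, ln_f, lm_head are kept; the per-layer-embedding change itself is untested speculation, and the causal mask is left out for brevity):

```python
import torch
import torch.nn as nn

class ToyPerLayerEmbedGPT(nn.Module):
    """Toy GPT-style model with a separate token embedding table per layer."""
    def __init__(self, vocab_size=100, block_size=64, n_layer=4, n_head=4, n_embd=64):
        super().__init__()
        self.n_layer = n_layer
        self.transformer = nn.ModuleDict(dict(
            # was: wte=nn.Embedding(vocab_size, n_embd)
            wte=nn.ModuleList([nn.Embedding(vocab_size, n_embd) for _ in range(n_layer)]),
            wpe=nn.Embedding(block_size, n_embd),
            drop=nn.Dropout(0.1),
            h=nn.ModuleList([
                nn.TransformerEncoderLayer(n_embd, n_head, 4 * n_embd,
                                           batch_first=True, norm_first=True)
                for _ in range(n_layer)
            ]),
            ln_f=nn.LayerNorm(n_embd),
        ))
        self.lm_head = nn.Linear(n_embd, vocab_size, bias=False)

    def forward(self, idx):
        pos = torch.arange(idx.size(1), device=idx.device)
        x = self.transformer.drop(self.transformer.wte[0](idx) + self.transformer.wpe(pos))
        # was: for block in self.transformer.h: x = block(x)
        for i in range(self.n_layer):
            if i > 0:
                x = x + self.transformer.wte[i](idx)  # inject this layer's own embedding
            x = self.transformer.h[i](x)
        return self.lm_head(self.transformer.ln_f(x))

logits = ToyPerLayerEmbedGPT()(torch.randint(0, 100, (2, 16)))  # -> (2, 16, 100)
```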

limoce · 10h ago
> https://preview.redd.it/wca7kzfq5w2f1.png?width=1190&format=...

"4x gated residual streams" look quite weird. Is there any paper or technique report for this?

3abiton · 12h ago
While PLE is quite innovative, the interesting part is that they released their [apk on github](https://github.com/google-ai-edge/gallery) rather than just linking to the Play Store. Interesting choice.