"4x gated residual streams" look quite weird. Is there any paper or technique report for this?
impossiblefork · 6h ago
I think this is very interesting. Especially the per-layer embedding things.
Having more than one embedding is something I've tried myself, but not separate ones for each layer.
I'm guessing it's something like h_{l+1} = MultiHeadSelfAttentionWithPositionEncodingBakedIn(MLP(h_l) + embed_l(token_ids)). So it's probably really easy to implement on toy problems to see if it works.
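That guessed recurrence is easy to try on a toy problem. Below is a minimal numpy sketch of it, assuming one separate embedding table per layer (the PLE-style part); the attention and MLP here are deliberately toy stand-ins with no learned weights, just to show where `embed_l(token_ids)` enters each layer.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, d_model, n_layers, seq_len = 100, 16, 4, 8

# Standard input embedding plus one extra table per layer
# (hypothetical per-layer embeddings; shapes are assumptions).
input_embed = rng.normal(size=(vocab, d_model)) * 0.02
layer_embeds = rng.normal(size=(n_layers, vocab, d_model)) * 0.02

def mlp(h):
    # Stand-in for a learned per-layer MLP.
    return np.tanh(h)

def attention(h):
    # Toy single-head causal self-attention with no learned
    # projections, just to complete the recurrence.
    n = h.shape[0]
    scores = h @ h.T / np.sqrt(h.shape[-1])
    mask = np.tril(np.ones((n, n), dtype=bool))
    scores = np.where(mask, scores, -np.inf)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ h

token_ids = rng.integers(0, vocab, size=seq_len)
h = input_embed[token_ids]

for l in range(n_layers):
    # h_{l+1} = Attn(MLP(h_l) + embed_l(token_ids))
    h = attention(mlp(h) + layer_embeds[l][token_ids])

print(h.shape)  # (seq_len, d_model)
```

The only change from a vanilla transformer loop is the `layer_embeds[l][token_ids]` lookup, so it should slot into any small training setup to test whether per-layer lookups help.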
3abiton · 5h ago
Any resources or suggestions to learn about this? The field is moving too fast, my poor brain can't keep up.
3abiton · 4h ago
While PLE is quite innovative, the interesting part is that they released their [apk on github](https://github.com/google-ai-edge/gallery) rather than just linking to the Play Store. Interesting choice.
"4x gated residual streams" look quite weird. Is there any paper or technique report for this?
Having more than one embedding is something I've tried myself, but not separate ones for each layer.
I'm guessing it's something like h_{l+1} = MultiHeadSelfAttentionWithPositionEncodingBakedIn(MLP(h_l) + embed_l(token_ids)). So it's probably really easy to implement on toy problems to see if it works.