I think this is very interesting, especially the per-layer embeddings.
Having more than one embedding is something I've tried myself, but not separate ones for each layer.
I'm guessing it's something like h_{l+1} = MultiHeadSelfAttentionWithPositionEncodingBakedIn(MLP(h_l) + embed_l(token_ids)). So it's probably really easy to implement on toy problems to see if it works.
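A minimal sketch of that guess in PyTorch (naming is all mine, the position encoding is left out, and a real block would also want the usual residuals and norms):

```python
import torch
import torch.nn as nn

class PerLayerEmbeddingBlock(nn.Module):
    """One transformer block that owns its own token-embedding table embed_l."""
    def __init__(self, d_model=64, n_heads=4, vocab_size=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)        # embed_l(token_ids)
        self.mlp = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, h, token_ids):
        # Literal reading of the guess: h_{l+1} = Attn(MLP(h_l) + embed_l(token_ids))
        x = self.mlp(h) + self.embed(token_ids)
        out, _ = self.attn(x, x, x, need_weights=False)
        return out

# toy shape check (no causal mask here): stack a few blocks, each with its own table
blocks = nn.ModuleList(PerLayerEmbeddingBlock() for _ in range(3))
ids = torch.randint(0, 256, (2, 16))
h = torch.zeros(2, 16, 64)
for blk in blocks:
    h = blk(h, ids)
print(h.shape)  # torch.Size([2, 16, 64])
```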
3abiton · 13h ago
Any resources or suggestions to learn about this? The field is moving too fast, my poor brain can't keep up.
impossiblefork · 4h ago
Basically, you'd familiarize yourself with transformers by implementing different variants of them and modifying them according to your own ideas on different toy datasets.
Then you'd figure out a set of toy tasks that you like and think are important.
In this particular case you'd take something like NanoGPT, go to model.py, find class GPT, and in __init__ change the nn.Embedding in the self.transformer ModuleDict to a ModuleList of nn.Embedding; then change the for loop over the blocks (around line 180) to loop over a range, and modify forward by adding x = x + self.transformer.wte[i](idx) inside it, something like that I think.
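From memory, the relevant bits of model.py would end up looking roughly like this (attribute names such as transformer.wte and transformer.h are as I remember them, so check against the actual repo):

```python
# GPT.__init__: one token-embedding table per layer instead of a single wte
self.transformer = nn.ModuleDict(dict(
    wte  = nn.ModuleList([nn.Embedding(config.vocab_size, config.n_embd)
                          for _ in range(config.n_layer)]),
    wpe  = nn.Embedding(config.block_size, config.n_embd),
    drop = nn.Dropout(config.dropout),
    h    = nn.ModuleList([Block(config) for _ in range(config.n_layer)]),
    ln_f = LayerNorm(config.n_embd, bias=config.bias),
))
# NB: NanoGPT ties lm_head.weight to wte.weight; with a ModuleList that tying
# has to be dropped or tied to just one of the tables.

# GPT.forward: add the layer's own token embedding before each block
pos = torch.arange(0, idx.size(1), dtype=torch.long, device=idx.device)
x = self.transformer.drop(self.transformer.wpe(pos))   # start from position embeddings only
for i in range(len(self.transformer.h)):
    x = x + self.transformer.wte[i](idx)               # per-layer embed_l(token_ids)
    x = self.transformer.h[i](x)
x = self.transformer.ln_f(x)
```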
I haven't tried yet though (I've got a terrible cold, so I am on social media instead of doing anything sensible).
"4x gated residual streams" look quite weird. Is there any paper or technique report for this?
3abiton · 12h ago
While PLE is quite innovative, the interesting part is that they released their [APK on GitHub](https://github.com/google-ai-edge/gallery) rather than only linking to the Play Store. Interesting choice.