Self-attention transforms a prompt into a low-rank weight-update

13 Labo333 1 7/28/2025, 6:29:18 AM arxiv.org ↗

Comments (1)

imtringued · 5h ago
>However, in the case of In-Context-Learning (ICL), there is no immediate explicit weight update that could explain the emergent dynamical nature of trained LLMs that seem to re-organize or reconfigure themselves at the instruction of a user prompt. This mysterious and extremely helpful property of LLMs has led researchers to conjecture an implicit form of weight updates taking place at inference time when a prompt is consumed [6–11]. Recent works have even been able to show that toy models of transformer blocks implicitly performs a sort of gradient descent optimization [7, 9, 10].

I wouldn't call it gradient descent. Residual connections of the form x_{i+1} = x_i + f(x_i) essentially form an "update rule" with one iteration per layer. Newton's method, gradient descent, fixed-point iteration, conjugate gradient methods, ODE integration, etc. can all be expressed as an "update rule" that takes a previous value and adds a correction to produce a new value. It would be more accurate to say that each residual layer is a universal approximator of any imaginable update rule, including gradient descent.
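A minimal sketch of that point (mine, not from the paper): the same residual step x_{k+1} = x_k + f(x_k) reproduces different classical iterations depending on what f computes. The loss, matrix A, vector b, and step size eta below are illustrative assumptions chosen so the problem has a closed-form answer to compare against.

```python
# Residual update x_{k+1} = x_k + f(x_k) as a generic "update rule".
# Plugging in different f's recovers gradient descent or Newton's method.
import numpy as np

A = np.array([[3.0, 1.0], [1.0, 2.0]])   # SPD matrix -> unique minimizer A^{-1} b
b = np.array([1.0, 1.0])
eta = 0.1                                # step size for gradient descent

def grad(x):
    # Gradient of the quadratic loss L(x) = 0.5 * x^T A x - b^T x
    return A @ x - b

def residual_step(x, f):
    # One "layer": new state = old state + correction
    return x + f(x)

# Gradient descent: f(x) = -eta * grad(x)
x = np.zeros(2)
for _ in range(200):
    x = residual_step(x, lambda z: -eta * grad(z))

# Newton's method: f(x) = -H^{-1} grad(x); for a quadratic, H = A is constant
y = np.zeros(2)
for _ in range(5):
    y = residual_step(y, lambda z: -np.linalg.solve(A, grad(z)))

print(x)                      # ~ [0.2, 0.4] after many small steps
print(y)                      # ~ [0.2, 0.4] after a handful of Newton steps
print(np.linalg.solve(A, b))  # exact minimizer [0.2, 0.4]
```

Both loops use the same residual_step; only the correction f changes, which is the sense in which a stack of residual layers can in principle realize gradient descent among many other iterations.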