Who Invented Backpropagation?

89 points | by nothrowaways | 8/18/2025, 3:50:21 PM | people.idsia.ch | 41 comments

Comments (41)

cs702 · 54m ago
Whatever the facts, the OP comes across as sour grapes. The author, Jürgen Schmidhuber, believes Hopfield and Hinton did not deserve their Nobel Prize in Physics, and that Hinton, Bengio, and LeCun did not deserve their Turing Award. Evidently, many other scientists disagree, because both awards were granted in consultation with the scientific community. Schmidhuber's own work was, in fact, cited by the Nobel Prize committee as background information for the 2024 Nobel.[a] Only future generations of scientists, looking at the past more objectively, will be able to settle these disputes.

[a] https://www.nobelprize.org/uploads/2024/11/advanced-physicsp...

empiko · 27m ago
I think the unspoken claim here is that the North American scientific establishment takes credit for work done elsewhere and elevates certain personas instead of the true innovators, who get overlooked. Arguing that the establishment doesn't agree with this idea is kinda pointless.
icelancer · 30m ago
Didn't click the article, came straight to the comments thinking "I bet it's Schmidhuber being salty."

Some things never change.

pncnmnp · 1h ago
I have a question that's bothered me for quite a while now. In 2018, Michael Jordan (UC Berkeley) wrote a rather interesting essay - https://medium.com/@mijordan3/artificial-intelligence-the-re... (Artificial Intelligence — The Revolution Hasn’t Happened Yet)

In it, he stated the following:

> Indeed, the famous “backpropagation” algorithm that was rediscovered by David Rumelhart in the early 1980s, and which is now viewed as being at the core of the so-called “AI revolution,” first arose in the field of control theory in the 1950s and 1960s. One of its early applications was to optimize the thrusts of the Apollo spaceships as they headed towards the moon.

I was wondering whether anyone could point me to the paper or piece of work he was referring to. There are many citations in Schmidhuber’s piece, and in my previous attempts I've gotten lost in papers.

drsopp · 1h ago
Perhaps this:

Henry J. Kelley (1960). Gradient Theory of Optimal Flight Paths.

[1] https://claude.ai/public/artifacts/8e1dfe2b-69b0-4f2c-88f5-0...

pncnmnp · 1h ago
Thanks! This might be it. I looked up Henry J. Kelley on Wikipedia, and in the notes I found a citation to this paper from Stuart Dreyfus (Berkeley): "Artificial Neural Networks, Back Propagation and the Kelley-Bryson Gradient Procedure" (https://gwern.net/doc/ai/nn/1990-dreyfus.pdf).

I am still going through it, but the latter is quite interesting!

psYchotic · 1h ago
pncnmnp · 1h ago
Apologies - I should have been clear. I was not referring to Rumelhart et al., but to pieces of work that point to "optimizing the thrusts of the Apollo spaceships" using backprop.
costates-maybe · 1h ago
I don't know if there is a particular paper exactly, but Ben Recht has a discussion of the relationship between techniques in optimal control that became prominent in the 60's, and backpropagation:

https://archives.argmin.net/2016/05/18/mates-of-costate/

duped · 1h ago
They're probably talking about Kalman Filters (1961) and LMS filters (1960).
pjbk · 50m ago
To be fair, any multivariable regulator or filter (estimator) with a quadratic cost (LQR/LQE) will naturally yield a solution similar to backpropagation when an iterative algorithm is used to optimize its cost or error function through a differentiable tangent space.
cubefox · 55m ago
> ... first arose in the field of control theory in the 1950s and 1960s. One of its early applications was to optimize the thrusts of the Apollo spaceships as they headed towards the moon.

I think "its" refers to control theory, not backpropagation.

dataflow · 1h ago
I asked ChatGPT and it gave a plausible answer, but I haven't fact-checked it. It says "what you’re thinking of is the “adjoint/steepest-descent” optimal-control method (the same reverse-mode idea behind backprop), developed in aerospace in the early 1960s and applied to Apollo-class vehicles." It gave the following references:

- Henry J. Kelley (1960), “Gradient Theory of Optimal Flight Paths,” ARS Journal.

- A.E. Bryson & W.F. Denham (1962), “A Steepest-Ascent Method for Solving Optimum Programming Problems,” Journal of Applied Mechanics.

- B.G. Junkin (1971), “Application of the Steepest-Ascent Method to an Apollo Three-Dimensional Reentry Optimization Problem,” NASA/MSFC report.

throawayonthe · 1h ago
it's rude to show people your llm output
aeonik · 33m ago
I don't think it's rude. It saves me from having to come up with my own prompt and wade through the back-and-forth to get useful insight from the LLMs, and it also saves me from spending my own tokens.

Also, I quite love it when people clearly demarcate which part of their content came from an LLM and specify which model.

The little citation carries a huge amount of useful information.

The folks who don't like AI should like it too, as they can easily filter out that content.

drsopp · 1h ago
Why?
danieldk · 53m ago
Because it is terribly low-effort. People are here for interesting and insightful discussions with other humans. If they were interested in unverified LLM output… they would ask an LLM?
drsopp · 44m ago
Who cares if it is low effort? I got lots of upvotes for my link to Claude about this, and pncnmnp seems happy. The downvoted comment from ChatGPT was maybe a bit spammy?
lcnPylGDnU4H9OF · 8m ago
> Who cares if it is low effort?

It's a weird thing to wonder after so many people expressed their dislike of the upthread low-effort comment with a downvote (and then another voiced a more explicit opinion). The point is that a reader may want to know that the text they're reading is something a human took the time to write themselves. That fact is what makes it valuable.

> pncnmnp seems happy

They just haven't commented. There is no reason to attribute this specific motive to that fact.

mindcrime · 1h ago
Who didn't? Depending on exactly how you interpret the notion of "inventing backpropagation," it's been invented, forgotten, re-invented, forgotten again, re-re-invented, and so on, about 7 or 8 times. And no, I don't have specific citations in front of me, but I will say that a lot of interesting bits about the history of the development of neural networks (including backpropagation) can be found in the book Talking Nets: An Oral History of Neural Networks [1].

[1]: https://www.amazon.com/Talking-Nets-History-Neural-Networks/...

convolvatron · 1h ago
Don't undergrad adaptive filters count?

https://en.wikipedia.org/wiki/Adaptive_filter

It doesn't need differentiation of the forward term, but if you squint, it looks pretty close.

pjbk · 1h ago
As stated, I always thought it came from formulations like the Euler-Lagrange procedures in mechanics used in numerical methods for differential geometry. In fact, when I recreated the algorithm as an exercise, it immediately reminded me of gradient descent for kinematics, with the Jacobian calculation for each layer resembling an iterative pose calculation in generalized coordinates. I never thought it was something "novel".
mystraline · 1h ago
> BP's modern version (also called the reverse mode of automatic differentiation)

So... Automatic integration?

Proportional, integral, derivative. A PID loop sure sounds like what they're talking about.

eigenspace · 1h ago
Reverse-mode automatic differentiation is not integration. It's still differentiation, just a different method of calculating the derivative than the one you'd think to do by hand. It basically applies the chain rule in the opposite order from what is intuitive to people.

It has a lot more overhead than regular forward-mode autodiff because you need to cache values from running the function and refer back to them in reverse order. But the advantage is that for functions with many inputs and very few outputs (e.g. the classic case of calculating the gradient of a scalar function in a high-dimensional space, as in gradient descent), it is algorithmically more efficient and requires only one pass through the primal function.

On the other hand, traditional forward-mode derivatives are most efficient for functions with very few inputs but many outputs. It's essentially a duality relationship.
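
A toy sketch might make that concrete (my own illustration, not from the article; the Var class and backward() helper are made up for the example): the forward pass caches values and local derivatives, and the backward pass replays them in reverse, which is just the chain rule applied from the output side.

    import math

    # Minimal reverse-mode autodiff sketch: each Var remembers its parents and
    # the local partial derivative w.r.t. each parent (computed during the
    # forward pass, i.e. the "cached values" mentioned above).
    class Var:
        def __init__(self, value, parents=()):
            self.value = value
            self.parents = parents  # list of (parent Var, local derivative)
            self.grad = 0.0

        def __add__(self, other):
            return Var(self.value + other.value, [(self, 1.0), (other, 1.0)])

        def __mul__(self, other):
            return Var(self.value * other.value,
                       [(self, other.value), (other, self.value)])

    def sin(x):
        return Var(math.sin(x.value), [(x, math.cos(x.value))])

    def backward(output):
        # One reverse sweep: push (node, adjoint) pairs and accumulate into
        # .grad. A real implementation would visit nodes in reverse
        # topological order so each node is processed exactly once.
        stack = [(output, 1.0)]
        while stack:
            node, adjoint = stack.pop()
            node.grad += adjoint
            for parent, local in node.parents:
                stack.append((parent, adjoint * local))

    # f(x, y) = sin(x*y) + x: one forward pass, one backward pass, and we get
    # both partials (df/dx = y*cos(x*y) + 1, df/dy = x*cos(x*y)) at once.
    x, y = Var(1.5), Var(-2.0)
    f = sin(x * y) + x
    backward(f)
    print(f.value, x.grad, y.grad)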

stephencanon · 57m ago
I don't think most people think to do either direction by hand; it's all just matrix multiplication, and you can multiply the Jacobians in whatever order makes it easier.
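
A rough back-of-the-envelope illustration (my own numbers, just a sketch): both groupings are plain matrix products and give the same gradient, but for a scalar output the grouping that starts from the output side keeps every intermediate a row vector, while the other carries a full matrix along.

    import numpy as np

    # Jacobians of a toy three-layer, scalar-output chain f3(f2(f1(x))):
    #   J1: (100, 10_000), J2: (100, 100), J3: (1, 100)
    rng = np.random.default_rng(0)
    J1 = rng.standard_normal((100, 10_000))
    J2 = rng.standard_normal((100, 100))
    J3 = rng.standard_normal((1, 100))

    # Reverse-mode grouping: start at the output; intermediates stay vectors.
    #   ~1*100*100 + 1*100*10_000 ≈ 1e6 multiply-adds
    g_rev = (J3 @ J2) @ J1

    # Forward-mode grouping: start at the input; carries a (100, 10_000) matrix.
    #   ~100*100*10_000 + 1*100*10_000 ≈ 1e8 multiply-adds
    g_fwd = J3 @ (J2 @ J1)

    assert np.allclose(g_rev, g_fwd)  # same gradient, ~100x fewer ops above
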
digikata · 58m ago
There are large bodies of work on optimization in state-space control theory that I strongly suspect have a lot of crossover with AI, and that at least have a very similar mathematical structure.

e.g. optimizing state-space control coefficients looks something like training an LLM weight matrix...

imtringued · 1h ago
Forward mode automatic differentiation creates a formula for each scalar derivative. If you have a billion parameters, you have to calculate each derivative from scratch.

As the name implies, the calculation is done forward.

Reverse mode automatic differentiation starts from the root of the symbolic expression and calculates the derivative for each subexpression simultaneously.

The difference between the two is like the difference between calculating the Fibonacci sequence recursively without memoization and calculating it iteratively. You avoid doing redundant work over and over again.
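
For contrast with the reverse-mode toy upthread, here's a forward-mode sketch using dual numbers (again my own illustration, with a made-up Dual class): each pass through the function carries the derivative along a single input direction, so getting all the partials of a billion-parameter function would take a billion forward passes.

    import math

    # Minimal forward-mode autodiff via dual numbers: a value plus its
    # derivative along one chosen input direction.
    class Dual:
        def __init__(self, value, deriv=0.0):
            self.value, self.deriv = value, deriv
        def __add__(self, other):
            return Dual(self.value + other.value, self.deriv + other.deriv)
        def __mul__(self, other):
            return Dual(self.value * other.value,
                        self.deriv * other.value + self.value * other.deriv)

    def sin(x):
        return Dual(math.sin(x.value), math.cos(x.value) * x.deriv)

    def f(x, y):              # same toy function: f(x, y) = sin(x*y) + x
        return sin(x * y) + x

    # One full evaluation per partial derivative (seed deriv=1 for that input):
    df_dx = f(Dual(1.5, 1.0), Dual(-2.0, 0.0)).deriv   # y*cos(x*y) + 1
    df_dy = f(Dual(1.5, 0.0), Dual(-2.0, 1.0)).deriv   # x*cos(x*y)
    print(df_dx, df_dy)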

bjornsing · 37m ago
The chain rule was explored by Gottfried Wilhelm Leibniz and Isaac Newton in the 17th century. Either of them would have "invented" backpropagation in an instant. It's obvious.
_fizz_buzz_ · 33m ago
Funnily enough, for me it was the other way around: I always knew how to compute the chain rule, but I only really understood what it means when I read up on what backpropagation was.
fritzo · 1h ago
TIL that the same Shun'ichi Amari who founded information geometry also made early advances in gradient descent.
dicroce · 46m ago
Isn't it just kinda a natural thing once you have the chain rule?
Anon84 · 28m ago
Can we backpropagate credit?
PunchTornado · 42m ago
Funny that Hinton is not mentioned. Like, how childish can the author be?
uoaei · 48m ago
Calling the implementation of the chain rule "inventing" is most of the problem here.
caycep · 1h ago
This fight has become legendary and infamous, and it also pops up on HN every 2-3 years.
aaroninsf · 57m ago
When I worked on neural networks, I was taught it was David Rumelhart.
cubefox · 1h ago
See also: The Backstory of Backpropagation - https://yuxi.ml/essays/posts/backstory-of-backpropagation/
dudu24 · 1h ago
It's just an application of the chain rule. It's not interesting to ask who invented it.
qarl · 1h ago
From the article:

Some ask: "Isn't backpropagation just the chain rule of Leibniz (1676) [LEI07-10] & L'Hopital (1696)?" No, it is the efficient way of applying the chain rule to big networks with differentiable nodes (see Sec. XII of [T22][DLH]). (There are also many inefficient ways of doing this.) It was not published until 1970 [BP1].

uoaei · 44m ago
The article says that, but it's overcomplicating to the point of being actually wrong. You could, I suppose, argue that the big innovation is the application of vectorization to the chain rule (by virtue of the matmul-based architecture of your usual feedforward network), which is a true combination of two mathematical technologies. But it feels like this, and indeed most "innovations" in ML, are only considered as such due to brainrot derived from trying to take maximal credit for minimal work (i.e., IP).