Who Invented Backpropagation?

89 points | by nothrowaways | 8/18/2025, 3:50:21 PM | people.idsia.ch | 41 comments

Comments (41)

cs702 · 54m ago
Whatever the facts, the OP comes across as sour grapes. The author, Jürgen Schmidhuber, believes Hopfield and Hinton did not deserve their Nobel Prize in Physics, and that Hinton, Bengio, and LeCun did not deserve their Turing Award. Evidently, many other scientists disagree, because both awards were granted in consultation with the scientific community. Schmidhuber's own work was, in fact, cited by the Nobel Prize committee as background information for the 2024 Nobel.[a] Only future generations of scientists, looking at the past more objectively, will be able to settle these disputes.

[a] https://www.nobelprize.org/uploads/2024/11/advanced-physicsp...

empiko · 27m ago
I think the unspoken claim here is that the North American scientific establishment takes credit for work done elsewhere and elevates certain personas instead of the true innovators, who get overlooked. Arguing that the establishment doesn't agree with this idea is kinda pointless.
icelancer · 30m ago
Didn't click the article, came straight to the comments thinking "I bet it's Schmidhuber being salty."

Some things never change.

pncnmnp · 1h ago
I have a question that's bothered me for quite a while now. In 2018, Michael Jordan (UC Berkeley) wrote a rather interesting essay - https://medium.com/@mijordan3/artificial-intelligence-the-re... (Artificial Intelligence — The Revolution Hasn’t Happened Yet)

In it, he stated the following:

> Indeed, the famous “backpropagation” algorithm that was rediscovered by David Rumelhart in the early 1980s, and which is now viewed as being at the core of the so-called “AI revolution,” first arose in the field of control theory in the 1950s and 1960s. One of its early applications was to optimize the thrusts of the Apollo spaceships as they headed towards the moon.

I was wondering whether anyone could point me to the paper or piece of work he was referring to. There are many citations in Schmidhuber’s piece, and in my previous attempts I've gotten lost in papers.

drsopp · 1h ago
Perhaps this:

Henry J. Kelley (1960). Gradient Theory of Optimal Flight Paths.

[1] https://claude.ai/public/artifacts/8e1dfe2b-69b0-4f2c-88f5-0...

pncnmnp · 1h ago
Thanks! This might be it. I looked up Henry J. Kelley on Wikipedia, and in the notes I found a citation to this paper from Stuart Dreyfus (Berkeley): "Artificial Neural Networks, Back Propagation and the Kelley-Bryson Gradient Procedure" (https://gwern.net/doc/ai/nn/1990-dreyfus.pdf).

I am still going through it, but the latter is quite interesting!

psYchotic · 1h ago
pncnmnp · 1h ago
Apologies - I should have been clear. I was not referring to Rumelhart et al., but to pieces of work that point to "optimizing the thrusts of the Apollo spaceships" using backprop.
costates-maybe · 1h ago
I don't know if there is a particular paper exactly, but Ben Recht has a discussion of the relationship between techniques in optimal control that became prominent in the 60's, and backpropagation:

https://archives.argmin.net/2016/05/18/mates-of-costate/

duped · 1h ago
They're probably talking about Kalman Filters (1961) and LMS filters (1960).
pjbk · 50m ago
To be fair, any multivariable regulator or filter (estimator) with a quadratic cost (LQR/LQE) will naturally yield a solution similar to backpropagation when an iterative algorithm is used to optimize its cost or error function through a differentiable tangent space.
cubefox · 55m ago
> ... first arose in the field of control theory in the 1950s and 1960s. One of its early applications was to optimize the thrusts of the Apollo spaceships as they headed towards the moon.

I think "its" refers to control theory, not backpropagation.

dataflow · 1h ago
I asked ChatGPT and it gave a plausible answer, but I haven't fact-checked it. It says "what you’re thinking of is the “adjoint/steepest-descent” optimal-control method (the same reverse-mode idea behind backprop), developed in aerospace in the early 1960s and applied to Apollo-class vehicles." It gave the following references:

- Henry J. Kelley (1960), “Gradient Theory of Optimal Flight Paths,” ARS Journal.

- A.E. Bryson & W.F. Denham (1962), “A Steepest-Ascent Method for Solving Optimum Programming Problems,” Journal of Applied Mechanics.

- B.G. Junkin (1971), “Application of the Steepest-Ascent Method to an Apollo Three-Dimensional Reentry Optimization Problem,” NASA/MSFC report.

throawayonthe · 1h ago
it's rude to show people your llm output
aeonik · 33m ago
I don't think it's rude. It saves me from having to come up with my own prompt and wade through the back-and-forth to get useful insight from the LLMs, and it also saves me from spending my own tokens.

Also, I quite love it when people clearly demarcate which part of their content came from an LLM and specify which model.

The little citation carries a huge amount of useful information.

The folks who don't like AI should like it too, as they can easily filter out that content.

drsopp · 1h ago
Why?
danieldk · 53m ago
Because it is terribly low-effort. People are here for interesting and insightful discussions with other humans. If they were interested in unverified LLM output… they would ask an LLM?
drsopp · 44m ago
Who cares if it is low effort? I got lots of upvotes for my link to Claude about this, and pncnmnp seems happy. The downvoted comment from ChatGPT was maybe a bit spammy?
lcnPylGDnU4H9OF · 8m ago
> Who cares if it is low effort?

It's a weird thing to wonder after so many people expressed their dislike of the upthread low-effort comment with a downvote (and then another voiced a more explicit opinion). The point is that a reader may want to know that the text they're reading is something a human took the time to write themselves. That fact is what makes it valuable.

> pncnmnp seems happy

They just haven't commented. There is no reason to attribute this specific motive to that fact.

mindcrime · 1h ago
Who didn't? Depending on exactly how you interpret the notion of "inventing backpropagation," it's been invented, forgotten, re-invented, forgotten again, re-re-invented, and so on, about 7 or 8 times. And no, I don't have specific citations in front of me, but I will say that a lot of interesting bits about the history of the development of neural networks (including backpropagation) can be found in the book Talking Nets: An Oral History of Neural Networks [1].

[1]: https://www.amazon.com/Talking-Nets-History-Neural-Networks/...

convolvatron · 1h ago
Don't undergrad adaptive filters count?

https://en.wikipedia.org/wiki/Adaptive_filter

It doesn't need differentiation of the forward term, but if you squint, it looks pretty close.

pjbk · 1h ago
As stated, I always thought it came from formulations like the Euler-Lagrange procedures in mechanics used in numerical methods for differential geometry. In fact, when I recreated the algorithm as an exercise, it immediately reminded me of gradient descent for kinematics, with the Jacobian calculation for each layer resembling an iterative pose calculation in generalized coordinates. I never thought it was something "novel".
mystraline · 1h ago
> BP's modern version (also called the reverse mode of automatic differentiation)

So... Automatic integration?

Proportional, integral, derivative. A PID loop sure sounds like what they're talking about.

eigenspace · 1h ago
Reverse-mode automatic differentiation is not integration. It's still differentiation, just a different method of calculating the derivative than the one you'd think to do by hand. It basically applies the chain rule in the opposite order from what is intuitive to people.

It has a lot more overhead than regular forward-mode autodiff because you need to cache values from running the function and refer back to them in reverse order. But the advantage is that for functions with many inputs and very few outputs (e.g. the classic case of calculating the gradient of a scalar function in a high-dimensional space, as in gradient descent), it is algorithmically more efficient and requires only one pass through the primal function.

On the other hand, traditional forward-mode derivatives are most efficient for functions with very few inputs but many outputs. It's essentially a duality relationship.
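
A toy sketch might make that concrete (my own illustration, not from the article; the Var class and backward() helper are made up for the example): the forward pass caches values and local derivatives, and the backward pass replays them in reverse, which is just the chain rule applied from the output side.

    import math

    # Minimal reverse-mode autodiff sketch: each Var remembers its parents and
    # the local partial derivative w.r.t. each parent (computed during the
    # forward pass, i.e. the "cached values" mentioned above).
    class Var:
        def __init__(self, value, parents=()):
            self.value = value
            self.parents = parents  # list of (parent Var, local derivative)
            self.grad = 0.0

        def __add__(self, other):
            return Var(self.value + other.value, [(self, 1.0), (other, 1.0)])

        def __mul__(self, other):
            return Var(self.value * other.value,
                       [(self, other.value), (other, self.value)])

    def sin(x):
        return Var(math.sin(x.value), [(x, math.cos(x.value))])

    def backward(output):
        # One reverse sweep: push (node, adjoint) pairs and accumulate into
        # .grad. A real implementation would visit nodes in reverse
        # topological order so each node is processed exactly once.
        stack = [(output, 1.0)]
        while stack:
            node, adjoint = stack.pop()
            node.grad += adjoint
            for parent, local in node.parents:
                stack.append((parent, adjoint * local))

    # f(x, y) = sin(x*y) + x: one forward pass, one backward pass, and we get
    # both partials (df/dx = y*cos(x*y) + 1, df/dy = x*cos(x*y)) at once.
    x, y = Var(1.5), Var(-2.0)
    f = sin(x * y) + x
    backward(f)
    print(f.value, x.grad, y.grad)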

stephencanon · 57m ago
I don't think most people think to do either direction by hand; it's all just matrix multiplication, and you can multiply the Jacobians in whatever order makes it easier.
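
A rough back-of-the-envelope illustration (my own numbers, just a sketch): both groupings are plain matrix products and give the same gradient, but for a scalar output the grouping that starts from the output side keeps every intermediate a row vector, while the other carries a full matrix along.

    import numpy as np

    # Jacobians of a toy three-layer, scalar-output chain f3(f2(f1(x))):
    #   J1: (100, 10_000), J2: (100, 100), J3: (1, 100)
    rng = np.random.default_rng(0)
    J1 = rng.standard_normal((100, 10_000))
    J2 = rng.standard_normal((100, 100))
    J3 = rng.standard_normal((1, 100))

    # Reverse-mode grouping: start at the output; intermediates stay vectors.
    #   ~1*100*100 + 1*100*10_000 ≈ 1e6 multiply-adds
    g_rev = (J3 @ J2) @ J1

    # Forward-mode grouping: start at the input; carries a (100, 10_000) matrix.
    #   ~100*100*10_000 + 1*100*10_000 ≈ 1e8 multiply-adds
    g_fwd = J3 @ (J2 @ J1)

    assert np.allclose(g_rev, g_fwd)  # same gradient, ~100x fewer ops above
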
digikata · 58m ago
There are large bodies of work on optimization in state-space control theory that I strongly suspect have a lot of crossover with AI, and that at least have a very similar mathematical structure.

e.g. optimizing state-space control coefficients looks something like training an LLM weight matrix...

imtringued · 1h ago
Forward mode automatic differentiation creates a formula for each scalar derivative. If you have a billion parameters, you have to calculate each derivative from scratch.

As the name implies, the calculation is done forward.

Reverse mode automatic differentiation starts from the root of the symbolic expression and calculates the derivative for each subexpression simultaneously.

The difference between the two is like the difference between calculating the Fibonacci sequence recursively without memoization and calculating it iteratively. You avoid doing redundant work over and over again.
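
For contrast with the reverse-mode toy upthread, here's a forward-mode sketch using dual numbers (again my own illustration, with a made-up Dual class): each pass through the function carries the derivative along a single input direction, so getting all the partials of a billion-parameter function would take a billion forward passes.

    import math

    # Minimal forward-mode autodiff via dual numbers: a value plus its
    # derivative along one chosen input direction.
    class Dual:
        def __init__(self, value, deriv=0.0):
            self.value, self.deriv = value, deriv
        def __add__(self, other):
            return Dual(self.value + other.value, self.deriv + other.deriv)
        def __mul__(self, other):
            return Dual(self.value * other.value,
                        self.deriv * other.value + self.value * other.deriv)

    def sin(x):
        return Dual(math.sin(x.value), math.cos(x.value) * x.deriv)

    def f(x, y):              # same toy function: f(x, y) = sin(x*y) + x
        return sin(x * y) + x

    # One full evaluation per partial derivative (seed deriv=1 for that input):
    df_dx = f(Dual(1.5, 1.0), Dual(-2.0, 0.0)).deriv   # y*cos(x*y) + 1
    df_dy = f(Dual(1.5, 0.0), Dual(-2.0, 1.0)).deriv   # x*cos(x*y)
    print(df_dx, df_dy)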

bjornsing · 37m ago
The chain rule was explored by Gottfried Wilhelm Leibniz and Isaac Newton in the 17th century. Either of them would have "invented" backpropagation in an instant. It's obvious.
_fizz_buzz_ · 33m ago
Funnily enough, for me it was the other way around: I always knew how to compute the chain rule, but I only really understood what it means when I read up on what backpropagation was.
fritzo · 1h ago
TIL that the same Shun'ichi Amari who founded information geometry also made early advances in gradient descent.
dicroce · 46m ago
Isn't it just kinda a natural thing once you have the chain rule?
Anon84 · 28m ago
Can we backpropagate credit?
PunchTornado · 42m ago
Funny that Hinton is not mentioned. Like, how childish can the author be?
uoaei · 48m ago
Calling the implementation of the chain rule "inventing" is most of the problem here.
caycep · 1h ago
This fight has become legendary and infamous, and it also pops up on HN every 2-3 years.
aaroninsf · 57m ago
When I worked on neural networks, I was taught it was David Rumelhart.
cubefox · 1h ago
See also: The Backstory of Backpropagation - https://yuxi.ml/essays/posts/backstory-of-backpropagation/
dudu24 · 1h ago
It's just an application of the chain rule. It's not interesting to ask who invented it.
qarl · 1h ago
From the article:

Some ask: "Isn't backpropagation just the chain rule of Leibniz (1676) [LEI07-10] & L'Hopital (1696)?" No, it is the efficient way of applying the chain rule to big networks with differentiable nodes (see Sec. XII of [T22][DLH]). (There are also many inefficient ways of doing this.) It was not published until 1970 [BP1].

uoaei · 44m ago
The article says that, but it's overcomplicating to the point of being actually wrong. You could, I suppose, argue that the big innovation is the application of vectorization to the chain rule (by virtue of the matmul-based architecture of your usual feedforward network), which is a true combination of two mathematical technologies. But it feels like this, and indeed most "innovations" in ML, are only considered as such due to brainrot derived from trying to take maximal credit for minimal work (i.e., IP).