Mmh, this is a bit sloppy. The derivative of a function f :: a -> b is a function Df :: a -> (a -o b), where the second, funny arrow indicates a linear function. I.e., the derivative Df takes a point in the domain and returns a linear approximation of f (the Jacobian) at that point. And it's always the Jacobian; it's just that when f is R -> R we conflate the Jacobian (a 1x1 matrix in this case) with the number inside of it.
ndriscoll · 1h ago
A perhaps nicer way to look at things[0] is to hold onto your base points explicitly and say Df :: a -> (b, a -o b), with Df(p) = (f(p), A(p)), where f(p+v) ≈ f(p) + A(p)v. Then you retain the information you need to define composition: (Dg∘Df)(p) = (Dg(Df(p)._1)._1, Dg(Df(p)._1)._2 ∘ Df(p)._2), i.e. the chain rule.
[0] which I learned from this talk: https://youtube.com/watch?v=17gfCTnw6uE
Yes! I love Conal Elliott's work. What you wrote is the compositional derivative, which augments the regular derivative by also returning the function's value at the point (otherwise composition won't work). For anyone interested, look up "The Simple Essence of Automatic Differentiation".
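For the curious, here is a toy Haskell sketch of that compositional derivative (my own illustration, not the paper's actual encoding; the paper works with genuine linear-map categories where I use plain functions):

    -- D a b pairs the value of a function at a point with a linear
    -- approximation of the function at that point (a plain function
    -- stands in for a genuine linear map here).
    newtype D a b = D (a -> (b, a -> b))

    -- Composition is exactly the chain rule: evaluate f at x,
    -- evaluate g at f's output, and compose the two linear maps.
    composeD :: D b c -> D a b -> D a c
    composeD (D g) (D f) = D $ \x ->
      let (y, f') = f x
          (z, g') = g y
       in (z, g' . f')

    -- Example: the derivative of x^2, written by hand.
    square :: D Double Double
    square = D (\x -> (x * x, \dx -> 2 * x * dx))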
sestep · 2h ago
A bit more advanced than this post, but for calculating Jacobians and Hessians, the Julia folks have done some cool work recently building on classical automatic differentiation research: https://iclr-blogposts.github.io/2025/blog/sparse-autodiff/
amelius · 1h ago
> (...) The derivative of w with respect to x. Another way of saying that is “If you added 1 to x before plugging it into the function, this is how much w would change”
Incorrect!
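To spell out the objection: the derivative is the instantaneous rate of change, and it matches the effect of a full unit step in x only when w is linear in x. A worked counterexample of my own (not from the post):

    \[
      w = x^2:\qquad \left.\frac{dw}{dx}\right|_{x=1} = 2,
      \qquad\text{but}\qquad w(2) - w(1) = 4 - 1 = 3 \neq 2.
    \]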
whatever1 · 2h ago
I can look around me and find the minimum of anything without tracing its surface and following the gradient. I can also immediately identify global minima instead of local ones.
We can all do it in 2-3D. But our algorithms don't. Even in 2D.
Sure, if I were blindfolded, feeling the surface and searching for a descent direction would be the way to go. But when I can see, I don't have to.
What are we missing?
ks2048 · 2h ago
When you look at a 2D surface, you directly observe all the values on that surface.
For a loss-function, the value at each point must be computed.
You can compute them all and "look at" the surface and just directly choose the lowest - that is called a grid search.
For high dimensions, there are just way too many "points" to compute.
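To make that concrete, a toy 1-D grid search in Haskell (my own sketch; the loss function and grid are made up):

    import Data.List (minimumBy)
    import Data.Ord (comparing)

    -- Brute-force "grid search": evaluate the loss at every grid
    -- point and keep the lowest value seen.
    gridSearch :: (Double -> Double) -> [Double] -> (Double, Double)
    gridSearch loss grid =
      minimumBy (comparing snd) [(x, loss x) | x <- grid]

    main :: IO ()
    main = print (gridSearch (\x -> (x - 3) ** 2) [-10, -9.9 .. 10])
    -- prints a pair near (3.0, 0.0); in d dimensions the grid has
    -- r^d points, which is what kills this approach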
samsartor · 1h ago
And remember, optimization problems can be _incredibly_ high-dimensional. A 7B-parameter LLM is a 7-billion-dimensional optimization landscape. A grid search with a resolution of 10 (i.e. 10 samples for each dimension) would require evaluating the loss function 10^(7*10^9) times. That is, the number of evaluations is a number with 7B digits.
Chinjut · 2h ago
You're thinking of situations where you are able to see a whole object at once. If you were dealing with an object too large to see all of, you'd have to start making decisions about how to explore it.
3eb7988a1663 · 20m ago
The mental image I like: imagine you are lost in a hilly region with incredibly dense fog such that you can only see one foot directly in front of you. How do you find the base of the valley?
Gradient descent: take a step in the steepest downward direction. Look around and repeat. When you reach a level area, how do you know you are at the lowest point?
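That fog walk is essentially the whole algorithm. A minimal 1-D sketch in Haskell (my own; the derivative is supplied by hand and the step size is arbitrary):

    -- Naive 1-D gradient descent: from x0, repeatedly take a small
    -- step against the derivative f', i.e. downhill.
    descend :: Double -> (Double -> Double) -> Double -> Int -> Double
    descend lr f' x0 n = iterate step x0 !! n
      where step x = x - lr * f' x

    main :: IO ()
    main = print (descend 0.1 (\x -> 2 * (x - 3)) 0 100)
    -- walks toward x = 3, the minimum of (x - 3)^2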
jpeloquin · 1h ago
Evaluating a function on a densely spaced grid and plotting it does work; this is brute-force search. You will see the global minimum immediately, in the way you describe, provided your grid is dense enough to capture all local variation.
It's just that when the function is implemented on the computer, evaluating so many points takes a long time, and using a more sophisticated optimization algorithm that exploits information like the gradient is almost always faster. In physical reality all the points already exist, so if they can be observed cheaply the brute force approach works well.
Edit: Your question was good. Asking superficially-naive questions like that is often a fruitful starting point for coming up with new tricks to solve seemingly-intractable problems.
whatever1 · 21m ago
Thanks!
It does feel to me that we do some sort of sampling; it's definitely not a naive grid search.
Also, I find it easier to find the minima in specific directions (up, down, left, right) than in, say, a 42-degree one. So some sort of priors are probably being used to improve sample efficiency.
GuB-42 · 1h ago
Your eyes compute gradients, as part of the shitton of visual processing your brain does to get an estimate of where the local and global minima are.
It is not perfect, though; see the many optical illusions.
But we follow gradients all the time, consciously or not. You know you are at the bottom of the hole when all the paths go up for instance.
i_am_proteus · 2h ago
Without looking up the answer (because someone has already computed this for you), how would you find the highest geographic point (highest elevation) in your country?
cinntaile · 1h ago
What if you're trying to find the minimum of something that you can't see? Or what if the differences are so small that you can't perceive them with your eyes even though you can see?
adrianN · 2h ago
The inputs you can process visually are of trivial size even for naive algorithms, and are probably also easy instances. I certainly can't find global minima in 2D for even a slightly adversarial function.
hackinthebochs · 2h ago
You're ignoring all the computation that goes on unconsciously to produce your conscious experience of "immediately" apprehending the global minimum.
fancyfredbot · 2h ago
Your visual cortex is a massively parallel processor.
pestatije · 2h ago
Touch and sight sense essentially the same thing... the difference is in the magnitudes involved.
flufluflufluffy · 2h ago
Fantastic post! As short as it needs to be while still communicating its points effectively. I love walking up the generalization levels in math.