Important machine learning equations
272 points by sebg on 8/28/2025, 11:38:44 AM | 27 comments | chizkidd.github.io ↗
relative_entropy(p, q) = cross_entropy(p, q) - entropy(p)
Information theory is some of the most accessible and approachable math for ML practitioners, and it shows up everywhere. In my experience, it's worthwhile to dig into the foundations as opposed to just memorizing the formulas.
(bits assume base 2 here)
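To make that concrete, here's a minimal NumPy sketch of the identity quoted above (the example distributions are made up; all logs base 2, so units are bits):

  import numpy as np

  def entropy(p):
      # H(p) = -sum_x p(x) * log2 p(x)
      p = np.asarray(p, dtype=float)
      return -np.sum(p * np.log2(p))

  def cross_entropy(p, q):
      # H(p, q) = -sum_x p(x) * log2 q(x)
      p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
      return -np.sum(p * np.log2(q))

  def relative_entropy(p, q):
      # KL(p || q) = sum_x p(x) * log2(p(x) / q(x))
      p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
      return np.sum(p * np.log2(p / q))

  p, q = [0.7, 0.2, 0.1], [0.5, 0.25, 0.25]
  assert np.isclose(relative_entropy(p, q), cross_entropy(p, q) - entropy(p))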
I think Shannon's Mathematical Theory of Communication is so incredibly well written and accessible that anyone interested in information theory should just start with the real foundational work rather than lists of equations; it really is worth the time to dig into it.
Better would be e.g.
Also, the cross entropy code doesn't match the equation. And, as explained in the comment below the post, Ax+b is not a linear operation but an affine one (because of the +b; quick check below). Overall it seems like an imprecise post to me. Not bad, but not stringent enough to serve as a reference.
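For the affine point, a quick toy check (my own numbers, not from the post): a linear map must satisfy f(x + y) = f(x) + f(y), and the +b breaks exactly that.

  import numpy as np

  A = np.array([[1.0, 2.0], [3.0, 4.0]])
  b = np.array([1.0, 1.0])
  f = lambda x: A @ x + b        # affine, not linear

  x, y = np.array([1.0, 0.0]), np.array([0.0, 1.0])
  print(f(x + y))                # [4. 8.]
  print(f(x) + f(y))             # [5. 9.]  -> b gets counted twice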
"Backpropagation, often referred to as “backward propagation of errors,” is the cornerstone of training deep neural networks. It is a supervised learning algorithm that optimizes the weights and biases of a neural network to minimize the error between predicted and actual outputs.."
https://chizkidd.github.io/2025/05/30/backpropagation/
backpropagation is a supervised machine learning algorithm, pardon?
I guess giving the (mathematically) simple principle of computing a gradient with the chain rule the fancy name "backpropagation" comes from the early days of AI, when computers were much less powerful and this seemed less obvious?
My comment was about "confusing backpropagation with gradient descent (or any optimizer)."
For me the connection is pretty clear? The core issue is confusing backprop with minimization. The cited article mentioning supervised learning specifically doesn't take away from that.
"Backprop is just a way to compute the gradients of the weights with respect to the cost function, not an algorithm to minimize the cost function wrt. the weights."
What does the word supervised mean? It's when you define a cost function to be the difference between the training data and the model output.
Aka something like (f(x) - y)^2, which is simply the squared difference between the model's output for an input x from the training data and the corresponding label y.
A learning algorithm is an algorithm that produces a model given a cost function, and in the case of supervised learning the cost function is parameterized with the training data.
The most common way to learn a model is with an optimization algorithm, and there are many to choose from. One of the simplest for unconstrained non-linear optimization is stochastic gradient descent.
It's popular because it is a first-order method. First-order methods only use the first partial derivatives, known collectively as the gradient, whose size equals the number of parameters. Second-order methods converge faster, but they need the Hessian, whose size scales with the square of the number of parameters being optimized.
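(To put rough numbers on that: a model with 10^6 parameters has a gradient with 10^6 entries but a Hessian with 10^12 entries, which is why first-order methods dominate in deep learning.)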
How do you calculate the gradient? Either you calculate each partial derivative individually, or you use the chain rule and work backwards to calculate the complete gradient.
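To make the split concrete, a toy sketch (my own example: a two-parameter model with squared error). The backward pass is backpropagation; the weight update that consumes the gradient is the separate optimization step.

  import numpy as np

  sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

  def grads(w1, w2, x, y):
      # Forward pass: f(x) = w2 * sigmoid(w1 * x), cost C = (f(x) - y)^2
      z = w1 * x
      h = sigmoid(z)
      f = w2 * h
      # Backward pass (backpropagation): chain rule from the cost back to the weights
      dC_df = 2.0 * (f - y)
      dC_dw2 = dC_df * h
      dC_dh = dC_df * w2
      dC_dz = dC_dh * h * (1.0 - h)    # sigmoid'(z) = h * (1 - h)
      dC_dw1 = dC_dz * x
      return dC_dw1, dC_dw2

  # Stochastic gradient descent is the optimizer that uses those gradients
  w1, w2, lr = 0.5, -0.3, 0.1
  for x, y in [(1.0, 0.2), (2.0, 0.4)]:      # toy "training data"
      g1, g2 = grads(w1, w2, x, y)           # backprop: compute the gradient
      w1, w2 = w1 - lr * g1, w2 - lr * g2    # gradient descent: minimize the cost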
I hope this made it clear that your question is exactly backwards. The referenced blog post is about backpropagation and unnecessarily mentions supervised learning when it shouldn't have; you're the one now sticking with supervised learning, even though the comment you're responding to explained exactly why it is inappropriate to call backpropagation a supervised learning algorithm.
It makes me sad to see LLM slop on the front page.
If you don't already know these things on some level, the post doesn't give you much (far from 95%); it's a brief reference for some of the formulas used in machine learning/AI.
A lot of LLM chat models have a very particular voice and style they use by default, especially in these longer form "Sure, I can help you write a blog article about X!" type responses. Some pieces of writing just scream "ChatGPT wrote this", even if they don't include em-dashes, hah!
Kace's response is absolutely right that the summaries tend to be where the big giveaways are.
There is also something about the way they use "you" and the article itself... e.g. the "you now have a comprehensive resource to understand and apply ML math. Point anyone asking about core ML math here..." bit. This isn't something you would really expect to read in a human-written article. It's a chatbot presenting its work to "you", the single user it's conversing with, not an author addressing their readers. Even if you ask the bot to write you an article for a blog, a lot of the time its response tends to mix in these chatty bits that address the user or directly reference the user's questions/prompts in some way, which can be really jarring when transferred to a different medium w/o some editing.
- bold-face item headers (e.g. “Practical Significance:”)
- lists of complex descriptors in non-technical parts of the writing (“With theoretical explanations, practical implementations, and visualizations”)
- the cheery, optimistic note that underlines a goal plausibly derived from a prompt (e.g. “Let’s dive into the equations that power this fascinating field!”)
What makes me angry about LLM slop is imagining how this looks to a student learning this stuff. Putting a post like this on your personal blog is implicitly saying: as long as you know some "equations" and remember the keywords, a language model can do the rest of the thinking for you! It's encouraging people to forgo learning.
Having said that, let me raise some objections:
1. Omitting the multi-layer perceptron is a major oversight. We have backpropagation here, but not forward propagation, so to speak (a quick sketch of the forward pass follows this list).
2. Omitting kernel machines is a moderate oversight. I know they're not "hot" anymore but they are very mathematically important to the field.
3. The equation for forward diffusion is really boring... it's not that important that you can take structured data and add noise incrementally until it's all noise. What's important is that, in some sense, you can (conditionally) reverse it. In other words, you should include the reverse diffusion equation, which is of course considerably more sophisticated.
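Re point 1, a generic sketch of the forward pass the post skips (layer sizes and the activation are arbitrary choices of mine, not from the post):

  import numpy as np

  def mlp_forward(x, layers):
      # Forward propagation through a multi-layer perceptron.
      # layers is a list of (W, b) pairs; each hidden layer applies the affine
      # map W @ h + b followed by tanh, and the last layer returns the raw affine output.
      h = x
      for i, (W, b) in enumerate(layers):
          z = W @ h + b
          h = z if i == len(layers) - 1 else np.tanh(z)
      return h

  # Toy usage: a 3 -> 4 -> 2 network with random weights
  rng = np.random.default_rng(0)
  layers = [(rng.normal(size=(4, 3)), np.zeros(4)),
            (rng.normal(size=(2, 4)), np.zeros(2))]
  print(mlp_forward(np.array([1.0, 2.0, 3.0]), layers))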
Other fitness measures take much longer to converge or are very unreliable in the way they bootstrap. MSE can start from a dead cold nothing, thread the needle through 20 hidden layers, and still give you a workable gradient in a short amount of time.
That book is pretty much what it says on the cover, but it can be useful as a reference given its pretty thorough coverage. Though, in all honesty, I mostly purchased it due to the outrageous title.
0. https://link.springer.com/book/10.1007/978-3-642-32931-9