Important machine learning equations

272 points by sebg | 27 comments | 8/28/2025, 11:38:44 AM | chizkidd.github.io

Comments (27)

dkislyuk · 8h ago
Presenting information theory as a series of independent equations like this does a disservice to the learning process. Cross-entropy and KL-divergence are directly derived from information entropy, where InformationEntropy(P) represents the baseline number of bits needed to encode events from the true distribution P, CrossEntropy(P, Q) represents the (average) number of bits needed for encoding P with a suboptimal distribution Q, and KL-Divergence (better referred to as relative entropy) is the difference between these two values (how many more bits are needed to encode P with Q, i.e. quantifying the inefficiency):

relative_entropy(p, q) = cross_entropy(p, q) - entropy(p)

Information theory is some of the most accessible and approachable math for ML practitioners, and it shows up everywhere. In my experience, it's worthwhile to dig into the foundations as opposed to just memorizing the formulas.

(bits assume base 2 here)
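
As a quick numerical sanity check of that identity (a sketch, assuming discrete distributions p and q over the same support and base-2 logs):

    import numpy as np

    def entropy(p):
        p = p[p > 0]
        return -np.sum(p * np.log2(p))

    def cross_entropy(p, q):
        mask = p > 0
        return -np.sum(p[mask] * np.log2(q[mask]))

    def relative_entropy(p, q):
        mask = p > 0
        return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

    p = np.array([0.5, 0.25, 0.25])
    q = np.array([0.25, 0.25, 0.5])
    # KL(p || q) == H(p, q) - H(p), up to floating-point error
    assert np.isclose(relative_entropy(p, q), cross_entropy(p, q) - entropy(p))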

morleytj · 6h ago
I 100% agree.

I think Shannon's Mathematical Theory of Communication is so incredibly well written and accessible that anyone interested in information theory should just start with the real foundational work rather than lists of equations. It really is worth the time to dig into it.

golddust-gecko · 5h ago
Agree 100% with this. It gives the illusion of understanding, like when a precocious 6 year old learns the word "precocious" and feels smart because they can say it. Or any movie that handles tech or science with <technical speak>.
cl3misch · 11h ago
In the entropy implementation:

    return -np.sum(p * np.log(p, where=p > 0))
Using `where` in ufuncs like log results in the output being uninitialized (undefined) at the locations where the condition is not met. Summing over that array will return incorrect results for sure.

Better would be e.g.

    return -np.sum((p * np.log(p))[p > 0])
Also, the cross entropy code doesn't match the equation. And, as explained in the comment below the post, Ax+b is not a linear operation but affine (because of the +b).

Overall it seems like an imprecise post to me. Not bad, but not rigorous enough to serve as a reference.
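
For reference, a cross-entropy implementation that actually matches the usual equation H(p, q) = -Σ p·log q could look like this (just a sketch, with a small epsilon guarding against log(0)):

    import numpy as np

    def cross_entropy(p, q, eps=1e-12):
        # H(p, q) = -sum_i p_i * log(q_i); clip q away from zero to avoid log(0)
        q = np.clip(q, eps, 1.0)
        return -np.sum(p * np.log(q))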

jpcompartir · 11h ago
I would echo some caution about using this as a reference, as in another blog post the writer states:

"Backpropagation, often referred to as “backward propagation of errors,” is the cornerstone of training deep neural networks. It is a supervised learning algorithm that optimizes the weights and biases of a neural network to minimize the error between predicted and actual outputs.."

https://chizkidd.github.io/2025/05/30/backpropagation/

backpropagation is a supervised machine learning algorithm, pardon?

cl3misch · 11h ago
I actually see this a lot: confusing backpropagation with gradient descent (or any optimizer). Backprop is just a way to compute the gradients of the weights with respect to the cost function, not an algorithm to minimize the cost function wrt. the weights.

I guess giving the (mathematically) simple principle of computing a gradient with the chain rule the fancy name "backpropagation" comes from the early days of AI, when computers were much less powerful and this seemed less obvious?
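
To make that separation concrete, here's a minimal sketch for a toy two-layer network with squared error (all names here are made up for illustration): backprop produces the gradients, and the update rule is a separate choice.

    import numpy as np

    def backprop(x, y, W1, W2):
        # forward pass
        h = np.tanh(W1 @ x)
        y_hat = W2 @ h
        # backward pass: pure chain rule, no optimization happening here
        d_yhat = 2 * (y_hat - y)               # dL/dy_hat for L = sum((y_hat - y)^2)
        dW2 = np.outer(d_yhat, h)
        dh = W2.T @ d_yhat
        dW1 = np.outer(dh * (1 - h**2), x)     # tanh'(z) = 1 - tanh(z)^2
        return dW1, dW2

    # the optimizer (plain gradient descent here) is a separate step
    def gd_step(W, dW, lr=1e-2):
        return W - lr * dW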

imtringued · 10h ago
The German Wikipedia article makes the same mistake and it is quite infuriating.
cubefox · 10h ago
What does this comment have to do with the previous comment, which talked about supervised learning?
cl3misch · 9h ago
The previous comment highlights an example where backprop is confused with "a supervised learning algorithm".

My comment was about "confusing backpropagation with gradient descent (or any optimizer)."

For me the connection is pretty clear? The core issue is confusing backprop with minimization. The cited article mentioning supervised learning specifically doesn't take away from that.

imtringued · 9h ago
Reread the comment

"Backprop is just a way to compute the gradients of the weights with respect to the cost function, not an algorithm to minimize the cost function wrt. the weights."

What does the word supervised mean? It's when you define a cost function to be the difference between the training data and the model output.

Aka something like (f(x)-y)^2, which is simply the squared difference between the model's output for an input x from the training data and the corresponding label y.

A learning algorithm is an algorithm that produces a model given a cost function and in the case of supervised learning, the cost function is parameterized with the training data.

The most common way to learn a model is to use an optimization algorithm. There are many optimization algorithms that can be used for this. One of the simplest algorithms for the optimization of unconstrained non-linear functions is stochastic gradient descent.

It's popular because it is a first order method. First order methods only use the first partial derivatives, known collectively as the gradient, whose size is equal to the number of parameters. Second order methods converge faster, but they need the Hessian, whose size scales with the square of the number of parameters being optimized.

How do you calculate the gradient? Either you calculate each partial derivative individually, or you use the chain rule and work backwards to calculate the complete gradient.

I hope this made it clear that your question is exactly backwards. The referenced blog post is about backpropagation and unnecessarily mentions supervised learning when it shouldn't have. You're the one now sticking with supervised learning, even though the comment you're responding to told you exactly why it is inappropriate to call backpropagation a supervised learning algorithm.
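
As a minimal sketch of that whole chain for a linear model f(x) = w·x with squared loss (the gradient here is simple enough to write by hand):

    import numpy as np

    def sgd_step(w, x, y, lr=1e-2):
        # supervised cost for one sample: (f(x) - y)^2 with f(x) = w . x
        error = w @ x - y
        grad = 2 * error * x     # gradient of the squared error wrt w
        return w - lr * grad     # first-order update: a single SGD step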

cgadski · 10h ago
> This blog post has explored the most critical equations in machine learning, from foundational probability and linear algebra to advanced concepts like diffusion and attention. With theoretical explanations, practical implementations, and visualizations, you now have a comprehensive resource to understand and apply ML math. Point anyone asking about core ML math here—they’ll learn 95% of what they need in one place!

It makes me sad to see LLM slop on the front page.

maerch · 10h ago
Apart from the “—“, what else gives it away? Just asking from a non-native perspective.
Romario77 · 10h ago
It's just too bombastic for what it is - listing some equations with brief explanation and implementation.

If you don't know these things on some level already, the post doesn't give you too much (far from 95%); it's a brief reference for some of the formulas used in machine learning/AI.

random3 · 5h ago
Slop brings back memories of literature teachers red-marking my "bombastic" terms in primary school essays
TFortunato · 10h ago
This is probably not going to be a very helpful answer, but I sort of think of it this way: you probably have favorite authors or artists (or maybe some you really dislike!) whose work you could look at, even if it's new to you, and immediately recognize their voice & style.

A lot of LLM chat models have a very particular voice and style they use by default, especially in these longer form "Sure, I can help you write a blog article about X!" type responses. Some pieces of writing just scream "ChatGPT wrote this", even if they don't include em-dashes, hah!

TFortunato · 10h ago
OK, on reflection, there are a few things:

Kace's response is absolutely right that the summaries tend to be a big giveaway.

There is also something about the way they use "you" and the article itself... E.g. the "you now have a comprehensive resource to understand and apply ML math. Point anyone asking about core ML math here..." bit. This isn't something you would really expect to read in a human-written article. It's a chatbot presenting its work to "you", the single user it's conversing with, not an author addressing their readers. Even if you ask the bot to write you an article for a blog, a lot of the time its response tends to mix in these chatty bits that address the user or refer directly to the user's questions / prompts in some way, which can be really jarring when transferred to a different medium w/o some editing.

kace91 · 10h ago
Not op, but it is very clearly the final summary telling the user that the post they asked the AI to write is now created.
nxobject · 4h ago
Three things come to mind:

- bold-face item headers (e.g. “Practical Significance:”)

- lists of complex descriptors in the non-technical parts of the writing (“With theoretical explanations, practical implementations, and visualizations”)

- the cheery, optimistic note that underlines a goal plausibly derived from a prompt (e.g. “Let’s dive into the equations that power this fascinating field!”)

cgadski · 10h ago
It's not really about the language. If someone doesn't speak English well and wants to use a model to translate their writing, that's cool. What I'm picking up on is the dishonesty and vapidness. The article _doesn't_ explore linear algebra, it _doesn't_ have visualizations, it's _not_ a comprehensive resource, and reading this won't teach you anything beyond keywords and formulas.

What makes me angry about LLM slop is imagining how this looks to a student learning this stuff. Putting a post like this on your personal blog is implicitly saying: as long as you know some "equations" and remember the keywords, a language model can do the rest of the thinking for you! It's encouraging people to forgo learning.

dawnofdusk · 9h ago
I have some minor complaints but overall I think this is great! My background is in physics, and I remember finally understanding every equation on the formula sheet given to us for exams... that really felt like I finally understood a lot of physics. There's great value in being comprehensive so that a learner can choose themselves to dive deeper, and for those with more experience to check their own knowledge.

Having said that, let me raise some objections:

1. Omitting the multi-layer perceptron is a major oversight. We have backpropagation here, but not forward propagation, so to speak.

2. Omitting kernel machines is a moderate oversight. I know they're not "hot" anymore but they are very mathematically important to the field.

3. The equation for forward diffusion is really boring... it's not that important that you can take structured data and add noise incrementally until it's all noise. What's important is that in some sense you can (conditionally) reverse it. In other words, you should include the reverse diffusion equation instead, which of course is considerably more sophisticated.
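
For reference, the DDPM-style reverse step is usually written roughly as follows (notation varies across papers, so take this as a sketch):

    p_θ(x_{t−1} | x_t) = N(x_{t−1}; μ_θ(x_t, t), σ_t² I)
    μ_θ(x_t, t) = (1/√α_t) · (x_t − (β_t / √(1 − ᾱ_t)) · ε_θ(x_t, t))

where ε_θ is the learned noise predictor.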

TrackerFF · 2h ago
Kind of weird not to see β̂ = (XᵀX)⁻¹Xᵀy
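
In NumPy that's essentially a one-liner, though in practice a least-squares solver is preferred over forming XᵀX explicitly (a sketch with made-up data, assuming X has full column rank):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3))            # design matrix
    y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

    # normal-equation estimate; np.linalg.lstsq(X, y) is the more stable route
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)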
bob1029 · 11h ago
MSE remains my favorite distance measure by a long shot. Its quadratic nature still helps even in non-linear problem spaces where convexity is no longer guaranteed. When working with generic/raw binary data, where Hamming distance would be theoretically a better fit, I still prefer MSE over byte-level values because of this property.

Other fitness measures take much longer to converge or are very unreliable in the way in which they bootstrap. MSE can start from a dead cold nothing on threading the needle through 20 hidden layers and still give you a workable gradient in a short period of time.

bee_rider · 11h ago
Are eigenvalues or singular values used much in the popular recent stuff, like LLMs?
calebkaiser · 10h ago
LoRA uses singular value decomposition to get the low-rank matrices. In different optimizers, you'll also see eigendecomposition or some approximation of it used (I think Shampoo does something like this, but it's been a while).
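
For a concrete picture of the SVD angle, here's a sketch of a truncated-SVD rank-r factorization of a weight matrix (not claiming this is exactly what any particular LoRA implementation does):

    import numpy as np

    def low_rank_factors(W, r):
        # truncated SVD: the best rank-r approximation of W in the Frobenius norm
        U, S, Vt = np.linalg.svd(W, full_matrices=False)
        A = U[:, :r] * S[:r]      # shape (m, r)
        B = Vt[:r, :]             # shape (r, n)
        return A, B               # A @ B ≈ W

    W = np.random.default_rng(0).normal(size=(64, 32))
    A, B = low_rank_factors(W, r=4)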
roadside_picnic · 6h ago
While this very much looks like AI slop, it does remind me of a wonderful little book (which has many more equations): Formulas Useful for Linear Regression Analysis and Related Matrix Theory - It's Only Formulas But We Like Them [0]

That book is pretty much what it says on the cover, but it can be useful as a reference given its pretty thorough coverage. Though, in all honesty, I mostly purchased it due to the outrageous title.

0. https://link.springer.com/book/10.1007/978-3-642-32931-9

nxobject · 4h ago
Finally, a handy reference to more matrix decompositions and normal/canonical forms than I ever realized I wanted to know!
0wis · 9h ago
I'm currently improving my foundations in data preparation for ML, and this short, to-the-point article is a gem.