We lived in a rural area when I was a kid. My dad told me once that his buddy had to measure the ptarmigan[1] population in the mountains each year as part of his job.
He did this by hiking a fixed route and, at fixed intervals, scaring the birds so they would fly up and could be counted.
The total count was submitted to some office which used it to estimate the population.
One year he had to travel abroad when the counting had to be done, so he recruited a friend and explained in detail how to do it.
However, when the day of the counting arrived, his friend forgot. It was a huge hassle anyway, so he just submitted a number he figured was about right, and that was that.
Then one day the following year, the local newspaper ran a front-page headline stating "record increase in ptarmigan population".
The reason it was big news was that the population estimate was used to set the hunting quotas, something his friend had not considered...
[1]: https://en.wikipedia.org/wiki/Rock_ptarmigan
This is a really nicely written and illustrated post.
An advanced extension to this: there are algorithms that calculate the number of records to skip rather than doing a trial per record. This post has a good write-up of them: https://richardstartin.github.io/posts/reservoir-sampling
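The core trick in those variants: rather than one random draw per record, you draw a geometric skip length and jump straight to the next record that will enter the reservoir. A rough sketch of the idea, in the spirit of the "Algorithm L" family (the function name and structure are my own, not code from that write-up):

```python
import math
import random

def reservoir_sample_skip(stream, k):
    # Sample k items uniformly from an iterable of unknown length.
    # Instead of one random draw per record, draw geometric skip
    # lengths, so only O(k * log(n/k)) random numbers are needed.
    it = iter(stream)
    reservoir = []
    for item in it:                      # fill the reservoir first
        reservoir.append(item)
        if len(reservoir) == k:
            break
    if len(reservoir) < k:
        return reservoir                 # stream had fewer than k items

    w = math.exp(math.log(random.random()) / k)
    while True:
        # How many records can be skipped before the next replacement?
        skip = math.floor(math.log(random.random()) / math.log(1 - w))
        try:
            for _ in range(skip):
                next(it)                 # discard skipped records
            item = next(it)              # this record enters the reservoir
        except StopIteration:
            return reservoir
        reservoir[random.randrange(k)] = item
        w *= math.exp(math.log(random.random()) / k)
```

The payoff is that the cost of the random-number generation no longer scales with the stream length, only with the reservoir size.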
samwho · 1h ago
Hello! o/
I’m the author of this post. Happy to answer any questions, and love to get feedback.
The code for all of my posts can be found at https://github.com/samwho/visualisations and is MIT licensed, so you’re welcome to use it :)
Does this method compose with itself? E.g. if I implement reservoir sampling in my service and then the log collector service implements reservoir sampling, is the result the same as if only the log collector implemented it?
malwrar · 1h ago
Love your website’s design. I find the interactivity, the dog character as an “audience”, and even the font/color/layout wonderful. Loved the article too!
samwho · 1h ago
Thank you so much!
The dogs on the playing cards were commissioned just for this post. They’re all made by the wonderful https://www.andycarolan.com/.
The colour palette is the Wong palette that I learned about from https://davidmathlogic.com/colorblind/.
Oh, and you can pet the dogs. :)
As someone with deuteranopia, thank you for using a colour-blind friendly palette :) It would've been easy to just use green for the held card and red for the discard pile.
rdtsc · 1h ago
Well done, I really like the animations and the explanation, especially the case where it's a graph and we can drag ahead or click "shuffle 100 times".
One thing that threw me for a bit is when it switched from the intro of picking 3 cards at random from a deck of 10 or 436,234 to picking just one card. It seems as if it almost needs a section heading before "Now let me throw you a curveball: what if I were to show you 1 card at a time, and you had to pick 1 at random?" indicating that now we're switching to a simplifying assumption: we're holding only 1 card, not 3, but we also don't know the size of the deck.
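For reference, that single-card curveball boils down to just a few lines: keep the i-th card you are shown with probability 1/i. A minimal sketch (mine, not the article's code):

```python
import random

def pick_one(stream):
    # Hold one item at a time; when the i-th item arrives (1-indexed),
    # replace the held item with it with probability 1/i. Every item
    # in the stream ends up equally likely to be the one held at the end.
    held = None
    for i, item in enumerate(stream, start=1):
        if random.randrange(i) == 0:     # true with probability 1/i
            held = item
    return held
```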
pandaxtc · 51m ago
This is awesome, I loved the interactivity!
istjohn · 46m ago
Does your blog have an RSS feed?
stygiansonic · 1h ago
Great article and nice explanation. I believe this describes “Algorithm R” in this paper from Vitter, who was probably the first to describe it: https://www.cs.umd.edu/~samir/498/vitter.pdf
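Algorithm R itself is short: keep the first k items, then give the i-th item a k/i chance of replacing a uniformly random reservoir slot. A minimal sketch following the description in Vitter's paper (my own wording, not his Pascal-style pseudocode):

```python
import random

def algorithm_r(stream, k):
    # Keep the first k items. For the i-th item (1-indexed, i > k),
    # pick a random index in [0, i); if it falls inside the reservoir
    # (probability k/i), replace that slot with the new item.
    reservoir = []
    for i, item in enumerate(stream, start=1):
        if i <= k:
            reservoir.append(item)
        else:
            j = random.randrange(i)
            if j < k:
                reservoir[j] = item
    return reservoir
```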
hinkley · 42m ago
This reminds me that I need to spend more time thinking about the algorithm the Allies used to count German tanks by serial number. The people in the field estimated about 5x as many tanks as were actually produced, but the serial number trick was over 90% accurate.
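That serial-number trick is the "German tank problem". The textbook frequentist estimator is m + m/k - 1, where m is the largest serial observed and k is how many serials you have seen. A small worked sketch (the serial numbers below are made up for illustration):

```python
def estimate_total(serials):
    # Classic "German tank problem" estimator: given k serial numbers
    # sampled without replacement from 1..N, the minimum-variance
    # unbiased estimate of N is m + m/k - 1, where m is the largest
    # serial observed.
    k = len(serials)
    m = max(serials)
    return m + m / k - 1

# e.g. observing serials [19, 40, 42, 60] gives 60 + 60/4 - 1 = 74
print(estimate_total([19, 40, 42, 60]))
```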
This is a great post that also illustrates the tradeoffs inherent in telemetry collection (traces, logs, metrics) for analysis. It's a capital-H Hard space to operate in that a lot of developers either don't know about, or take for granted.
wood_spirit · 1h ago
I remember this turning up in a Google interview back in the day. The interviewer was really expecting me not to know the algorithm and to flounder about trying to solve the problem from first principles. It was fun to just shortcut things by knowing the answer that time.
owyn · 1h ago
Yeah, this was a Google interview question for me too. I didn't know the algorithm and floundered around trying to solve the problem. I came up with the 1/n and k/n selection strategy but still didn't get the job, lol. I think the guy who interviewed me was just killing time until lunch.
I like the visualizations in this article, really good explanation.
dekhn · 1h ago
I didn't know about the algorithm until after I got hired there. It's actually really useful in a number of contexts, but my favorite was using it to find optimal split points for sharding lexicographically sorted string keys for mapping. Often you will have a sorted table, but the underlying distribution of keys isn't known, so uniform sharding will often cause imbalances where some mappers end up doing far more work than others. I don't know if there is a convenient open source class to do this.
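One way to implement that idea, as a sketch under my own assumptions about the goal (boundaries so each shard gets roughly the same number of keys), not any particular internal implementation: reservoir-sample the key stream, sort the sample, and take evenly spaced quantiles of the sample as the split points.

```python
import random

def shard_boundaries(keys, num_shards, sample_size=10_000):
    # Reservoir-sample the key stream (Algorithm R), sort the sample,
    # and use evenly spaced quantiles of the sample as split points,
    # so shards stay balanced even when the key distribution is skewed.
    sample = []
    for i, key in enumerate(keys, start=1):
        if i <= sample_size:
            sample.append(key)
        else:
            j = random.randrange(i)
            if j < sample_size:
                sample[j] = key
    sample.sort()
    # num_shards - 1 split points at the sample's quantiles
    return [sample[len(sample) * s // num_shards] for s in range(1, num_shards)]
```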
wood_spirit · 41m ago
Interesting idea, I hadn’t thought about that way to apply it.
I knew it from before my interview, from a Turbo Pascal program I had seen that sampled DAT tape backups of patient records from a hospital system. These samples were used for studies. That was a textbook example of its utility.
dekhn · 18m ago
I guess the question in my mind is: would you expect a smart person who did not previously know this problem (or really much random sampling at all) to come up with the algorithm on the fly in an interview? And if the person had seen it before and memorized the answer, does that provide any signal of their ability to code?
pixelbeat · 32m ago
FWIW, GNU coreutils' shuf uses reservoir sampling for larger inputs to give bounded-memory operation.
foxbee · 48m ago
Wonderful illustrations and writing. Real interesting read.