We lived in a rural area when I was a kid. My dad told me once that his buddy had to measure the ptarmigan[1] population in the mountains each year as part of his job.
He did this by hiking a fixed route and, at fixed intervals, scaring the birds into flight so he could count them.
The total count was submitted to some office which used it to estimate the population.
One year he had to travel abroad when the counting had to be done, so he recruited a friend and explained in detail how to do it.
However when the day of the counting arrived his friend forgot, and it was a huge hassle anyway so he just submitted a number he figured was about right, and that was that.
Then one day the following year, the local newspaper had a frontpage headline stating "record increase in ptarmigan population".
The reason it was big news was that the population estimate was used to set the hunting quotas, something his friend had not considered...
Another interesting direction you can take reservoir sampling is instead of drawing a random number for each item (to see whether it replaces an existing item and which one), you generate a number from a geometric distribution telling you how many items you can safely skip before the next replacement.
That's especially interesting, if you can skip many items cheaply. Eg because you can fast forward on your tape drive (but you don't know up front how long your tape is), or because you send almost your whole system to sleep during skips.
For n items to sample from, this system does about O(k * log (n/k)) samples and skips.
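To make the skipping idea concrete, here is a minimal sketch of one well-known skip-based variant, usually called "Algorithm L". The comment above does not name a specific algorithm, so treat this as one possible instance; the function name and the `rng` parameter are made up for illustration.

```python
import math
import random
from itertools import islice

_END = object()  # sentinel marking "the stream ran out"


def reservoir_sample_with_skips(stream, k, rng=random):
    """Uniform sample of k items from an iterable of unknown length."""
    it = iter(stream)
    reservoir = list(islice(it, k))  # the first k items fill the reservoir
    if len(reservoir) < k:
        return reservoir  # stream had fewer than k items

    def open_unit():
        # Uniform float in the open interval (0, 1), so log() is always finite.
        u = rng.random()
        while u == 0.0:
            u = rng.random()
        return u

    # w tracks the acceptance threshold; clamp it just below 1 to avoid log(0).
    w = min(math.exp(math.log(open_unit()) / k), 1.0 - 2.0 ** -53)
    while True:
        # Geometric draw: how many items can be skipped before the next replacement.
        skip = math.floor(math.log(open_unit()) / math.log1p(-w))
        item = next(islice(it, skip, skip + 1), _END)  # drop `skip` items, take the next
        if item is _END:
            return reservoir
        reservoir[rng.randrange(k)] = item  # replace a uniformly chosen slot
        w *= math.exp(math.log(open_unit()) / k)
```

Here the skip is done with `islice`, which still reads every item; the payoff comes when the source offers a cheaper way to jump ahead (a seek, a fast-forward, a sleep).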
Conceptually, I prefer the version of reservoir sampling that has you generate a fixed random 'priority' for each card as it arrives, and then you keep the top k items by priority around. That brings me to another related interesting algorithmic problem: selecting the top k items out of a stream of elements of unknown length in O(n) time and O(k) space. Naive approaches to reaching O(k) space will give you O(n log k) time, eg if you keep a min heap around.
What you can do instead is keep an unordered buffer of capacity up to 2k. As each item arrives, you add it to the buffer. When your buffer is full, you prune it to the top k elements in O(k) with eg randomised quickselect or via median-of-medians. You do that O(2k) work every k elements for n elements total, giving you the required O(n) = O(n * 2*k / k) runtime.
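A rough sketch of that buffer-and-prune scheme, assuming items are mutually comparable and k is at least 1. The names `top_k_of` and `top_k_stream` are made up for this example, and the pruning step uses a simple list-based randomised quickselect (expected linear time, not worst-case).

```python
import random


def top_k_of(items, k, rng=random):
    """Return the k largest items (in no particular order) in expected O(len(items)) time."""
    kept = []
    while items:
        need = k - len(kept)
        pivot = rng.choice(items)
        greater = [x for x in items if x > pivot]
        equal = [x for x in items if x == pivot]
        if len(greater) == need:
            return kept + greater
        if len(greater) > need:
            items = greater  # everything still needed lies above the pivot
        elif len(greater) + len(equal) >= need:
            return kept + greater + equal[: need - len(greater)]
        else:
            kept += greater + equal  # all of these belong to the answer
            items = [x for x in items if x < pivot]
    return kept  # fewer than k items were available


def top_k_stream(stream, k, rng=random):
    """Top k of a stream of unknown length using O(k) space."""
    buf = []
    for x in stream:
        buf.append(x)
        if len(buf) >= 2 * k:  # buffer full: prune back down to k items
            buf = top_k_of(buf, k, rng)
    return top_k_of(buf, k, rng)
```

Each prune touches about 2k items and happens once per k arrivals, which is where the amortised O(1) per item comes from.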
I actually read that post on the alias method just the other day and was blown away. I think I’d like to try making a post on it. Wouldn’t be able to add anything that link hasn’t already said, but I think I can make it more accessible.
fiddlerwoaroof · 4h ago
Does this method compose with itself? E.g. if I implement reservoir sampling in my service and then the log collector service implements reservoir sampling, is the result the same as if only the log collector implemented it?
NoahZuniga · 2h ago
Yes
samwho · 2h ago
I hadn’t considered this, cool to know it works!
eru · 1h ago
Though I think it's only strictly true, if the intervals you sample over are the same. Eg they both sample some messages every second, and they all start their second-long intervals on the same nanosecond (or close enough).
I find it easier to reason about reservoir sampling in an alternative formulation: the article talks about flipping a random (biased) coin for each arrival. Instead we can re-interpret reservoir sampling as assigning a random priority to each item, and then keeping the items with the top k priority.
It's fairly easy to see in this reformulation whether specific combinations of algorithms would compose: you only need to think about whether they would still select the top k items by priority.
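A small sketch of why the priority view makes composition easy to check. It assumes the priority assigned at the source travels with each item and is never regenerated downstream; `tag_with_priority` and `top_k_by_priority` are illustrative names, not anything from the article.

```python
import heapq
import random


def tag_with_priority(items, rng=random):
    """Attach a fixed random priority to every item as it arrives."""
    return [(rng.random(), item) for item in items]


def top_k_by_priority(tagged, k):
    """Keep the k tagged items with the highest priority."""
    return heapq.nlargest(k, tagged, key=lambda pair: pair[0])


rng = random.Random(7)
k = 5
service_a = tag_with_priority(range(0, 50), rng)
service_b = tag_with_priority(range(50, 100), rng)

# Each service samples locally, then the collector samples what it receives.
composed = top_k_by_priority(
    top_k_by_priority(service_a, k) + top_k_by_priority(service_b, k), k
)
# The collector sees everything and samples once.
direct = top_k_by_priority(service_a + service_b, k)

print(composed == direct)
```

Any item in the global top k is necessarily in the top k of its own service, so the two stages pick exactly the same items as a single stage would, as long as both stages cover the same window of items.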
malwrar · 5h ago
Love your website’s design, I find all of the interactivity, the dog character as an “audience”, and even the font/color/layout wonderful. Loved the article too!
samwho · 5h ago
Thank you so much!
The dogs on the playing cards were commissioned just for this post. They’re all made by the wonderful https://www.andycarolan.com/.
Just noticed the physics simulator at the top is interactive. Then I was stacking squares on top of each other to see how tall I could make it, and started throwing things at it angry birds style. Fun stuff.
samwho · 2h ago
Something no one seems to have realised yet is that the hero simulation at the top of the page is using reservoir sampling to colour 3 of the shapes black.
lol768 · 3h ago
It would've been easy to just use green for the held card and red for the discard pile.
Thank you for using a colour-blind friendly palette; as someone with deuteranopia :)
samwho · 2h ago
You're welcome! I think it's a beautiful palette, and I think people have come to associate me with it now so I don't think I'll ever change.
I view all of my posts using the various colour blindness filters in the Chrome dev tools during development, to make sure I'm not using any ambiguous pairings. I'm glad that effort made you feel welcome and able to enjoy the content fully.
nightpool · 3h ago
I loved the graphics!
However, I'm not sure I understand the statistical soundness of this approach. I get that every log during a given period has the same chance to be included, but doesn't that mean that logs that happen during "slow periods" are disproportionately overrepresented in overall metrics?
For example, if I want to optimize my code, and I need to know which endpoints are using the most time across the entire fleet to optimize my total costs (CPU-seconds or whatever), this would be an inappropriate method to use, since endpoints that get bursty traffic would be disproportionally underrepresented compared to endpoints that get steady constant traffic. So I'd end up wasting my time working on endpoints that don't actually get a lot of traffic.
Or if I'm trying to plan capacity for different services, and I want to know how many nodes to be running for each service, services that get bursty traffic would be underrepresented as well, correct?
What are the use-cases that reservoir sampling are good for? What kind of statistical analysis can you do on the data that's returned by it?
eru · 1h ago
> However, I'm not sure I understand the statistical soundness of this approach. I get that every log during a given period has the same chance to be included, but doesn't that mean that logs that happen during "slow periods" are disproportionately overrepresented in overall metrics?
Yes, of course.
You can fix this problem, however. There are (at least) two ways:
You can do an alternative interpretation and implementation of reservoir sampling: for each item you generate and store a random priority as it comes into the system. For each interval (eg each second) you keep the top k items by priority. If you want to aggregate multiple intervals, you keep the top k (or less) items over the intervals.
This will automatically treat all items the same, whether they arrived during busy or non-busy periods.
An alternative view of the same approach doesn't store any priorities, but stores the number of dropped items each interval. You can then do some arithmetic to tell you how to combine samples from different intervals; very similar to what's in the article.
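The arithmetic for the count-based view could look roughly like the sketch below. It assumes each interval hands over a uniform sample of min(k, seen) items together with `seen`, the number of items it looked at (kept plus dropped); `merge_samples` is a hypothetical helper, not code from the article.

```python
import random


def merge_samples(sample_a, seen_a, sample_b, seen_b, k, rng=random):
    """Merge two per-interval uniform samples into one uniform sample of up to k items."""
    a, b = list(sample_a), list(sample_b)
    merged = []
    while len(merged) < k and (a or b):
        # Choose the source interval in proportion to how many of its
        # original items have not yet been accounted for.
        if rng.random() * (seen_a + seen_b) < seen_a:
            source, seen_a = a, seen_a - 1
        else:
            source, seen_b = b, seen_b - 1
        # Take a uniformly chosen item from that interval's sample.
        merged.append(source.pop(rng.randrange(len(source))))
    return merged
```

With this weighting, an item from a busy interval and an item from a quiet one end up with the same chance of being in the merged sample.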
> What are the use-cases that reservoir sampling are good for? What kind of statistical analysis can you do on the data that's returned by it?
Anything you can do on any unbiased sample? Or are you talking about the specific variant in the article where you do reservoir sampling afresh each second?
samwho · 2h ago
Good question. I'm not sure how suitable this would be to then do statistical analysis on what remains. You'd likely want to try and aggregate at source, so you're considering all data and then only sending up aggregates to save on space/bandwidth (if you were at the sort of scale that would require that).
The use-case I chose in the post was more focusing on protecting some centralised service while making sure when you do throw things away, you're not doing it in a way that creates blind-spots (e.g. you pick a rate limit of N per minute and your traffic is inherently bursty around the top of the minute and you never see logs for anything in the tail end of the minute.)
A fun recent use-case you might have seen was in https://onemillionchessboards.com. Nolen uses reservoir sampling to maintain a list of boards with recent activity. I believe he is in the process of doing a technical write-up that'll go into more detail.
TheAlchemist · 1h ago
Very nice post - thank you. This is how maths and stats should be taught.
Well done, I really like the animations and the explanation. Especially the case where it's a graph and we can drag ahead or click "shuffle 100 times"
One thing that threw me for a bit is when it switched from the intro of picking 3 cards at random from a deck of 10 or 436,234 to picking just one card. It seems as if it almost needs a section heading before "Now let me throw you a curveball: what if I were to show you 1 card at a time, and you had to pick 1 at random?" indicating that now we're switching to a simplifying assumption that we're holding only 1 card not 3, but we also don't know the size of the deck.
Nezteb · 3h ago
I loved the "Sometimes the hand of fate must be forced" comment!
samwho · 2h ago
Recovering WoW addict. :)
glial · 2h ago
This is really beautiful design, and excellent teaching. Thank you!
This is a really nicely written and illustrated post.
An advanced extension to this is that there are algorithms which calculate the number of records to skip rather than doing a trial per record. This has a good write-up of them: https://richardstartin.github.io/posts/reservoir-sampling
justanotheratom · 3h ago
Great article and explanation.
On a practical level though, this would be the last thing I would use for log collection. I understand that when there is a spike, something has to be dropped. What should this something be?
I don't see the point of being "fair" about what is dropped.
I would use fairness as a last resort, after trying other things:
Drop lower priority logs: If your log messages have levels (debug, info, warning, error), prioritize higher-severity events, discarding the verbose/debug ones first.
Contextual grouping: Treat a sequence of logs as parts of an activity. For a successful activity, maybe record only the start and end events (or key state changes) and leave out repetitive in-between logs.
Aggregation and summarization: Instead of storing every log line during a spike, aggregate similar or redundant messages into a summarized entry. This not only reduces volume but also highlights trends.
The article addressed this. In fact, you don't typically want to throw away all of the low priority logs ... you just want to limit them to a budget. And you want to limit the total number of log lines collected to a super budget.
From a data science perspective, the volume of the data also encodes really valuable information, so it’s good to also log the number of data points each one represents. For example, if the sampling rate comes out to be 10%, have a field that encodes 10. This way you can rebuild and estimate most statistics like count, sum, average, etc.
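As an illustration of rebuilding statistics from such a field, here is a small sketch. The field names `weight` and `duration_ms` are invented for the example; each sampled record is assumed to carry the number of original records it stands for.

```python
def estimate_totals(sampled, weight_field="weight", value_field="duration_ms"):
    """Estimate count, sum and mean of the original data from weighted samples."""
    est_count = sum(r[weight_field] for r in sampled)
    est_sum = sum(r[weight_field] * r[value_field] for r in sampled)
    est_mean = est_sum / est_count if est_count else 0.0
    return est_count, est_sum, est_mean


sampled_logs = [
    {"weight": 10, "duration_ms": 5},    # kept at a 10% sampling rate
    {"weight": 10, "duration_ms": 15},   # kept at a 10% sampling rate
    {"weight": 1, "duration_ms": 100},   # kept during a quiet period, nothing dropped
]
count, total, mean = estimate_totals(sampled_logs)
```

These are estimates, not exact values: the fewer records each weight is based on, the wider the error bars.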
stygiansonic · 5h ago
Great article and nice explanation. I believe this describes “Algorithm R” in this paper from Vitter, who was probably the first to describe it: https://www.cs.umd.edu/~samir/498/vitter.pdf
fanf2 · 1h ago
That paper says “Algorithm R (which is a reservoir algorithm due to Alan Waterman)” but it doesn’t have a citation. Vitter’s previous paper https://dl.acm.org/doi/10.1145/358105.893 cites Knuth TAOCP vol 2. Knuth doesn’t have a citation.
hinkley · 4h ago
This reminds me that I need to spend more time thinking about the algorithm the Allies used to count German tanks by serial number. The people in the field estimated about 5x as many tanks as were actually produced but the serial number trick was over 90% accurate.
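For reference, the textbook estimator for this "German tank problem" is short enough to sketch. It assumes serial numbers run 1..N and the observed ones are a uniform sample without replacement; whether wartime analysts used exactly this formula is not something this thread establishes. The function name is made up.

```python
def estimate_population_size(serials):
    """Minimum-variance unbiased estimate of N from observed serial numbers."""
    k = len(serials)   # how many serial numbers were observed
    m = max(serials)   # the largest serial number observed
    return m + m / k - 1


estimate = estimate_population_size([19, 40, 42, 60])  # 60 + 60/4 - 1 = 74.0
```

The intuition is that the largest observed serial undershoots N by roughly the average gap between observed serials, so the estimator adds that gap back.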
It seems like it could have some utility in places where hyperloglog isn’t quite right. YouTube recommendations pointed me at a Numberphile video on this a couple weeks ago:
This is a great post that also illustrates the tradeoffs inherent in telemetry collection (traces, logs, metrics) for analysis. It's a capital-H Hard space to operate in that a lot of developers either don't know about, or take for granted.
samwho · 2h ago
Something I've considered writing about in the past is how sampling affects the shape of lines on graphs. Render the same underlying data with different sampling strategies and show how the resulting graph can look extremely different depending on the strategy used. I think it's an underappreciated thing a lot of people don't think about when looking at their observability tools.
phillipcarter · 15m ago
Yeah it’s challenging. I work for such a tool and we re-weight counts which is generally the right move, but comes with its own subtleties like when you are looking for exact counts specifically to tune sampling, or your MoE is bad for the particular calculation and granularity of data.
Observability: easily one of the more underestimated fields in computing.
There's also a distributed version, easy with a map reduce.
Or the very simple algorithm: generate a random number paired with each item in the stream and keep the top N ordered by that random number.
lordnacho · 1h ago
I discovered this in one of those coding quizzes they give you to get a job. I was reviewing questions and one of them was this exact thing. I had no idea how to do it until I read the answer, and then it was obvious.
wood_spirit · 5h ago
I remember this turning up in a google interview back in the day. The interview was really expecting me not to know the algorithm and to flounder about trying to solve the problem from first principles. Was fun to just shortcut things by knowing the answer that time.
owyn · 5h ago
Yeah, this was a google interview question for me too. I didn't know the algorithm and floundered around trying to solve the problem. I came up with the 1/n and k/n selection strategy but still didn't get the job lol. I think the guy who interviewed me was just killing time until lunch.
I like the visualizations in this article, really good explanation.
dekhn · 5h ago
I didn't know about the algorithm until after I got hired there. It's actually really useful in a number of contexts, but my favorite was using it to find optimal split points for sharding lexicographically sorted string keys for mapping. Often you will have a sorted table, but the underlying distribution of keys isn't known, so uniform sharding will often cause imbalances where some mappers end up doing far more work than others. I don't know if there is a convenient open source class to do this.
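One way such split-point selection might look, as a sketch only: reservoir-sample the keys with the standard one-pass replacement scheme, sort the sample, and read off evenly spaced quantiles. The function `split_points` and its parameters are hypothetical, and this is not the internal tool described above. It assumes at least one key.

```python
import random


def split_points(keys, num_shards, sample_size=1000, rng=random):
    """Pick num_shards - 1 boundary keys that split the key space roughly evenly."""
    sample = []
    for i, key in enumerate(keys):
        if i < sample_size:
            sample.append(key)           # fill the reservoir first
        else:
            j = rng.randrange(i + 1)     # uniform in [0, i]
            if j < sample_size:          # true with probability sample_size / (i + 1)
                sample[j] = key
    sample.sort()
    # Evenly spaced quantiles of the sample approximate quantiles of all keys.
    return [sample[len(sample) * s // num_shards] for s in range(1, num_shards)]


keys = [f"{i:04d}" for i in range(10000)]
boundaries = split_points(keys, num_shards=4, sample_size=500, rng=random.Random(0))
```

Because the sample is uniform over the keys actually present, dense regions of the key space get more boundaries, which is what evens out the work per shard.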
wood_spirit · 4h ago
Interesting idea, hadn’t thought about that way to apply it.
I knew it from before my interview from a Turbo Pascal program I had seen that sampled DAT tape backups of patient records from a hospital system. These samples were used for studies. That was a textbook example of its utility.
dekhn · 4h ago
I guess the question in my mind is: would you expect a smart person who did not previously know this problem (or really much random sampling at all) to come up with the algorithm on the fly in an interview? And if the person had seen it before and memorized the answer, does that provide any signal of their ability to code?
samwho · 2h ago
My gut instinct is no. I certainly don't think I'd be able to derive this algorithm from first principles in a 60 minute whiteboarding interview, and I worked at Google for 4 years.
pixelbeat · 4h ago
FWIW GNU coreutils' shuf uses reservoir sampling for larger inputs to give bounded memory operation
foxbee · 4h ago
Wonderful illustrations and writing. Real interesting read.
[1]: https://en.wikipedia.org/wiki/Rock_ptarmigan
I’m the author of this post. Happy to answer any questions, and love to get feedback.
The code for all of my posts can be found at https://github.com/samwho/visualisations and is MIT licensed, so you’re welcome to use it :)
Another related topic is rendezvous hashing: https://en.wikipedia.org/wiki/Rendezvous_hashing
Tangentially related: https://www.keithschwarz.com/darts-dice-coins/ is a great write-up on the alias method for sampling from a discrete random distribution.
The colour palette is the Wong palette that I learned about from https://davidmathlogic.com/colorblind/.
Oh, and you can pet the dogs. :)
Reminds me a bit about https://distill.pub/
Was very sad when they announced their hiatus. Made me nervous about the viability of this sort of content.
You may also enjoy https://pudding.cool.
Reservoir sampling can handle all of that.
https://youtube.com/watch?v=WLCwMRJBhuI