Show HN: Made NZ's member of parliament financial disclosure data searchable (open-register-of-pecuniary-interests.joshmcarthur.com)

We lived in a rural area when I was a kid. My dad told me once that his buddy had to measure the ptarmigan[1] population in the mountains each year as part of his job.

He did this by hiking a fixed route, and at fixed intervals scare the birds so they would fly and count.

The total count was submitted to some office which used it to estimate the population.

One year he had to travel abroad when the counting had to be done, so he recruited a friend and explained in detail how to do it.

However when the day of the counting arrived his friend forgot, and it was a huge hassle anyway so he just submitted a number he figured was about right, and that was that.

Then one day the following year, the local newspaper had a frontpage headline stating "record increase in ptarmigan population".

The reason it was big news was that the population estimate was used to set the hunting quotas, something his friend had not considered...

[1]: https://en.wikipedia.org/wiki/Rock_ptarmigan

codr7 · 132d ago

Never trust statistics.

I once worked on a reservation system for some pretty big ski resorts.

We were running late, working nights, and one of the last things we had to finish was the official statistics reports about number of guest nights etc that gets published by the government.

Lets just say that the statistics that year had little to do with reality.

amw-zero · 132d ago

You’re confusing statistics with forecasting. We can and should trust statistics. We should just never trust their relation to future behavior.

Scrutiny6707 · 130d ago

>We can and should trust statistics. We should just never trust their relation to future behavior.

99% of people I've ever met believe statistics and forecasting are synonyms.

Mostly because it gets used that way in a political sense.

swiftcoder · 132d ago

Fairly sure the previous poster is describing statistics that were made up, rather than measured/sampled. Those are, for hopefully obvious reasons, not very trustworthy

stefs · 131d ago

so, the actual lesson to be learned was more along the lines of "never trust lies".

codr7 · 131d ago

Sure, but in this case the lies ended up packaged as the official tourism statistics report from the government, and I'm pretty sure it was neither the first nor last time that happened.

samwho · 132d ago

Hello! o/

I’m the author of this post. Happy to answer any questions, and love to get feedback.

The code for all of my posts can be found at https://github.com/samwho/visualisations and is MIT licensed, so you’re welcome to use it :)

eru · 132d ago

Very nice post!

Another interesting direction you can take reservoir sampling is instead of drawing a random number for each item (to see whether it replaces an existing item and which one), you generate a number from a geometric distribution telling you how many items you can safely skip before the next replacement.

That's especially interesting, if you can skip many items cheaply. Eg because you can fast forward on your tape drive (but you don't know up front how long your tape is), or because you send almost your whole system to sleep during skips.

For n items to sample from, this system does about O(k * log (n/k)) samples and skips.

Conceptually, I prefer the version of reservoir sampling that has you generate a fixed random 'priority' for each card as it arrives, and then you keep the top k items by priority around. That brings me to another related interesting algorithmic problem: selecting the top k items out of a stream of elements of unknown length in O(n) time and O(k) space. Naive approaches to reaching O(k) space will give you O(n log k) time, eg if you keep a min heap around.

What you can do instead is keep an unordered buffer of capacity up to 2k. As each item arrives, you add it to the buffer. When your buffer is full, you prune it to the top k element in O(k) with eg randomised quickselect or via median-of-medians. You do that O(2k) work every k elements for n elements total, given you the required O(n) = O(n * 2*k / k) runtime.

Another related topic is rendezvous hashing: https://en.wikipedia.org/wiki/Rendezvous_hashing

Tangentially related: https://www.keithschwarz.com/darts-dice-coins/ is a great write-up on the alias method for sampling from a discrete random distribution.

samwho · 132d ago

I actually read that post on the alias method just the other day and was blown away. I think I’d like to try making a post on it. Wouldn’t be able to add anything that link hasn’t already said, but I think I can make it more accessible.

eru · 132d ago

I have a few more topics we could cooperate on, if you are interested.

https://claude.ai/public/artifacts/62d0d742-3316-421b-9a7b-d... has a 'very static' visualisation of sorting algorithms. Basically, we have a 2d plane, and we colour a pixel (x, y) black iff the sorting algorithm compares x with y when it runs. It's a resurrection (with AI) of an older project I was coding up manually at https://github.com/matthiasgoergens/static-sorting-visualisa...

I'm also working on making https://cs.stackexchange.com/q/56643/50292 with its answer https://cs.stackexchange.com/a/171695/50292 more accessible. It's a little algorithmic problem I've been working on: 'simulate' a heap in O(n) time. I'm also developing a new, really simple implementation of soft heaps. And on my write-up for the solution to https://github.com/matthiasgoergens/TwoTimePad/blob/master/d...

> I actually read that post on the alias method just the other day and was blown away. I think I’d like to try making a post on it. Wouldn’t be able to add anything that link hasn’t already said, but I think I can make it more accessible.

If memory serves right, they don't do much about how you can efficiently support changes to your discrete probability distribution.

samwho · 132d ago

I appreciate the offer (and your contributions in the comments here!) but collaborations are very difficult for me atm. Most of the work I do on these posts I do when I can steal time away from other aspects of my life, which can sometimes take weeks. I wouldn’t be a dependable collaboration partner.

eru · 132d ago

No worries.

I'd mostly just appreciate a beta tester / beta reader.

samwho · 132d ago

Totally happy to do that! You’ll find where to contact me on my homepage. :)

smusamashah · 131d ago

I made a tool to visualize sorting algos https://xosh.org/VisualizingSorts/sorting.html where you can put your own algo too if you like.

ncruces · 131d ago

I love the idea behind that sorting visualization, and found it extremely useful to validate the properties of my Quicksort implementation.

https://github.com/ncruces/sort

eru · 131d ago

That's interesting. Alas, it only works for in-place sorting algorithms (and it's also an animation).

tmoertel · 132d ago

A while ago I tried to create a more self-explanatory implementation:

https://github.com/tmoertel/practice/blob/master/libraries%2...

It is limited to integer weights only to make it easy to verify that the algorithm implements the requested distribution exactly. (See the test file in the same directory.)

eru · 132d ago

You could probably restrict to rational numbers, and still verify? Languages like Python, Haskell, Rust etc have good support for arbitrary length rational numbers.

Each floating point number is also a rational number, and thus you could then restrict again to floating point afterwards.

dan-robertson · 132d ago

Alias tables are neat and not super well known. We used to have an interview question around sampling from a weighted distribution (typical answer: prefix sum -> binary search) and I don’t think anyone produced this. I like the explanation in that blog. The way it was explained to me was first ‘imagine drawing a bar chart and throwing a dart at it, retrying if you miss. This simulates the distribution but runs in expected linear time’. Then you can describe how to chop up the bars to fit in the rectangle you would get if all weights were equal. Proof that the greedy algorithm works is reasonably straightforward.

eru · 132d ago

I'm not actually sure this makes for a good interview question. Doesn't it mostly just test whether you've heard of the alias method?

Btw, a slightly related question:

Supposed you have a really long text file, how would you randomly sample a line? Such that all lines in the text file have the exactly same probability. Ideally, you want to do this without spending O(size of file) time preprocessing.

(I don't think this is a good interview question, but it is an interesting question.)

One way: sample random characters until you randomly hit a newline. That's the newline at the end of your line.

dan-robertson · 131d ago

It’s a retired question (so I’m not really disagreeing that it wasn’t very good), and no one was expected to get the alias tables (if they did, just ask for updateable weights) and in fact there isn’t even much point in telling people about them as they can then get the impression they failed the interview. The point is more to get some kind of binary search and understanding of probability.

The Monte Carlo method you propose probably works for files where there are many short lines but totally fails in the degenerate case of one very long line. It also may not really work that well in practice because most of the cost of reading a random byte is reading a big block from disk, and you could likely scan such a block in ram faster than you could do the random read of the block from disk.

hansvm · 132d ago

That's exactly the blog post that clicked when I put my alias method [0] together. Their other writing is delightful as well.

[0] https://github.com/hmusgrave/zalias It's nothing special, just an Array-of-Struct-of-Array implementation so that biases and aliases are always in the same cache line.

jononor · 131d ago

Skipping like that is very interesting in battery-powered sensor systems, where you can put the system to sleep until it is time to sample.

fiddlerwoaroof · 132d ago

Does this method compose with itself? E.g. if I implement reservoir sampling in my service and then the log collector service implements reservoir sampling, is the result the same as if only the log collector implemented it?

NoahZuniga · 132d ago

Yes

samwho · 132d ago

I hadn’t considered this, cool to know it works!

eru · 132d ago

Though I think it's only strictly true, if the intervals you sample over are the same. Eg they both sample some messages every second, and the all start their second-long intervals on the same nanosecond (or close enough).

I find it easier to reason about reservoir sampling in an alternative formulation: the article talks about flipping a random (biased) coin for each arrival. Instead we can re-interpret reservoir sampling as assigning a random priority to each item, and then keeping the items with the top k priority.

It's fairly easy to see in this reformulation whether specific combinations of algorithms would compose: you only need to think about whether they would still select the top k items by priority.

fiddlerwoaroof · 132d ago

The second formulation sounds easier to use to adapt to specific use cases too: just bump the priority of a message based on your business rules to make it more likely that interesting events get to your log database.

eru · 132d ago

You could do (category, random priority) and then do lexicographic comparison. That way higher categories always outrank lower categories.

But depending on what you need, you might also just do (random priority + weight * category) or so. Or you just keep separate reservoirs for high importance items and for everything else.

BobaFloutist · 131d ago

I would expect any way to get a truly fair sample from a truly fair sample would necessarily result in a truly fair sample. I can't imagine how it could possibly not.

eru · 131d ago

You are dropping a lot of context here.

In the first instance, every second we get a 'truly fair' random sample from all the messages in that second.

Going from there to eg a 'truly fair' random sample from all the messages in a minute is not trivial. And it's not even possible just from the samples, without auxiliary information.

fiddlerwoaroof · 131d ago

I always find the interaction between probability distributions a little surprising.

rdtsc · 132d ago

Well done, I really like the animations and the explanation. Especially the case where it's a graph and we can drag ahead or click "shuffle 100 times"

One thing that threw me for a bit is when it switched from the intro of picking 3 cards at random from a deck of 10 or 436,234 to picking just one card. It's seems as if it almost needs a section heading before "Now let me throw you a curveball: what if I were to show you 1 card at a time, and you had to pick 1 at random?" indicating that now we're switching to a simplifying assumption that we're holding only 1 card not 3, but we also don't know the size of the deck.

malwrar · 132d ago

Love your website’s design, I find all of interactivity, the dog character as an “audience”, and even the font/color/layout wonderful. Loved the article too!

samwho · 132d ago

Thank you so much!

The dogs on the playing cards were commissioned just for this post. They’re all made by the wonderful https://www.andycarolan.com/.

The colour palette is the Wong palette that I learned about from https://davidmathlogic.com/colorblind/.

Oh, and you can pet the dogs. :)

simonw · 132d ago

I love the commissioned art! Found this post with more details: https://samwho.dev/dogs/

zerd · 132d ago

Just noticed the physics simulator at the top is interactive. Then I was stacking squares on top of each other to see how tall I could make it, and started throwing things at it angry birds style. Fun stuff.

samwho · 132d ago

Something no one seems to have realised yet is that the hero simulation at the top of the page is using reservoir sampling to colour 3 of the shapes black.

DonHopkins · 132d ago

So many nice touches that combine to be much more than the sum of the parts.

Doe's bandana is cool, your dogs must worship you for your commitment to them!

My only suggestion is a way to slow down or ^S the log to read the funny messages, since they were flying by so fast I could only get a glimpse, even with reservoir sampling.

something something "needs more emojis"! ;)

samwho · 132d ago

Doe’s bandana is my attempt at tasteful solidarity and support. Glad you noticed it!

I did consider a pause button for the logs but it felt too unsubtle, and distracts from the content of the post. You could argue the log messages are already distracting, but I really wanted my own take on “reticulating splines.”

You can read how the messages are constructed here: https://github.com/samwho/visualisations/blob/main/reservoir...

DonHopkins · 131d ago

I THOUGHT I saw a a Herman-Miller chair fly by! So I'm not going crazy. whew

lol768 · 132d ago

It would've been easy to just use green for the held card and red for the discard pile.

Thank you for using a colour-blind friendly palette; as someone with deuteranopia :)

samwho · 132d ago

You're welcome! I think it's a beautiful palette, and I think people have come to associate me with it now so I don't think I'll ever change.

I view all of my posts using the various colour blindness filters in the Chrome dev tools during development, to make sure I'm not using any ambiguous pairings. I'm glad that effort made you feel welcome and able to enjoy the content fully.

nightpool · 132d ago

I loved the graphics!

However, I'm not sure I understand the statistical soundness of this approach. I get that every log during a given period has the same chance to be included, but doesn't that mean that logs that happen during "slow periods" are disproportionately overrepresented in overall metrics?

For example, if I want to optimize my code, and I need to know which endpoints are using the most time across the entire fleet to optimize my total costs (CPU-seconds or whatever), this would be an inappropriate method to use, since endpoints that get bursty traffic would be disproportionally underrepresented compared to endpoints that get steady constant traffic. So I'd end up wasting my time working on endpoints that don't actually get a lot of traffic.

Or if I'm trying to plan capacity for different services, and I want to know how many nodes to be running for each service, services that get bursty traffic would be underrepresented as well, correct?

What are the use-cases that reservoir sampling are good for? What kind of statistical analysis can you do on the data that's returned by it?

eru · 132d ago

> However, I'm not sure I understand the statistical soundness of this approach. I get that every log during a given period has the same chance to be included, but doesn't that mean that logs that happen during "slow periods" are disproportionately overrepresented in overall metrics?

Yes, of course.

You can fix this problem, however. There are (at least) two ways:

You can do an alternative interpretation and implementation of reservoir sampling: for each item you generate and store a random priority as it comes into the system. For each interval (eg each second) you keep the top k items by priority. If you want to aggregate multiple intervals, you keep the top k (or less) items over the intervals.

This will automatically deal with dealing all items the same, whether they arrived during busy or non-busy periods.

An alternative view of the same approach doesn't store any priorities, but stores the number of dropped items each interval. You can then do some arithmetic to tell you how to combine samples from different intervals; very similar to what's in the article.

> What are the use-cases that reservoir sampling are good for? What kind of statistical analysis can you do on the data that's returned by it?

Anything you can do on any unbiased sample? Or are you talking about the specific variant in the article where you do reservoir sampling afresh each second?

samwho · 132d ago

Good question. I'm not sure how suitable this would be to then do statistical analysis on what remains. You'd likely want to try and aggregate at source, so you're considering all data and then only sending up aggregates to save on space/bandwidth (if you were at the sort of scale that would require that).

The use-case I chose in the post was more focusing on protecting some centralised service while making sure when you do throw things away, you're not doing it in a way that creates blind-spots (e.g. you pick a rate limit of N per minute and your traffic is inherently bursty around the top of the minute and you never see logs for anything in the tail end of the minute.)

A fun recent use-case you might have seen was in https://onemillionchessboards.com. Nolen uses reservoir sampling to maintain a list of boards with recent activity. I believe he is in the process of doing a technical write-up that'll go into more detail.

tambourine_man · 132d ago

Your whole blog seems like a treasure trove. I’m glad I found it, thanks for sharing.

Nezteb · 132d ago

I loved the "Sometimes the hand of fate must be forced" comment!

samwho · 132d ago

Recovering WoW addict. :)

TheAlchemist · 132d ago

Very nice post - thank you. This is how maths and stats should be taught.

Reminds me a bit about https://distill.pub/

samwho · 132d ago

I loved distill.pub. Usually didn’t fully grasp what the papers were about but they were beautiful and I usually got _something_ out of them.

Was very sad when they announced their hiatus. Made me nervous about the viability of this sort of content.

You may also enjoy https://pudding.cool.

purrcat259 · 132d ago

I just listened to your episode on fafo.fm a few days ago and recognised your handle and already knew the link was worth clicking. Your stuff is awesome!

pandaxtc · 132d ago

This is awesome, I loved the interactivity!

slig · 132d ago

It's so nice to read things written by someone who cares. Thank you so much for sharing!

istjohn · 132d ago

Does your blog have an RSS feed?

subract · 132d ago

Looks like it - https://samwho.dev/rss.xml

samwho · 132d ago

That's the one! Thank you.

Matumio · 132d ago

The post looks interesting, but I'm still building my triangle castle so I haven't read it yet. This probably means I should read it later when I'm less distracted. Anyway, already love the layout and the playful visualisations.

glial · 132d ago

This is really beautiful design, and excellent teaching. Thank you!

justanotheratom · 132d ago

Great article and explanation.

On a practical level though, this would be the last thing I would use for log collection. I understand that when there is a spike, something has to be dropped. What should this something be?

I don't see the point of being "fair" about what is dropped.

I would use fairness as a last resort, after trying other things:

Drop lower priority logs: If your log messages have levels (debug, info, warning, error), prioritize higher-severity events, discarding the verbose/debug ones first.

Contextual grouping: Treat a sequence of logs as parts of an activity. For a successful activity, maybe record only the start and end events (or key state changes) and leave out repetitive in-between logs.

Aggregation and summarization: Instead of storing every log line during a spike, aggregate similar or redundant messages into a summarized entry. This not only reduces volume but also highlights trends.

manmal · 132d ago

I’ve been down the observability rabbit hole recently, and what you’re describing is probably a mix of head and tail sampling: https://docs.honeycomb.io/manage-data-volume/sample/

justanotheratom · 132d ago

honeycomb seems quite mature, thanks.

ted_dunning · 132d ago

The article addressed this. In fact, you don't typically want to throw away all of the low priority logs ... you just want to limit them to a budget. And you want to limit the total number of log lines collected to a super budget.

Reservoir sampling can handle all of that.

HelloNurse · 132d ago

You should drop or consolidate some entries if you can, but then the important entries that remain can still be too many and require random culling because anything is better than choking.

Fair reservoir sampling can be made unfair in controlled ways (e.g. by increasing the probability of retaining an entry if its content is particularly interesting); it competes with less principled biased random (or less than random) selection algorithms as a technique of last resort.

sadiq · 132d ago

This is a really nicely written and illustrated post.

An advanced extension to this is that there are algorithms which calculate the number of records to skip rather than doing a trial per record. This has a good write-up of them: https://richardstartin.github.io/posts/reservoir-sampling

Lichtso · 132d ago

The Weighted Reservoir Sampling (WRS) variant is used in ReSTIR (spatiotemporal reservoir resampling for real-time ray tracing). Which is a stochastic light transport estimator with inbuilt spatiotemporal denoising.

A light transport estimator is trying to figure out how much light flows through a scene (https://en.wikipedia.org/wiki/Radiance). For that it has to integrate the radiance across all the possible paths light could take, while maintaining the conservation of energy (https://en.wikipedia.org/wiki/Rendering_equation).

In all but the most trivial cases this integral of the rendering equation has no tractable closed form solution and solving it is thus done stochastically. The very basic idea is the Monte Carlo method (https://en.wikipedia.org/wiki/Monte_Carlo_method): Randomly sample as many paths as you can and average them. From there more sophisticated sampling strategies were developed over the last decades:

- Importance Sampling (IS)

- Multiple Importance Sampling (MIS)

- Sample Importance Resampling (SIR)

- Resampled Importance Sampling (RIS)

- Weighted Reservoir Sampling (WRS)

- And finally combining RIS and WRS into ReSTIR

For a in depth read see: https://agraphicsguynotes.com/posts/understanding_the_math_b...

hinkley · 132d ago

This reminds me that I need to spend more time thinking about the algorithm the allies used to count German tanks by serial number. The people in the field estimated about 5x as many tanks as were actually produced but the serial number trick was over 90% accurate.

dekhn · 132d ago

https://en.wikipedia.org/wiki/German_tank_problem

hinkley · 132d ago

It seems like it could have some utility in places where hyperloglog isn’t quite right. YouTube recommendations pointed me at a Numberphile video on this a couple weeks ago:

https://youtube.com/watch?v=WLCwMRJBhuI

skykooler · 132d ago

An interesting corollary of this is that if you only have a single sample, it reduces to indicating that your sample is the median value - i.e. if you see one item with serial number N, you can guess that there were roughly 2N produced.

hinkley · 131d ago

You do have outliers though. Seal Team 6 is actually Seal Team 1 but they wanted people to think they were outnumbered.

stygiansonic · 132d ago

Great article and nice explanation. I believe this describes “Algorithm R” in this paper from Vitter, who was probably the first to describe it: https://www.cs.umd.edu/~samir/498/vitter.pdf

fanf2 · 132d ago

That paper says “Algorithm R (which is a reservoir algorithm due to Alan Waterman)” but it doesn’t have a citation. Vitter’s previous paper https://dl.acm.org/doi/10.1145/358105.893 cites Knuth TAOCP vol 2. Knuth doesn’t have a citation.

svat · 132d ago

Knuth also says that "Algorithm R is due to Alan G. Waterman", on TAOCP vol 2 page 144, just below "Algorithm R (Reservoir sampling)". This blog post seems to be a good history of the algorithm: https://markkm.com/blog/reservoir-sampling/ (it was given by Waterman in a letter to Knuth, as an improvement of Knuth's earlier "reservoir sampling" from the first edition).

> All in all, Algorithm R was known to Knuth and Waterman by 1975, and to a wider audience by 1981, when the second edition of The Art of Computer Programming volume 2 was published.

stygiansonic · 132d ago

Interesting! If Knuth is not the original author then they’ve been lost to the sands of time

tanvach · 132d ago

From data science perspective, the volume of the data also encodes really valuable information, so it’s good to also log the number of data points each one represents. For example, if sampling rate comes out to be 10%, have a field that encodes 10. This way you can rebuild and estimate most statistics like count, sum, average, etc.

gregable · 132d ago

Very well put together. If you are curious about the weighted version, I tried to explain it some here: https://gregable.com/2007/10/reservoir-sampling.html

There's also a distributed version, easy with a map reduce.

Or the very simple algorithm: generate a random paired for each item in the stream and keep the top N ordered by that random.

tmoertel · 132d ago

Two notes on the weighted version. First, the straightforward implementation of selecting the top N when ranked by POW(RANDOM(), 1.0 / weight) has stability problems when the weights are very large or very small. Second, the resulting sample does not have the same distribution in expectation as the population from which it was drawn. This is especially so when the overall weight is concentrated in a small number of population elements. But such samples are workable approximations in many cases.

I discuss these issues more here: https://blog.moertel.com/posts/2024-08-23-sampling-with-sql....

hansvm · 132d ago

This is a great post, very approachable, with excellent visualizations.

We use a variation of this sort of thing at $WORK to solve a related problem, where you want to estimate some percentile from a running stream, with the constraints that the percentile you want to choose changes from time to time but is generally static for a trillion or more iterations (and the underlying data is quasi-stationary). If you back the process by a splay tree, you can get amortized O(1) percentile estimates (higher error bars for a given RAM consumption than a number of other techniques, but very fast).

You can also play with the replacement probability to, e.g., have a "data half-life" (denominated either in time or discrete counts) and bias the estimate toward recent events, which is more suitable for some problems.

phillipcarter · 132d ago

This is a great post that also illustrates the tradeoffs inherent in telemetry collection (traces, logs, metrics) for analysis. It's a capital-H Hard space to operate in that a lot of developers either don't know about, or take for granted.

samwho · 132d ago

Something I've considered writing about in the past is how sampling affects the shape of lines on graphs. Render the same underlying data with different sampling strategies and show how the resulting graph can look extremely different depending on the strategy used. I think it's an underappreciated thing a lot of people don't think about when looking at their observability tools.

phillipcarter · 132d ago

Yeah it’s challenging. I work for such a tool and we re-weight counts which is generally the right move, but comes with its own subtleties like when you are looking for exact counts specifically to tune sampling, or your MoE is bad for the particular calculation and granularity of data.

Observability: easily one of the more underestimated fields in computing.

YZF · 132d ago

Sampling theorem.

It's interesting that people seem to think that sampling mathematics somehow applies to modems or RF but not to the data they are looking at. Things like aliasing absolutely matter for observability/telemetry.

Coeur · 131d ago

I love this kind of dynamic web interactive education. The other ones that come to mind are Bret Victor https://worrydream.com/ , Bartosz Ciechanowski https://ciechanow.ski/ , and Nicky Case has a nice page many more at https://explorabl.es/ .

pixelbeat · 132d ago

FWIW GNU coreutils' shuf uses reservoir sampling for larger inputs to give bounded memory operation

wood_spirit · 132d ago

I remember this turning up in a google interview back in the day. The interview was really expecting me not to know the algorithm and to flounder about trying to solve the problem from first principles. Was fun to just shortcut things by knowing the answer that time.

owyn · 132d ago

Yeah, this was a google interview question for me too. I didn't know the algorithm and floundered around trying to solve the problem. I came up with the 1/n and k/n selection strategy but still didn't get the job lol. I think the guy who interviewed me was just killing time until lunch.

I like the visualizations in this article, really good explanation.

dekhn · 132d ago

I didn't know about the algorithm until after I got hired there. It's actually really useful in a number of contexts, but my favorite was using it to find optimal split points for sharding lexicographically sorted string keys for mapping. Often you will have a sorted table, but the underlying distribution of keys isn't known, so uniform sharding will often cause imbalances where some mappers end up doing far more work than others. I don't know if there is a convenient open source class to do this.

wood_spirit · 132d ago

Interesting idea, hadn’t that about that way to apply it.

I knew it from before my interview from a turbo pascal program I had seen that sampled dat tape backups of patient records from a hospital system. These samples were used for studies. That was a textbook example of it’s utility.

dekhn · 132d ago

I guess the question in my mind is: would you expect a smart person who did not previously know this problem (or really much random sampling at all) to come up with the algorithm on the fly in an interview? And if the person had seen it before and memorized the answer, does that provide any signal of their ability to code?

samwho · 132d ago

My gut instinct is no. I certainly don't think I'd be able to derive this algorithm from first principles in a 60 minute whiteboarding interview, and I worked at Google for 4 years.

wood_spirit · 132d ago

They wanted to see your analytical thinking skills at work. To pass you only needed to be sensible. You didn’t fail the interview if you couldn’t invent reservoir sampling!

dekhn · 131d ago

uh, no, people would get a fail on the question if they didn't correctly identify both the initial selection and sample acceptance criteria.

petters · 132d ago

This also got me past one interview. I came up with k/n but now I think it's better to just generate a random float in [0, 1] and keep the k largest ones

tuzemec · 132d ago

Nice! This is almost in the same league with Bartosz Ciechanowski's articles - https://ciechanow.ski/

samwho · 131d ago

Bartosz is a huge inspiration to me, you can likely tell that there are plenty of things he does in his post that I'm emulating in mine. It's maybe the best compliment you can give me to compare my work to his. Thank you.

tomsonj · 131d ago

very cool - made me consider the contrasts with token-bucket rate limiting for log collection and stumbled across a interesting discussion https://github.com/open-telemetry/opentelemetry-specificatio...

lordnacho · 132d ago

I discovered this in one of those coding quizzes they give you to get a job. I was reviewing questions and one of them was this exact thing. I had no idea how to do it until I read the answer, and then it was obvious.

vismit2000 · 132d ago

The famous 4 mins explanation of reservoir sampling: https://www.youtube.com/watch?v=A1iwzSew5QY

jbellis · 132d ago

Used in Dropwizard Metrics, among other places. https://metrics.dropwizard.io/4.2.0/

WhitneyLand · 132d ago

Reminds me a bit of the Monte Hall problem, probability adjustment based on conditional information can lead to counterintuitive results.

foxbee · 132d ago

Wonderful illustrations and writing. Real interesting read.

Iwan-Zotow · 132d ago

Interesting

t55 · 132d ago

great article!

Show HN: The text disappears when you screenshot it (unscreenshottable.vercel.app)

Show HN: 47jobs – A Fiverr/Upwork for AI Agents (47jobs.xyz)

Show HN: Diamants, an open-source drone swarm to fight wildfires (github.com)

Show HN: Pgmcp, an MCP server to query any Postgres database in natural language (github.com)

Show HN: Made NZ's member of parliament financial disclosure data searchable (open-register-of-pecuniary-interests.joshmcarthur.com)

Show HN: A store that generates products from anything you type in search (anycrap.shop)

Show HN: A PSX/DOS style 3D game written in Rust with a custom software renderer (totenarctanz.itch.io)

Show HN: I built a platform for long-form media recs (books, articles, etc.) (rhomeapp.com)

Show HN: STT –> LLM –> TTS pipeline in C (github.com)

Show HN: I reverse engineered macOS to allow custom Lock Screen wallpapers (cindori.com)

Show HN: Small Transfers – charge from 0.000001 USD per request for your SaaS (smalltransfers.com)

Show HN: Pyproc – Call Python from Go Without CGO or Microservices (github.com)

Show HN: Daffodil – Open-Source Ecommerce Framework to connect to any platform (github.com)

Show HN: Chibi Izumi, staged dependency injection for Python (github.com)

Show HN: Omarchy on CachyOS (github.com)

Show HN: Dagger.js – A buildless, runtime-only JavaScript micro-framework (daggerjs.org)

Show HN: AI-powered web service combining FastAPI, Pydantic-AI, and MCP servers (github.com)

Show HN: MeldSecurity – Run Popular Security Tools in the Browser (Free) (meldsecurity.com)

Show HN: Semlib – Semantic Data Processing (github.com)

Show HN: A Cyberpunk Tuner (un.bounded.cc)

Show HN: OrderlyID – typed, time-sortable, 160-bit IDs with checksums (github.com)

Show HN: LLMyourself.com – Type a name. Get a report. (llmyourself.com)

Show HN: Vatify – Simple API for EU VAT validation and rate calculation (vatifytax.app)

Show HN: Web-based 2D geometry calculator (ccorcos.github.io)

Show HN: Vicinae – A native, Raycast-compatible launcher for Linux (github.com)

Show HN: I wrote a from-scratch OS to serve my blog (github.com)

Show HN: A GPT Realtime Web Game Where You Convince Aliens Not to Invade (gameorchard.beer)

Show HN: I made a generative online drum machine with ClojureScript (dopeloop.ai)

Show HN: MCP Server Installation Instructions Generator (hyprmcp.com)

Show HN: Term.everything – Run any GUI app in the terminal (github.com)

Show HN: Ghostpipe – Connect files in your codebase to user interfaces (github.com)

Show HN: AI Code Detector – detect AI-generated code with 95% accuracy (code-detector.ai)

Show HN: Datadef.io – Canvas for data lineage and metadata management (datadef.io)

Show HN: I built an app store for open-source financial plans (on spreadsheets) (finfam.app)

Show HN: Building a Deep Research Agent Using MCP-Agent (thealliance.ai)

Show HN: Coding AI Agent API for Developers (workser.ai)

Show HN: Ruminate – AI reading tool for understanding hard things (tryruminate.com)

Show HN: Drop-in Redis replacement in Rust with 5M+ GET/s (github.com)

Show HN: Quizquestions.org – A free library for quiz questions (quizquestions.org)

Show HN: TailGuard – Bridge your WireGuard router into Tailscale via a container (github.com)

Show HN: Ultraplot – A succint wrapper for matplotlib (github.com)

Show HN: Bottlefire – Build single-executable microVMs from Docker images (bottlefire.dev)

Show HN: CLAVIER-36 – A programming environment for generative music (clavier36.com)

Show HN: Blocks – Dream work apps and AI agents in minutes (blocks.diy)

Show HN: HuMo AI – Create Realistic Videos with Text, Image, and Audio Inputs (humoai.co)

Show HN: npm-daycare, an NPM proxy that filters out recent & small packages (github.com)

Show HN: Making a cross-platform game in Go using WebRTC Datachannels (pion.ly)

Show HN: Scientific Calculator for Android (play.google.com)

Show HN: Haystack – Review pull requests like you wrote them yourself (haystackeditor.com)

Show HN: C++ Compiler Support Page (cppstat.dev)

Reservoir Sampling

Comments (105)