Anscombe's Quartet

48 gidellav 14 9/8/2025, 9:29:36 AM en.wikipedia.org ↗

Comments (14)

__mharrison__ · 11m ago
I teach curve fitting with this dataset and recently added a fifth dataset. It illustrates Simpson's paradox.

https://www.linkedin.com/posts/panela_loved-adding-ancombes-...
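
For readers who haven't met Simpson's paradox: a trend that holds inside every group can reverse when the groups are pooled. Here is a minimal sketch with synthetic numbers made up for this illustration (they are not the fifth dataset mentioned above, and `ols_slope` is a helper name invented here):

```python
from statistics import mean

def ols_slope(xs, ys):
    """Least-squares slope of y on x."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

# Two synthetic groups; within each group y falls as x rises.
group_a = ([1, 2, 3, 4], [6, 5, 4, 3])
group_b = ([6, 7, 8, 9], [11, 10, 9, 8])

slope_a = ols_slope(*group_a)
slope_b = ols_slope(*group_b)

# Pool the groups and fit a single line through all eight points.
pooled_x = group_a[0] + group_b[0]
pooled_y = group_a[1] + group_b[1]
slope_pooled = ols_slope(pooled_x, pooled_y)

print(f"group A slope: {slope_a:+.2f}")
print(f"group B slope: {slope_b:+.2f}")
print(f"pooled slope:  {slope_pooled:+.2f}")
```

Both within-group slopes are negative while the pooled slope is positive, because group B sits higher on both axes than group A. A scatter plot coloured by group makes this obvious at a glance.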

INGELRII · 10m ago
Always visualize first. Human 'eyeballing' is a good pattern detector.

Linear correlation is just one pattern the data can have.

Unfortunately, many social science publications have reviewers who know only the basics and can't judge or accept statistically valid analysis that is outside their competence. Either you fit a line or you get nothing.
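
To make the "linear correlation is just one pattern" point concrete, here is a rough sketch that computes the usual summary numbers for the four quartet datasets using only the Python standard library. The data values are the published Anscombe (1973) figures; the helper names `pearson_r` and `fit_line` are made up for this example.

```python
from math import sqrt
from statistics import mean, variance

x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I":   (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II":  (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV":  ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
            [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

def pearson_r(xs, ys):
    """Pearson correlation coefficient."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def fit_line(xs, ys):
    """Least-squares line; returns (slope, intercept)."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, my - slope * mx

summary = {}
for name, (xs, ys) in quartet.items():
    slope, intercept = fit_line(xs, ys)
    summary[name] = {
        "mean_x": mean(xs), "var_x": variance(xs),
        "mean_y": mean(ys), "var_y": variance(ys),
        "r": pearson_r(xs, ys), "slope": slope, "intercept": intercept,
    }
    print(name, {k: round(v, 3) for k, v in summary[name].items()})
```

The printed rows agree to about two decimal places even though the four scatter plots look nothing alike, which is the whole argument for plotting before fitting.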

flpm · 2h ago
And check this one, which is a generalization of the Datasaurus where you can define your own shapes :D

https://github.com/stefmolin/data-morph

moi2388 · 1h ago
From now on I won’t trust any statistic unless I can transform it into a panda.
jihadjihad · 1h ago
Often there is little or no substitute for plotting the data to see how it is distributed. A scatter plot, histogram, density plot, etc. is almost always going to tell you a "story" about the data that the summary stats will have compressed.

But sometimes you are at the mercy of the data and your visualization of choice. Box plots, for example, are great at showing more than just how the data is centered, but it is possible to encounter situations where the box plots of the data remain static while the underlying data is clearly changing [0].

As always it is good to know about these things and continue to add to the arsenal (violin plots, in the example above) of tools and intuition needed to tease out the story behind the data.

0: https://www.research.autodesk.com/publications/same-stats-di...
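
The box plot caveat is easy to reproduce on a small scale. A minimal sketch, assuming two nine-point samples invented for this illustration (they are not taken from the linked publication) and a helper `five_number_summary` that is likewise made up here:

```python
from statistics import pstdev, quantiles

def five_number_summary(data):
    """Min, Q1, median, Q3, max: the numbers a basic box plot draws."""
    q1, median, q3 = quantiles(data, n=4, method="inclusive")
    return (min(data), q1, median, q3, max(data))

# Invented samples: one spread fairly evenly, one bunched around the quartiles.
spread_out = [0, 1, 2.5, 4, 5, 6, 7.5, 9, 10]
clustered = [0, 2.4, 2.5, 2.6, 5, 7.4, 7.5, 7.6, 10]

summary_spread = five_number_summary(spread_out)
summary_clustered = five_number_summary(clustered)

print("spread_out:", summary_spread, "pstdev =", round(pstdev(spread_out), 3))
print("clustered: ", summary_clustered, "pstdev =", round(pstdev(clustered), 3))
print("same box plot numbers:", summary_spread == summary_clustered)
```

Both samples produce the same five numbers, so their box plots would be drawn identically, yet the points are arranged differently. A strip plot or violin plot would show the clustering that the box hides.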

djoldman · 2h ago
sunrunner · 2h ago
Content warning: This is a baker’s dozen not a regular dozen, in case anyone clicks through expecting to find twelve and is mildly and briefly perturbed.
djoldman · 2h ago
The scary thing is that, yeah, we can see these in 2D and maybe 3D. But ...

usually there are more than 2 or 3 columns in our data :(

imurray · 1h ago
It's clearly hard, but there are tools for doing exploratory visualization of high-dimensional data: GGobi http://ggobi.org/ and all the ones that arrange points while trying to get local neighborhoods correct (t-SNE, UMAP, et al.).
efavdb · 2h ago
The example shows that the usual stats aren't enough to pin down the true data. But in practice I imagine / wonder if these stats really are reasonable "sufficient stats" because the probability of seeing data with strong structure is unlikely in most contexts. In other words...

p(data | stats) = p(stats | data) * p(data) / p(stats).

and p(data) is only strong for a "blob / cloud" of points, so when there's some correlation the observed stats tell you that you likely have a blob having some degree of correlation.
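
One way to read the Bayes' rule argument above: if several dataset shapes all reproduce the observed stats, then p(stats | data) is the same for each of them and the posterior is just the prior again. A small sketch under that assumption; the shape names and prior weights are hypothetical numbers chosen only for illustration:

```python
# Hypothetical prior beliefs over dataset "shapes" (invented numbers).
prior = {
    "blob": 0.90,
    "curve": 0.04,
    "line_plus_outlier": 0.03,
    "vertical_plus_outlier": 0.03,
}

# Assumption: every shape reproduces the observed summary stats equally well,
# so the likelihood p(stats | shape) is the same constant for all of them.
likelihood = {shape: 1.0 for shape in prior}

# Bayes' rule: p(shape | stats) = p(stats | shape) * p(shape) / p(stats)
evidence = sum(likelihood[s] * prior[s] for s in prior)
posterior = {s: likelihood[s] * prior[s] / evidence for s in prior}

for shape in prior:
    print(f"{shape:>22}: prior={prior[shape]:.2f}  posterior={posterior[shape]:.2f}")
```

Under these assumptions the stats carry no information about shape at all, so any confidence that the data is a "blob" comes entirely from the prior rather than from the summary numbers.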

aredox · 30m ago
>But in practice I imagine / wonder if these stats really are reasonable "sufficient stats" because the probability of seeing data with strong structure is unlikely in most contexts.

We have just spent the five years since COVID appeared arguing about statistics, with tons of bad analysis of very complicated data fuelling political rage to this day.

The US health secretary is currently using data with "strong structure" to deny vaccines and to falsely blame convenient targets for everything from cancer to autism.

dejj · 2h ago
ryukoposting · 1h ago
I do STEM mentoring for high school kids. Bookmarking this, because it'll be a great teaching aid at some point.
throw0101d · 2h ago
Thought this would be about the 'other' Anscombe:

* https://en.wikipedia.org/wiki/G._E._M._Anscombe

:)