Datasaurus: Never trust summary statistics; visualize your data (2016)

roland35 · on Aug 27, 2020

The datasaurus is an interesting concept, but I think the biggest issue with most of our society is basic mathematical fluency. This idea of "visualizing your data" or "sanity checking" can be expanded to a lot of situations outside of statistics.

In software engineering and electrical engineering this comes up all the time. Does it make sense for this log file to be 5 GB? Is it really 10 amps going into this shunt resistor? Being able to "sanity check" things mentally is an important skill.

supernova87a · on Aug 27, 2020

The dinosaur drawing is a little gimmicky; it leaves you asking "so what do you do with this?" Far better and more applicable to real life is the lesson in Anscombe's quartet: https://en.wikipedia.org/wiki/Anscombe%27s_quartet
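For anyone who wants to check the quartet's claim rather than take it on faith, here is a minimal sketch in plain Python. The numbers are the standard published Anscombe values; the `pearson` helper and the variable names are my own, not from any library.

```python
from statistics import mean

# Anscombe's quartet (standard published values).
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def pearson(xs, ys):
    """Pearson correlation coefficient, computed straight from the definition."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


# One (mean_x, mean_y, r) triple per dataset.
summary = {
    name: (mean(xs), mean(ys), pearson(xs, ys))
    for name, (xs, ys) in quartet.items()
}

for name, (mx, my, r) in summary.items():
    print(f"{name:>3}: mean_x={mx:.2f}  mean_y={my:.2f}  r={r:.3f}")
```

All four rows come out essentially identical (mean x of 9, mean y of about 7.5, r of about 0.82), even though the four scatter plots look nothing alike.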

It's all the more important to have a trained skepticism of stats when every day you hear things like, "we just had the greatest quarter of economic growth in history!"

1wheel · on Aug 27, 2020

There's a dinosaur version of Anscombe's Quartet:

https://www.autodeskresearch.com/publications/samestats

supernova87a · on Aug 27, 2020

oh, very nice!

conjectures · on Aug 27, 2020

> It's all the more important to be trained in stats

There, fixed that for you ;)

nabla9 · on Aug 27, 2020

Eyeballing a 2D plot is an important statistical tool.

If it looks like a blob, the correlation coefficient is usually meaningless. You can instantly see whether the relationship is linear, quadratic, piecewise linear or piecewise quadratic, a mixture, clustered, or even a dinosaur.
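A quick illustration of why the shape matters more than the coefficient, as a sketch with toy numbers of my own choosing: a relationship that is perfectly quadratic, and therefore completely predictable, still has a Pearson correlation of zero. The `pearson` helper is hand-written for the example.

```python
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation coefficient, computed straight from the definition."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


# Toy data: y is fully determined by x, but the relationship is not linear.
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x * x for x in xs]

r = pearson(xs, ys)
print(f"r = {r:.3f}")  # prints "r = 0.000"
```

The symmetric parabola cancels itself out in the covariance sum, so the number says "no relationship" while one glance at the plot says otherwise.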

jdonaldson · on Aug 27, 2020

Except scatter plots do not handle occlusion. If you have a lot of data to plot, the visible blob shape might just be the outliers.

Petefine · on Aug 27, 2020

This might be a problem, but if you're careful to make use of transparency, axis scaling and random jitter (for integer or categorical values), the occlusion issue can be overcome.
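A rough sketch of the jitter part, using only the standard library. The data is made up (random answers on a 1-to-5 scale), and `jitter` is a hypothetical helper written for this example, not a library function.

```python
import random


def jitter(values, width=0.4, seed=0):
    """Add uniform noise in [-width/2, +width/2] to each value."""
    rng = random.Random(seed)
    return [v + rng.uniform(-width / 2, width / 2) for v in values]


# Made-up data: 1,000 pairs of answers on a 1-to-5 integer scale.
rng = random.Random(42)
raw_x = [rng.randint(1, 5) for _ in range(1000)]
raw_y = [rng.randint(1, 5) for _ in range(1000)]

jx = jitter(raw_x, seed=1)
jy = jitter(raw_y, seed=2)

# Without jitter, 1,000 points collapse onto at most 25 grid positions.
print("distinct positions before jitter:", len(set(zip(raw_x, raw_y))))
print("distinct positions after jitter: ", len(set(zip(jx, jy))))
```

The jittered lists can then go to whatever plotting library you use; with matplotlib, for instance, `plt.scatter(jx, jy, alpha=0.2)` adds the transparency so that dense cells show up darker than sparse ones.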

jdonaldson · on Aug 27, 2020

* in some cases

nabla9 · on Aug 28, 2020

Eyeballing detects outliers as well. It can do density estimation.

SubiculumCode · on Aug 27, 2020

My scatters do. Transparency, dodge, etc.

hammock · on Aug 27, 2020

"Never trust summary statistics" is a misnomer. Data is lost in the creation of summary statistics; that's basically a tautology. His example could look like a fuzzy line, or a dino, or a cross, or anything else. These facts don't refute the usefulness of summary statistics, however. At issue here is the competency and good faith of the ANALYST.

If the summary statistics are chosen and presented in a way that is useful, in context, and not misleading, then there is nothing wrong with omitting the full data visualization. If some of these pieces are missing though, then yeah, check the data.

justin_oaks · on Aug 27, 2020

Whenever I see single-number statistics thrown around in news articles, press releases, ads, etc., I'm generally pretty skeptical. I want to see the distribution. Is it Gaussian? Bimodal? Flat? Apparently random?

The shape of the data tells you a lot that isn't captured in a single number like a mean or median.
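As a small sketch of that point, here are two made-up samples (the numbers are purely illustrative) that share the same mean and the same median but have very different shapes. `text_histogram` is a throwaway helper for the example.

```python
from collections import Counter
from statistics import mean, median

# Two made-up samples with identical mean and median.
unimodal = [4, 5, 5, 5, 6]
bimodal = [1, 1, 5, 9, 9]


def text_histogram(data):
    """Render one row of '#' marks per integer value between min and max."""
    counts = Counter(data)
    return "\n".join(
        f"{value:>2} | {'#' * counts[value]}"
        for value in range(min(data), max(data) + 1)
    )


for name, sample in [("unimodal", unimodal), ("bimodal", bimodal)]:
    print(f"{name}: mean={mean(sample)}, median={median(sample)}")
    print(text_histogram(sample))
    print()
```

Both samples report a mean of 5 and a median of 5; only the histogram shows that one is clustered around the middle and the other is split between two extremes.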

Also, what's the sample size? What was the methodology used to acquire the data?

Lots of statistics fall apart when you look at how the data was collected.

wodenokoto · on Aug 27, 2020

Matt Parker (known for his YouTube channel Stand-up Maths and a frequent guest on Numberphile) did a video on it and visited the inventor:

https://youtu.be/iwzzv1biHv8