Datasaurus: Never trust summary statistics; visualize your data (2016) (thefunctionalart.com)
58 points by tosh on Aug 27, 2020 | 14 comments


The datasaurus is an interesting concept, but I think the biggest issue with most of our society is basic mathematical fluency. This idea of "visualizing your data" or "sanity checking" can be expanded to a lot of situations outside of statistics.

In software engineering and electrical engineering this comes up all the time. Does it make sense for this log file to be 5 GB? Is it really 10 amps going into this shunt resistor? Being able to "sanity check" things mentally is an important skill.
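A rough sketch of that kind of check, written out instead of done in your head. The 0.1 ohm shunt, the 200 bytes per log line, and the one-day window are made-up numbers for illustration, not taken from any real system.

```python
# Back-of-envelope sanity checks. All inputs are illustrative assumptions.

def shunt_power_watts(current_amps: float, resistance_ohms: float) -> float:
    """Power dissipated in a resistor: P = I^2 * R."""
    return current_amps ** 2 * resistance_ohms


def log_lines_per_second(file_bytes: float, bytes_per_line: float, seconds: float) -> float:
    """Average line rate implied by a log file of a given size."""
    return file_bytes / bytes_per_line / seconds


# Hypothetical 0.1 ohm shunt carrying the claimed 10 A.
power = shunt_power_watts(10, 0.1)

# Hypothetical 5 GB log, roughly 200 bytes per line, written over one day.
rate = log_lines_per_second(5e9, 200, 24 * 60 * 60)

print(f"shunt dissipates about {power:.0f} W")
print(f"log implies about {rate:.0f} lines per second")
```

If the shunt is a small surface-mount part, 10 W means either the reading is wrong or the part is in trouble. Whether a few hundred lines per second all day is plausible depends on the service, but at least you now have a number to compare against.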


The dinosaur drawing is a little gimmicky and invites a "so what do you do with this?" reaction. Far better and more applicable to real life is the lesson in Anscombe's quartet: https://en.wikipedia.org/wiki/Anscombe%27s_quartet
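For anyone who wants to check the quartet themselves, here is a minimal sketch in plain Python. The data values are the published Anscombe (1973) numbers as I have them; `pearson_r` is a helper written for this example, not a library function.

```python
from math import sqrt
from statistics import mean

# Anscombe's quartet. Sets I to III share the same x values.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# (mean x, mean y, correlation) for each of the four sets.
summary = {
    name: (mean(xs), mean(ys), pearson_r(xs, ys))
    for name, (xs, ys) in quartet.items()
}

for name, (mx, my, r) in summary.items():
    print(f"{name:>3}: mean x = {mx:.2f}, mean y = {my:.2f}, r = {r:.3f}")
```

The four rows come out nearly identical, which is the whole point: you have to plot the sets to see that one is a noisy line, one a curve, one a line with an outlier, and one a vertical stack plus a single far-away point.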

It's all the more important to have a trained skepticism of stats when every day you hear things like, "we just had the greatest quarter of economic growth in history!"


There's a dinosaur version of Anscombe's Quartet:

https://www.autodeskresearch.com/publications/samestats


oh, very nice!


> It's all the more important to be trained in stats

There, fixed that for you ;)


Eyeballing a 2D plot is an important statistical tool.

If it looks like a blob, the correlation coefficient is usually meaningless. You can instantly see whether the relationship is linear, quadratic, piecewise linear or quadratic, a mixture, clustered, or even a dinosaur.
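A small sketch of why the plot matters even when it is not a blob: a perfectly deterministic quadratic relationship can still have a Pearson correlation of zero. The data here are made up (y = x^2 on symmetric integer x), and `pearson_r` is a helper written for this example.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# y is completely determined by x, but the relationship is not linear.
xs = list(range(-5, 6))
ys = [x * x for x in xs]

r = pearson_r(xs, ys)
print(f"r = {r:.3f}")
```

The coefficient only measures linear association, so a glance at the parabola tells you more than the number does.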


Except scatter plots do not handle occlusion. If you have a lot of data to plot, the blob shape you see might be drawn mostly by outliers, with the dense core hidden by overplotting.


This might be a problem, but if you're careful to make use of transparency, axis scaling and random jitter (for integer or categorical values), the occlusion issue can be overcome.
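A minimal sketch of the jitter part, using only the standard library. The dice-like data, the 0.8 jitter width, and the `jitter` helper are all assumptions made for illustration.

```python
import random


def jitter(values, width, rng):
    """Add uniform noise in [-width/2, +width/2] to each value."""
    return [v + rng.uniform(-width / 2, width / 2) for v in values]


rng = random.Random(0)  # fixed seed so the sketch is repeatable

# 5,000 points on integer coordinates 1..6 (think two dice):
# they can occupy at most 36 distinct positions on the plot.
xs = [rng.randint(1, 6) for _ in range(5000)]
ys = [rng.randint(1, 6) for _ in range(5000)]
distinct_before = len(set(zip(xs, ys)))

# Spread each point within its own cell so stacked points become visible.
jx = jitter(xs, 0.8, rng)
jy = jitter(ys, 0.8, rng)
distinct_after = len(set(zip(jx, jy)))

print(distinct_before, "distinct positions before jitter")
print(distinct_after, "distinct positions after jitter")
```

You would then plot `jx` against `jy` with a low opacity (for example the `alpha` argument of matplotlib's `scatter`), so that dense cells look darker than sparse ones. Keeping the width below 1 stops neighbouring integer values from bleeding into each other.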


* in some cases


Eyeballing detects outliers as well. It can do density estimation.


my scatters do. Transparency, dodge, etc.


"Never trust summary statistics" is a misnomer. Data is lost in the creation of summary statistics; that's basically a tautology. His example could look like a fuzzy line, or a dino, or a cross, or anything else. These facts don't refute the usefulness of summary statistics, however. At issue here is the competency and good faith of the ANALYST.

If the summary statistics are chosen and presented in a way that is useful, in context, and not misleading, then there is nothing wrong with omitting the full data visualization. If some of these pieces are missing though, then yeah, check the data.


Whenever I see single number statistics thrown around in news articles, press releases, ads, etc., I'm generally pretty skeptical. I want to see the distribution. Is it Gaussian? Bimodal? Flat? Apparently random?

The shape of the data tells you a lot that isn't captured in a single number like a mean or median.
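As a toy illustration, here are two made-up samples with the same mean and the same median but very different shapes. The numbers and the `text_histogram` helper are invented for this sketch.

```python
from collections import Counter
from statistics import mean, median

# Two invented samples: one clustered around 50, one split between 10 and 90.
unimodal = [40, 45, 50, 50, 50, 50, 50, 55, 60]
bimodal = [10, 10, 10, 10, 50, 90, 90, 90, 90]


def text_histogram(values, bin_width=20):
    """Return one text line per non-empty bin, with one '#' per value."""
    bins = Counter((v // bin_width) * bin_width for v in values)
    return [
        f"{lo:3d}-{lo + bin_width - 1:3d} | {'#' * bins[lo]}"
        for lo in sorted(bins)
    ]


for name, sample in [("unimodal", unimodal), ("bimodal", bimodal)]:
    print(f"{name}: mean = {mean(sample)}, median = {median(sample)}")
    print("\n".join(text_histogram(sample)))
```

Both samples report a mean and median of 50, yet in the second one almost nobody is anywhere near 50.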

Also, what's the sample size? What was the methodology used to acquire the data?

Lots of statistics fall apart when you look at how the data was collected.


Matt Parker (famous for his YouTube channel Stand-up Maths and a frequent guest on Numberphile) did a video on it and visited the inventor:

https://youtu.be/iwzzv1biHv8



