The entire product I built over the last year can be reduced to basic statistics (e.g., ratios and probabilities), but because of the hype train we build "models" and "predict" certain outcomes over a data set.
One of the products the company I work for sells more or less attempts to find duplicate entries in a large, unclean data set with "machine learning."
The value added isn't in the use of ML techniques itself, it's in the hype train that fills the Valley these days: our customers see "Data Science product" and don't get that it's really basic predictive analytics under the hood. I'm not sure the product would actually sell as well as it does without that labeling.
To clarify: the company I work for actually uses ML. I actually work on the data science team at my company. My opinion is that we don't actually need to do these things, as our products are possible to create without the sophistication of even the basic techniques, but that battle was lost before I joined.
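As an aside, here is roughly how far "basic ratios" get you on the dedup problem. This is a hypothetical sketch, not our product's logic: the records, the `find_duplicates` helper, and the 0.8 threshold are all invented for illustration.

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Plain ratio of matching characters between two normalized strings."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()

def find_duplicates(records, threshold=0.8):
    """Flag index pairs whose similarity ratio clears the threshold."""
    pairs = []
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            if similarity(records[i], records[j]) >= threshold:
                pairs.append((i, j))
    return pairs

records = ["Acme Corp.", "ACME Corp", "Globex Inc", "Initech LLC"]
print(find_duplicates(records))  # only the two Acme entries pair up
```

No model, no training: a similarity ratio and a cutoff.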
It's interesting to me that with all the ML hype, it's still not clear what constitutes ML. A basic k-means or naive Bayes approach will show up in ML textbooks, but those aren't clearly different from "use some statistics to make a prediction".
There's an interesting group of marginal approaches that have existed as-is for years, but have increasingly focused their branding on machine learning as its profile has risen.
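k-means itself makes the point: stripped down to one dimension, on some made-up data, the whole "learning" loop is literally just recomputing group means until they stop moving.

```python
def kmeans_1d(xs, k=2, iters=10):
    """Bare-bones 1-D k-means with a crude init: the first k sorted points."""
    centers = sorted(xs)[:k]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for x in xs:
            nearest = min(range(k), key=lambda c: abs(x - centers[c]))
            groups[nearest].append(x)
        # the entire "update step" is taking a mean per group
        centers = [sum(g) / len(g) if g else centers[i]
                   for i, g in enumerate(groups)]
    return centers

data = [1.0, 1.2, 0.8, 9.9, 10.1, 10.0]
print(kmeans_1d(data))  # centers settle near 1.0 and 10.0
```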
> but those aren't clearly different from "use some statistics to make a prediction"
You can reduce 90% of ML to this. Even neural networks are based on statistics.
If I had to draw a line between statistics and ML, it would be that ML learns: it can predict new things, whereas statistics only gives you information about the data you already have. But for sure statistics and ML overlap a lot.
If you ask me for the most likely new value for a dataset, I won't know. But if I graph a few things and then write a function to spit back the current mean or median, is that machine learning?
I'm not trying to be snarky there; I agree that the bulk of ML tools are fundamentally just statistical tricks with some layer of abstraction. As a result, I have a lot of trouble knowing how much abstraction justifies the ML title. I see some people use "statistics that produce unintuitive solutions" as a standard, but that just begs the question: unintuitive to whom?
I feel like it is foremost a matter of the practitioner's attitude. An applied statistician and a machine learning engineer may deliver exactly the same end product; just the reasoning and assumptions differ. Machine learning makes few to no assumptions, where statisticians rely on them. I also feel that machine learning engineers have a bit less fear of building black boxes.
Caruana captured the difference between a statistician and a machine learning practitioner with a cartoon of a cliff. The statistician carefully inches toward the edge, stomping her feet to check that the ground is still stable; ten meters before the edge she stops and draws her conclusions. The machine learning practitioner dives headfirst off the cliff, with a parachute that reads "cross-validation".
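The parachute itself is not exotic, either. A minimal k-fold loop, here scoring a mean-predictor on invented data (the function and numbers are purely illustrative, not from any library):

```python
def k_fold_score(ys, k=3):
    """Average squared error of a mean-predictor across k held-out folds."""
    n = len(ys)
    fold = n // k
    errors = []
    for i in range(k):
        lo, hi = i * fold, (i + 1) * fold
        train = ys[:lo] + ys[hi:]        # everything outside the held-out fold
        pred = sum(train) / len(train)   # the "model": the training mean
        errors += [(y - pred) ** 2 for y in ys[lo:hi]]
    return sum(errors) / len(errors)

print(k_fold_score([2.0, 2.1, 1.9, 2.0, 2.2, 1.8]))  # small held-out error
```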
> Norvig teamed up with a Stanford statistician to prove that statisticians, data scientists and mathematicians think the same way. They hypothesized that, if they all received the same dataset, worked on it, and came back together, they’d find they all independently used the same techniques. So, they got a very large dataset and shared it between them.
> Norvig used the whole dataset and built a complex predictive model. The statistician took a 1% sample of the dataset, discarded the rest, and showed that the data met certain assumptions.
> The mathematician, believe it or not, didn’t even look at the dataset. Rather, he proved the characteristics of various formulas that could (in theory) be applied to the data.
> Even that doesn't seem like a clear distinction?
Obviously not: ML uses statistics the way statistics uses maths. But not all ML uses statistics; some algorithms are biologically inspired (swarm optimization), and others use information theory for classification.
The point of ML is that you learn something from data, not necessarily with statistics, although statistics is used in a lot of algorithms. Function optimization is also used in a lot of algorithms. The boundaries are very fuzzy, but for sure not all ML uses statistics and not all statistics is ML.
ML and statistics are both subsets of maths. As I said, statistics and ML overlap, and so does function interpolation. But some ML algorithms are based on biological systems (like swarm optimization) or on information theory.
If you have a problem where you want to classify some vectors, there are different ways to do it. We call all of them ML, but some use statistics, others interpolation, others information theory, etc. The model doesn't have to be complex or require a lot of data. Rather than naming all the different techniques, you just say ML.
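Particle swarm optimization is a good concrete case. Here is a hypothetical 1-D sketch minimizing (x - 3)^2, with conventional inertia and attraction constants; there is no probability model of the data anywhere in it.

```python
import random

def pso(f, n_particles=10, iters=100, seed=0):
    """Toy particle swarm: particles drift toward personal and global bests."""
    rng = random.Random(seed)
    xs = [rng.uniform(-10, 10) for _ in range(n_particles)]
    vs = [0.0] * n_particles
    pbest = xs[:]              # each particle's best-seen position
    gbest = min(xs, key=f)     # the swarm's best-seen position
    for _ in range(iters):
        for i in range(n_particles):
            r1, r2 = rng.random(), rng.random()
            vs[i] = (0.5 * vs[i]
                     + 1.5 * r1 * (pbest[i] - xs[i])
                     + 1.5 * r2 * (gbest - xs[i]))
            xs[i] += vs[i]
            if f(xs[i]) < f(pbest[i]):
                pbest[i] = xs[i]
            if f(xs[i]) < f(gbest):
                gbest = xs[i]
    return gbest

print(pso(lambda x: (x - 3) ** 2))  # settles near 3
```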
A lot of it went over my head because I don't know much classical statistics, but I've read some articles by stats people that basically boiled down to the distinction being not in the techniques but in common assumptions, rigor, culture, etc.
The best distinction I've found between ML and statistics is the following.
Statistics is about modelling the underlying probability distribution that generates your data. A convergence/generalization/etc result will usually be dependent on this underlying distribution.
ML is when you don't care much about the underlying distribution (modulo regularity assumptions), and your model doesn't even come from the same family at all.
E.g., linear regression is usually statistics, because you often believe the underlying data looks like f(x) ≈ f(x0) + f'(x0)(x - x0). Random forests are machine learning because you don't actually think the real world secretly has a random forest floating around.
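To make the contrast concrete, a toy sketch with invented data: the least-squares line commits to a linear family, while a one-split "stump" commits to no functional form at all. Both are fit models; neither is "the truth".

```python
def fit_line(xs, ys):
    """Closed-form 1-D least squares: assumes the world is roughly linear."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    intercept = my - slope * mx
    return lambda x: intercept + slope * x

def fit_stump(xs, ys):
    """One-split decision stump: predict the mean of each side of the split."""
    best = None
    for t in xs:
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        if not left or not right:
            continue
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((y - ml) ** 2 for y in left)
               + sum((y - mr) ** 2 for y in right))
        if best is None or sse < best[0]:
            best = (sse, t, ml, mr)
    _, t, ml, mr = best
    return lambda x: ml if x <= t else mr

xs = [1, 2, 3, 4, 5, 6]
ys = [2, 4, 6, 8, 10, 12]     # perfectly linear: y = 2x
line = fit_line(xs, ys)
stump = fit_stump(xs, ys)
print(line(3.5), stump(3.5))  # the line nails it; the stump only approximates
```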
> A basic k-means or naive Bayes approach will show up in ML textbooks, but those aren't clearly different from "use some statistics to make a prediction".
Ha, sounds like a classification problem! Let's use ML to find the boundary.
"A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E." - Tom Mitchell
Actually, there's a very clear definition of what types of problems ML ought to be used for, and that category of problem is what defines it. Those familiar with regression (and stats in general) ought to be familiar with it already: it's a matter of the datatypes of the independent and dependent variables.
In brief, you're going to run up against two types of data: categorical and continuous. (There are facets to this, e.g. ordinal, but these are really the elemental types of data.) The relationship of datatype to independent/dependent variable is what determines what kind of analysis you may conduct.
Categorical Independent vs. Categorical Dependent, for example, is fairly restrictive, as makes logical sense. You may cross-tabulate, you may score likelihood based on previous observation, but obviously, because all of the data involved are non-numeric, there's no chance for regression, ANOVA, etc. Linear Regression is used when both independent and dependent variables are continuous, and cross-category differencing techniques like ANOVA may be used when the independent is categorical and the dependent is continuous.
The part you don't typically learn until grad school is what to do when the independent is continuous and the dependent is categorical, i.e., in ML terms, a classification problem. The standard statistical methods used as the foundation for these problems are logistic regression and logit/probit models. It's the expansion of these methods that led to ML in the first place.
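A from-scratch sketch of that case, on invented data: continuous independent (say, hours studied), categorical dependent (pass/fail), fit with plain gradient-ascent logistic regression.

```python
import math

def train_logistic(xs, ys, lr=0.1, epochs=2000):
    """Fit 1-D logistic regression by stochastic gradient ascent."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            p = 1 / (1 + math.exp(-(w * x + b)))  # sigmoid probability
            w += lr * (y - p) * x                 # log-likelihood gradient
            b += lr * (y - p)
    return lambda x: 1 if 1 / (1 + math.exp(-(w * x + b))) >= 0.5 else 0

xs = [0.5, 1.0, 1.5, 3.5, 4.0, 4.5]  # continuous independent variable
ys = [0, 0, 0, 1, 1, 1]              # categorical dependent variable
classify = train_logistic(xs, ys)
print([classify(x) for x in xs])     # separable data: all six correct
```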
If I'm reading this correctly, it's just wrong. Whatever the distinction between data analysis and ML might be, it is more than just whether your data and predicted quantities are discrete or continuous.
> Categorical Independent vs. Categorical Dependent, for example, is fairly restrictive, as makes logical sense. You may cross-tabulate, you may score likelihood based on previous observation, but obviously, because all of the data involved are non-numeric, there's no chance for regression, ANOVA, etc
If you are implying that categorical -> categorical predictions are not ML: as a counter example, natural language is a categorical (words) input that could be used to predict any number of categorical variables (parse trees, semantic categories, etc). I think it's safe to say that the field of NLP is doing machine learning.
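For instance, a toy word-count naive Bayes (corpus invented) maps a purely categorical input, words, to a categorical label:

```python
from collections import Counter, defaultdict
import math

def train_nb(docs):
    """docs: list of (list-of-words, label) pairs. Returns a classifier."""
    label_counts = Counter(label for _, label in docs)
    word_counts = defaultdict(Counter)
    vocab = set()
    for words, label in docs:
        word_counts[label].update(words)
        vocab.update(words)

    def classify(words):
        best, best_score = None, float("-inf")
        for label, lc in label_counts.items():
            total = sum(word_counts[label].values())
            score = math.log(lc / len(docs))  # log prior
            for w in words:                   # add-one smoothed likelihoods
                score += math.log((word_counts[label][w] + 1)
                                  / (total + len(vocab)))
            if score > best_score:
                best, best_score = label, score
        return best
    return classify

docs = [(["great", "film"], "pos"), (["loved", "it"], "pos"),
        (["terrible", "film"], "neg"), (["hated", "it"], "neg")]
classify = train_nb(docs)
print(classify(["great"]))  # classifies as "pos"
```

Nothing in the input is numeric until the model counts it.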
Thanks for the sanity check. I read that reply, and got bogged down enough that I was worried my initial reaction of "what, that's not relevant!" was born of ignorance. Discrete/continuous is a distinction worth making, but as a hidden 'definition' for ML I really don't understand it.
> The value added isn't in the use of ML techniques itself, it's in the hype train that fills the Valley these days: our customers see "Data Science product" and don't get that it's really basic predictive analytics under the hood. I'm not sure the product would actually sell as well as it does without that labeling.
So you are misleading your customers through omission? This is the kind of thing that makes people question anyone stating they are using ML. Those of us actually implementing ML techniques (aka training neural nets and automating processes with data) are met with unnecessary skepticism as a result.
edit: OP clarified his position since this post so take that into account when reading.
No, we actually use ML. We just don't need to, in my opinion, because the problems our products solve are more or less solvable without these techniques.
My point was that using ML, even though we don't need to, "adds value" by virtue of the hype train. We need ML to sell products, not to create them.
I do agree that this sort of arrangement lends itself to supporting skepticism around AI and ML. On the other hand I don't think that's a bad thing.
Got it. Thanks for the clarification. It is true that people are using ML where other, simpler options are available, but I wouldn't immediately discount the value of using nets for your problem. I don't know enough about your problem/implementation to speak to it really.
Yes, thanks for highlighting the deficiency in my original post. I can see it is easily interpreted the way you did. I added a clarification (or what I hope is one).