The entire product I built over the last year can be reduced to basic statistics (e.g., ratios and probabilities), but because of the hype train we build "models" and "predict" certain outcomes over a data set.
One of the products the company I work for sells more or less attempts to find duplicate entries in a large, unclean data set with "machine learning."
The value added isn't in the use of ML techniques itself, it's in the hype train that fills the Valley these days: our customers see "Data Science product" and don't get that it's really basic predictive analytics under the hood. I'm not sure the product would actually sell as well as it does without that labeling.
To clarify: the company I work for actually uses ML. I actually work on the data science team at my company. My opinion is that we don't actually need to do these things, as our products are possible to create without the sophistication of even the basic techniques, but that battle was lost before I joined.
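As an aside, here is roughly how far "basic ratios" get you on the dedup problem. This is a hypothetical sketch, not our product's logic: the records, the `find_duplicates` helper, and the 0.8 threshold are all invented for illustration.

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Plain ratio of matching characters between two normalized strings."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()

def find_duplicates(records, threshold=0.8):
    """Flag index pairs whose similarity ratio clears the threshold."""
    pairs = []
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            if similarity(records[i], records[j]) >= threshold:
                pairs.append((i, j))
    return pairs

records = ["Acme Corp.", "ACME Corp", "Globex Inc", "Initech LLC"]
print(find_duplicates(records))  # only the two Acme entries pair up
```

No model, no training: a similarity ratio and a cutoff.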
It's interesting to me that with all the ML hype, it's still not clear what constitutes ML. A basic k-means or naive Bayes approach will show up in ML textbooks, but those aren't clearly different from "use some statistics to make a prediction".
There's an interesting group of marginal approaches that have existed as-is for years, but have increasingly focused their branding on machine learning as its profile has risen.
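k-means itself makes the point: stripped down to one dimension, on some made-up data, the whole "learning" loop is literally just recomputing group means until they stop moving.

```python
def kmeans_1d(xs, k=2, iters=10):
    """Bare-bones 1-D k-means with a crude init: the first k sorted points."""
    centers = sorted(xs)[:k]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for x in xs:
            nearest = min(range(k), key=lambda c: abs(x - centers[c]))
            groups[nearest].append(x)
        # the entire "update step" is taking a mean per group
        centers = [sum(g) / len(g) if g else centers[i]
                   for i, g in enumerate(groups)]
    return centers

data = [1.0, 1.2, 0.8, 9.9, 10.1, 10.0]
print(kmeans_1d(data))  # centers settle near 1.0 and 10.0
```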
> but those aren't clearly different from "use some statistics to make a prediction"
You can reduce 90% of ML to this. Even neural networks are based on statistics.
If I had to draw a line between statistics and ML, it would be that ML learns: it can predict new things, whereas statistics only gives you information about the data you already have. But for sure statistics and ML overlap a lot.
If you ask me for the most likely new value for a dataset, I won't know. But if I graph a few things and then write a function to spit back the current mean or median, is that machine learning?
I'm not trying to be snarky there; I agree that the bulk of ML tools are fundamentally just statistical tricks with some layer of abstraction. As a result, I have a lot of trouble knowing how much abstraction justifies the ML title. I see some people use "statistics that produce unintuitive solutions" as a standard, but that just begs the question: unintuitive to whom?
I feel like it is foremost a matter of the practitioner's attitude. An applied statistician and a machine learning engineer may deliver exactly the same end product; just the reasoning and assumptions differ. Machine learning makes few to no assumptions, where statisticians rely on them. I also feel that machine learning engineers have a bit less fear of building black boxes.
Caruana captured the difference between a statistician and a machine learning practitioner with a cartoon of a cliff. The statistician carefully inches toward the edge, stomping her feet to check that the ground is still stable; ten meters before the edge she stops and draws her conclusions. The machine learning practitioner dives headfirst off the cliff, with a parachute that reads "cross-validation".
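The parachute itself is not exotic, either. A minimal k-fold loop, here scoring a mean-predictor on invented data (the function and numbers are purely illustrative, not from any library):

```python
def k_fold_score(ys, k=3):
    """Average squared error of a mean-predictor across k held-out folds."""
    n = len(ys)
    fold = n // k
    errors = []
    for i in range(k):
        lo, hi = i * fold, (i + 1) * fold
        train = ys[:lo] + ys[hi:]        # everything outside the held-out fold
        pred = sum(train) / len(train)   # the "model": the training mean
        errors += [(y - pred) ** 2 for y in ys[lo:hi]]
    return sum(errors) / len(errors)

print(k_fold_score([2.0, 2.1, 1.9, 2.0, 2.2, 1.8]))  # small held-out error
```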
> Norvig teamed up with a Stanford statistician to prove that statisticians, data scientists and mathematicians think the same way. They hypothesized that, if they all received the same dataset, worked on it, and came back together, they’d find they all independently used the same techniques. So, they got a very large dataset and shared it between them.
> Norvig used the whole dataset and built a complex predictive model. The statistician took a 1% sample of the dataset, discarded the rest, and showed that the data met certain assumptions.
> The mathematician, believe it or not, didn’t even look at the dataset. Rather, he proved the characteristics of various formulas that could (in theory) be applied to the data.
> Even that doesn't seem like a clear distinction?
Obviously not: ML uses statistics the way statistics uses maths. But not all ML uses statistics; some algorithms are biologically inspired (swarm optimization), and others use information theory for classification.
The point of ML is that you learn something from data, not necessarily with statistics, although statistics is used in a lot of algorithms. Function optimization is also used in a lot of algorithms. The boundaries are very fuzzy, but for sure not all ML uses statistics and not all statistics is ML.
ML and statistics are both subsets of maths. As I said, statistics and ML overlap, and so does function interpolation. But some ML algorithms are based on biological systems (like swarm optimization) or on information theory.
If you have a problem where you want to classify some vectors, there are different ways to do it. We call all of them ML, but some use statistics, others interpolation, others information theory, etc. The model doesn't have to be complex or require a lot of data. Rather than naming all the different techniques, you just say ML.
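Particle swarm optimization is a good concrete case. Here is a hypothetical 1-D sketch minimizing (x - 3)^2, with conventional inertia and attraction constants; there is no probability model of the data anywhere in it.

```python
import random

def pso(f, n_particles=10, iters=100, seed=0):
    """Toy particle swarm: particles drift toward personal and global bests."""
    rng = random.Random(seed)
    xs = [rng.uniform(-10, 10) for _ in range(n_particles)]
    vs = [0.0] * n_particles
    pbest = xs[:]              # each particle's best-seen position
    gbest = min(xs, key=f)     # the swarm's best-seen position
    for _ in range(iters):
        for i in range(n_particles):
            r1, r2 = rng.random(), rng.random()
            vs[i] = (0.5 * vs[i]
                     + 1.5 * r1 * (pbest[i] - xs[i])
                     + 1.5 * r2 * (gbest - xs[i]))
            xs[i] += vs[i]
            if f(xs[i]) < f(pbest[i]):
                pbest[i] = xs[i]
            if f(xs[i]) < f(gbest):
                gbest = xs[i]
    return gbest

print(pso(lambda x: (x - 3) ** 2))  # settles near 3
```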
A lot of it went over my head because I don't know much classical statistics, but I've read some articles by stats people that basically boiled down to the distinction being not in the techniques but in common assumptions, rigor, culture, etc.
The best distinction I've found between ML and statistics is the following.
Statistics is about modelling the underlying probability distribution that generates your data. A convergence/generalization/etc result will usually be dependent on this underlying distribution.
ML is when you don't care much about the underlying distribution (modulo regularity assumptions), and your model doesn't even come from the same family at all.
E.g., linear regression is usually statistics, because you often believe the underlying data looks like f(x) ≈ f(x0) + f'(x0)(x - x0). Random forests are machine learning because you don't actually think the real world secretly has a random forest floating around.
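To make the contrast concrete, a toy sketch with invented data: the least-squares line commits to a linear family, while a one-split "stump" commits to no functional form at all. Both are fit models; neither is "the truth".

```python
def fit_line(xs, ys):
    """Closed-form 1-D least squares: assumes the world is roughly linear."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    intercept = my - slope * mx
    return lambda x: intercept + slope * x

def fit_stump(xs, ys):
    """One-split decision stump: predict the mean of each side of the split."""
    best = None
    for t in xs:
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        if not left or not right:
            continue
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((y - ml) ** 2 for y in left)
               + sum((y - mr) ** 2 for y in right))
        if best is None or sse < best[0]:
            best = (sse, t, ml, mr)
    _, t, ml, mr = best
    return lambda x: ml if x <= t else mr

xs = [1, 2, 3, 4, 5, 6]
ys = [2, 4, 6, 8, 10, 12]     # perfectly linear: y = 2x
line = fit_line(xs, ys)
stump = fit_stump(xs, ys)
print(line(3.5), stump(3.5))  # the line nails it; the stump only approximates
```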
> A basic k-means or naive Bayes approach will show up in ML textbooks, but those aren't clearly different from "use some statistics to make a prediction".
Ha, sounds like a classification problem! Let's use ML to find the boundary.
"A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E." - Tom Mitchell
Actually, there's a very clear definition of what types of problems ML ought to be used for, and that category of problem is what defines it. Those familiar with regression (and stats in general) ought to be familiar with it already: it's a matter of the datatypes of the independent and dependent variables.
In brief, you're going to run up against two types of data: categorical and continuous. (There are facets to this, e.g. ordinal, but these are really the elemental types of data.) The relationship of datatype to independent/dependent variable is what determines what kind of analysis you may conduct.
Categorical Independent vs. Categorical Dependent, for example, is fairly restrictive, as makes logical sense. You may cross-tabulate, you may score likelihood based on previous observation, but obviously, because all of the data involved are non-numeric, there's no chance for regression, ANOVA, etc. Linear Regression is used when both independent and dependent variables are continuous, and cross-category differencing techniques like ANOVA may be used when the independent is categorical and the dependent is continuous.
The part you don't typically learn until grad school is what to do when the independent is continuous and the dependent is categorical, i.e., in ML terms, a classification problem. The standard statistical methods used as the foundation for these problems are logistic regression and logit/probit models. It's the expansion of these methods that led to ML in the first place.
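A from-scratch sketch of that case, on invented data: continuous independent (say, hours studied), categorical dependent (pass/fail), fit with plain gradient-ascent logistic regression.

```python
import math

def train_logistic(xs, ys, lr=0.1, epochs=2000):
    """Fit 1-D logistic regression by stochastic gradient ascent."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            p = 1 / (1 + math.exp(-(w * x + b)))  # sigmoid probability
            w += lr * (y - p) * x                 # log-likelihood gradient
            b += lr * (y - p)
    return lambda x: 1 if 1 / (1 + math.exp(-(w * x + b))) >= 0.5 else 0

xs = [0.5, 1.0, 1.5, 3.5, 4.0, 4.5]  # continuous independent variable
ys = [0, 0, 0, 1, 1, 1]              # categorical dependent variable
classify = train_logistic(xs, ys)
print([classify(x) for x in xs])     # separable data: all six correct
```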
If I'm reading this correctly, it's just wrong. Whatever the distinction between data analysis and ML might be, it is more than just whether your data and predicted quantities are discrete or continuous.
> Categorical Independent vs. Categorical Dependent, for example, is fairly restrictive, as makes logical sense. You may cross-tabulate, you may score likelihood based on previous observation, but obviously, because all of the data involved are non-numeric, there's no chance for regression, ANOVA, etc
If you are implying that categorical -> categorical predictions are not ML: as a counter example, natural language is a categorical (words) input that could be used to predict any number of categorical variables (parse trees, semantic categories, etc). I think it's safe to say that the field of NLP is doing machine learning.
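For instance, a toy word-count naive Bayes (corpus invented) maps a purely categorical input, words, to a categorical label:

```python
from collections import Counter, defaultdict
import math

def train_nb(docs):
    """docs: list of (list-of-words, label) pairs. Returns a classifier."""
    label_counts = Counter(label for _, label in docs)
    word_counts = defaultdict(Counter)
    vocab = set()
    for words, label in docs:
        word_counts[label].update(words)
        vocab.update(words)

    def classify(words):
        best, best_score = None, float("-inf")
        for label, lc in label_counts.items():
            total = sum(word_counts[label].values())
            score = math.log(lc / len(docs))  # log prior
            for w in words:                   # add-one smoothed likelihoods
                score += math.log((word_counts[label][w] + 1)
                                  / (total + len(vocab)))
            if score > best_score:
                best, best_score = label, score
        return best
    return classify

docs = [(["great", "film"], "pos"), (["loved", "it"], "pos"),
        (["terrible", "film"], "neg"), (["hated", "it"], "neg")]
classify = train_nb(docs)
print(classify(["great"]))  # classifies as "pos"
```

Nothing in the input is numeric until the model counts it.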
Thanks for the sanity check. I read that reply, and got bogged down enough that I was worried my initial reaction of "what, that's not relevant!" was born of ignorance. Discrete/continuous is a distinction worth making, but as a hidden 'definition' for ML I really don't understand it.
> The value added isn't in the use of ML techniques itself, it's in the hype train that fills the Valley these days: our customers see "Data Science product" and don't get that it's really basic predictive analytics under the hood. I'm not sure the product would actually sell as well as it does without that labeling.
So you are misleading your customers through omission? This is the kind of thing that makes people question anyone stating they are using ML. Those of us actually implementing ML techniques (aka training neural nets and automating processes with data) are met with unnecessary skepticism as a result.
edit: OP clarified his position since this post so take that into account when reading.
No, we actually use ML. We just don't need to, in my opinion, because the problems our products solve are more or less solvable without these techniques.
My point was that using ML, even though we don't need to, "adds value" by virtue of the hype train. We need ML to sell products, not to create them.
I do agree that this sort of arrangement lends itself to supporting skepticism around AI and ML. On the other hand I don't think that's a bad thing.
Got it. Thanks for the clarification. It is true that people are using ML where other, simpler options are available, but I wouldn't immediately discount the value of using nets for your problem. I don't know enough about your problem/implementation to speak to it really.
Yes, thanks for highlighting the deficiency in my original post. I can see it is easily interpreted the way you did. I added a clarification (or what I hope is one).