First of all, I’m not a data scientist. I am a statistician at an internship for a company that makes machine learning software. I’m still a student, and I will be for the foreseeable future.

My route to sunny San Diego and Salford Systems runs through Blacksburg, Virginia, and the Conference on Statistical Practice. I presented a poster on practical knowledge gained while working in Virginia Tech’s collaborative statistical laboratory (producing acceptable results from statistical collaboration), but saw many other posters and short lectures over the two days I stayed. Between the talks, I embarked on a mission to collect as much swag as possible, as is customary at every conference and convention. With my pockets (and my complimentary CSP bag) full of flash drives and pens, key chains and other miscellanea, I came to the booth for Salford Systems and struck up a conversation with David Tolliver. One demo, one short interview, and many conversations later, I got a job.

There were many vendor booths at CSP. I talked with RStudio and the Census Bureau before arriving at Salford’s booth, so why did I want to work for this company specifically? TreeNet. I had worked with a free version of the algorithm and compared its performance in a classification setting to a second-order logistic regression. I found it to be a more nuanced model, which substantially increased performance. I was excited to work on implementing the algorithm. In conversations with the CEO, we talked about innovations and comparative projects in the works, and my statistical interest was piqued.

I was a bit disappointed when I arrived. I haven’t worked with the algorithms at all. I’ve barely done any analysis. Mostly, I gather data because my web scraping skills open up new options, and software companies need data to demo their product. My statistical interest has…waned.

Maybe it’s just my professors speaking, but it is a bit aggravating to hear a few models trumpeted as a panacea. To put my frustration into more statistical terms, around Salford, the prevailing opinion is that TreeNet, Random Forest, and MARS make every other modelling technique inadmissible. Want to update your regression techniques? Use these methods for literally everything. Time series? Spatial models? Experimental data? Observational studies? Plunk it right in there. It’s all the same.

Since I’m sure the last paragraph sounds overly critical, let me be clear before I give anyone the wrong idea. If I didn’t think these algorithms were amazingly powerful, and profoundly useful, I wouldn’t have been interested in working here. I think the advantage of Salford’s algorithms is not building the best model, but building the best model that can be realistically found.

Any parametric (read “traditional”) model requires you to make some assumptions about the relationship between predictors and your response. When these assumptions are correct they greatly improve the model, but an unreasonable assumption can severely handicap you from the start. Consider the assumption of linearity. If the relationship is linear, or reasonably linear, you get very good results, but sometimes that assumption can indicate there is no relationship, when the relationship is very strong. Below, we can see the results of the strictly first order linear assumption, when the relationship is very strong, but second order.

A perfectly flat line clearly does not represent this data well. In this case (univariate, and easy to plot), it may be trivial to find the true relationship by trying a polynomial model. Still, the data set is large enough that a machine learning model can obtain an almost perfect fit without any discovery process on the part of the scientist.
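To make the flat-line result concrete, here is a minimal sketch in plain Python. The data-generating process is an assumption on my part (y = x squared plus Gaussian noise, 201 evenly spaced points on [-3, 3], fixed seed), not the simulation behind the original plot, and the helper names `fit_line` and `r_squared` are made up for illustration.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b

def r_squared(ys, preds):
    """Share of the variance in ys explained by preds."""
    my = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, preds))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

# Assumed simulation: a strong but purely second-order relationship.
rng = random.Random(0)
xs = [-3 + 6 * i / 200 for i in range(201)]      # symmetric around zero
ys = [x ** 2 + rng.gauss(0, 0.5) for x in xs]

# Strictly first-order fit: y on x.
a, b = fit_line(xs, ys)
r2_linear = r_squared(ys, [a + b * x for x in xs])

# Same routine, but regressing y on x squared.
zs = [x ** 2 for x in xs]
a2, b2 = fit_line(zs, ys)
r2_quadratic = r_squared(ys, [a2 + b2 * z for z in zs])

print(f"slope of linear fit: {b:.3f}")
print(f"R^2 linear:          {r2_linear:.3f}")
print(f"R^2 quadratic:       {r2_quadratic:.3f}")
```

Because the x values are symmetric around zero, x and x squared are uncorrelated, so the first-order slope lands near zero and the linear model explains almost none of the variance, even though the underlying relationship is close to deterministic.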

Immediately the obvious pattern in the residuals disappears. The technique (TreeNet) has drastically improved the model. However, since I simulated the data, I know the best model would be a perfect curve. Every little dip and corner indicates a small overfit by the model.
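For readers without access to TreeNet, the behavior can be roughly imitated with a generic gradient boosting sketch in plain Python. To be clear about the assumptions: this is not Salford's implementation, only the textbook idea of repeatedly fitting one-split trees (stumps) to the current residuals under squared-error loss. The simulated data, the 300 rounds, the 0.1 learning rate, and all function names are my own choices for illustration.

```python
import random

def fit_stump(xs, residuals):
    """Best one-split regression tree by least squares.
    Assumes xs is sorted and strictly increasing."""
    n = len(xs)
    total = sum(residuals)
    left_sum = 0.0
    best = None
    for i in range(1, n):                      # first i points go left
        left_sum += residuals[i - 1]
        right_sum = total - left_sum
        # Maximizing this score is equivalent to minimizing squared error.
        score = left_sum ** 2 / i + right_sum ** 2 / (n - i)
        if best is None or score > best[0]:
            threshold = (xs[i - 1] + xs[i]) / 2
            best = (score, threshold, left_sum / i, right_sum / (n - i))
    return best[1:]

def boost(xs, ys, n_rounds=300, learning_rate=0.1):
    """Gradient boosting for squared error: each stump fits the residuals."""
    base = sum(ys) / len(ys)
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        t, left, right = fit_stump(xs, residuals)
        stumps.append((t, left, right))
        preds = [p + learning_rate * (left if x <= t else right)
                 for x, p in zip(xs, preds)]
    return base, learning_rate, stumps

def predict(model, x):
    base, learning_rate, stumps = model
    out = base
    for t, left, right in stumps:
        out += learning_rate * (left if x <= t else right)
    return out

# Assumed simulation: second-order signal plus noise, sorted x values.
rng = random.Random(0)
xs = [-3 + 6 * i / 200 for i in range(201)]
ys = [x ** 2 + rng.gauss(0, 0.5) for x in xs]

model = boost(xs, ys)
preds = [predict(model, x) for x in xs]
mean_y = sum(ys) / len(ys)
r2 = 1 - (sum((y - p) ** 2 for y, p in zip(ys, preds))
          / sum((y - mean_y) ** 2 for y in ys))
print(f"R^2 of boosted stumps: {r2:.3f}")
```

The prediction is a sum of step functions, so it is piecewise constant: it can follow the curve closely without being told the functional form, but it can never be a perfectly smooth curve. That is one plausible source of the small dips and corners described above.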

It is easy to see that there is a better model, and in fact it is easily obtainable, but in realistic (or just more complicated) cases, the “best model” may not be anything quickly discoverable. Consider the simulation below, which has a functional form I designed, but one that is not easily recognizable. The strict linear model does not perform well, and even a fifth-order polynomial (in the spirit of a Taylor series) fails to improve the performance satisfactorily (it might be a little worse).

The same model that produced the slightly overfit curve above does an almost ideal job of fitting this abstract function. We can see the same telltale deviations in the curves, and the model still underfits the middle dip, but overall this is a deployable model, as opposed to one that is correct on average, and wrong pretty much everywhere.

If you discover the actual way these variables are related, you will get a better model, but you could try a thousand combinations and get nothing comparable to what TreeNet provides. I simulated the data, so I know the actual form, but I forgot the exact terms, and accidentally fit something much worse.

This is a relatively simple example. There’s only one variable, which eliminates concerns about interactions and multicollinearity. There’s no need to discover the functional form of multiple variables simultaneously, or explore severe nonlinearity (the true model is still “linear”), and the data can easily be plotted with predictions for a quick visual inspection of the results. In any real-world scenario, the task of finding the perfect function to fit the data would be much harder.

So, does machine learning provide the best possible model for every scenario? No. Putting aside concerns over the existence of a describable (however complex) functional form for every relationship, we can at least say that in physical models there is a formula that will provide a closer fit than that yielded by machine learning. You might spend years and years finding it. By the time you get the best possible fit, it might not matter anymore (it doesn’t help much to figure out exactly what makes a customer risky if your bank has already lost all of its money), but there might be a perfect model for your data.

The “best model” might exist, but it’s not always attainable. Getting a close model quickly is infinitely better than having no viable model at all.

Maybe next time I’ll discuss why predictive accuracy is not the only statistically relevant task.