During recent travel for work and vacation, I faced the challenge of describing my job to a lot of different people in many different settings. As I mentioned in recent posts, the opportunity to describe my work always provides valuable practice for communicating and understanding important aspects of my job. This practice has become a lot more difficult since I started working in data science. Data science is esoteric and at the same time vague. I noticed a new degree of difficulty in avoiding technical jargon and details while at the same time grounding abstract discussions of machine learning and “big data” in concrete examples.
At the Conference on Knowledge Discovery and Data Mining (KDD) I talked to two data scientists on different teams in the same company; one worked in marketing, the other in risk analysis. They knew enough about each other’s work to have an idea of which software each other’s team uses for analysis, but they worked with completely different kinds of data and were not aware of all the modeling approaches a data scientist working on different business problems might use. Even within the same company the technology and the paradigms of problems can vary greatly with the datasets and the problems at hand.
Talking about a particular project, we can directly get at some really important applications of data science. However, the details along the way limit the discussions and overall takeaways. Speaking with someone about a fraud detection project may lead you to think data science focuses largely on quick predictions and classification—advanced data engineering oriented towards machine learning and streaming data may seem like key skills in this domain. However, someone who works in marketing may focus a lot on unsupervised learning to segment markets or group customers. They need very reliable data, but it may be a few months or a year old and stored in various ways. Should data scientists learn from thousands of cases, as lawyers and doctors do, to grasp all the methods and technology available and their applications? By the time we document such a curriculum and institutionalize it, all the technologies may have changed.
Since we develop software and consult for all kinds of industries and research problems here at Salford Systems, I face a different challenge when describing my work. I start from a tools perspective rather than a particular problem, like detecting fraudulent credit card activity. We once worked with a grocery store chain to build a recommender system for stocking items. I actually tested a similar modeling strategy to predict which characters an online gamer would perform best with. The datasets are very different in terms of data types, size, and complexity, but the two problems are almost the same. In some respects we cannot treat shoppers and items the same way as players and game characters, whereas in other respects the analogies and the math transfer easily.
Even when speaking with other data scientists, describing what I do day-to-day can leave people without answers to key questions such as:
What goes into the model? Is the dataset intrinsically clean or dirty? How is it stored and accessed? If data is not already stored and accessible, how do you scrape data and how do you decide what variables you want?
What comes out of the model? Are you going for predictive performance? Does interpretability matter? What information actually gets used in terms of variables, plots, or metrics of interest?
If you try to talk about a particular algorithm, then you have to instantiate the concepts with example data and objectives to demonstrate any possible interpretations of models and to discuss when the technique is advantageous or appropriate. If you frame the dataset and problem first, then bring in data science as an answer to the problem, you narrow the scope of interpretations and considerations where the real art resides. Some of our customers in fraud detection are actually quite adamant about trading accuracy for flexibility, speed, and interpretability since fraudsters move around and change tactics on the timescale of days. However, offering financing for furniture to account holders would require different accuracy and modeling considerations even when using the exact same dataset and still focusing on binary classification. We wouldn’t worry about target customers changing behavior rapidly, and misclassification would have a very different cost (e.g. a missed sale on a credit line vs. inconveniencing a card holder to confirm suspicious activity).
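To make the point about misclassification costs a little more concrete, here is a minimal sketch of how the same predicted probability can lead to different decisions once costs enter the picture. The cost figures and function names are hypothetical and chosen only for illustration; they do not come from any customer project.

```python
def cost_threshold(cost_false_positive, cost_false_negative):
    """Probability above which acting has the lower expected cost.

    Acting on a case costs (1 - p) * cost_false_positive in expectation,
    not acting costs p * cost_false_negative, so we act when
    p >= cost_false_positive / (cost_false_positive + cost_false_negative).
    """
    return cost_false_positive / (cost_false_positive + cost_false_negative)


def should_act(probability, cost_false_positive, cost_false_negative):
    """Return True when the model score clears the cost-based threshold."""
    return probability >= cost_threshold(cost_false_positive, cost_false_negative)


# Hypothetical costs: bothering a card holder is cheap compared with missed fraud.
fraud_costs = {"cost_false_positive": 1.0, "cost_false_negative": 19.0}
# Hypothetical costs: a wasted financing offer is pricier than a missed sale.
offer_costs = {"cost_false_positive": 3.0, "cost_false_negative": 1.0}

score = 0.30  # the same model score fed into both scenarios
print("flag as possible fraud:", should_act(score, **fraud_costs))
print("send financing offer:", should_act(score, **offer_costs))
```

The classifier and the score are identical in both calls. Only the costs change, and with them the decision.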
Good data science, though, is elegant. A good recommender does not show you terabytes of data or complex network analytics—it shows you a good restaurant or a new song with low latency, non-invasive presentation, and graceful timing. Good predictive analytics sometimes won’t even show you a single decision tree. What people want to see is increased ROI, a list of potential quality leads to contact (or avoid!), a percentage of deliveries getting to the right place within a time frame, or a projection of how much revenue a project will bring in. Data science makes our lives and decisions easier by packaging all the bolts and wiring behind a façade of simplicity at the end of the day.
When I fielded questions about what I do at Salford, I often found myself flipping through our marketing materials to show people figures of partial dependency plots first. Leading with the results really helped people understand what I do, and what our software has the potential to do, through a real example. In my last post on communication, I talked about leading with results and how it helped with my upcoming webinar (make sure to register for part I and part II coming September 14th and 21st 2016!). For the webinar I used real presidential election data and demographic metrics from 1992 to show how you can use regression for microtargeting insights.
This dataset is tractable and shows how we can take seemingly complicated aspects of a community—demographics and voting behavior—and use them quantitatively. The dataset also contains examples of categorical and numerical variables as well as missing values. The most important thing, however, is that I have already built models with the data and I start explanations from a partial dependency plot:
To use regression for microtargeting, you build a model to predict a target or outcome, like the percent of the vote for a candidate, using other variables, such as the average income of a county, as shown in this plot. What is great about using regression for predictive analytics is not just prediction with granularity targeted to groups or even individual voters, but the ability to use partial dependency plots like these to really uncover the story beneath the predictions the model makes.
For varying amounts of income, this partial dependency plot shows a simulated average of the response over the range of each of the 12 other variables in the model. We plug in real values for the other variables from the data used to build the model, for example crime and population size of a county. Varying the income and plotting the model response gives us a response curve. Plugging in different values for the 12 other variables from the data, we make more “response versus income” curves for these different conditions and then average all the resulting curves together. This allows us to see the voting trends that our model picks up on for just income after accounting for other variables and, in a way, hiding the variations attributed to other factors.
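The averaging described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the toy model, its coefficients, and the three rows of data are made up for the example, and this is not how the Salford Predictive Modeler computes its plots internally.

```python
from statistics import mean


def partial_dependence(predict, rows, feature, grid):
    """Average model response at each grid value of one feature.

    For every grid value, the feature of interest is overwritten in every
    row while the other variables keep their observed values. Averaging the
    predictions is the same as averaging the per-row response curves.
    """
    curve = []
    for value in grid:
        responses = [predict({**row, feature: value}) for row in rows]
        curve.append(mean(responses))
    return curve


def toy_model(row):
    """Hypothetical stand-in for a fitted regression model."""
    return 60.0 - 0.5 * row["income"] + 2.0 * row["crime"]


# Hypothetical county-level rows (income in thousands of dollars).
rows = [
    {"income": 20.0, "crime": 1.0},
    {"income": 35.0, "crime": 3.0},
    {"income": 50.0, "crime": 5.0},
]
grid = [20.0, 40.0, 60.0]

curve = partial_dependence(toy_model, rows, "income", grid)
for income, response in zip(grid, curve):
    print(income, response)
```

With a real model you would pass its prediction function in place of `toy_model` and use the training data for `rows`; the averaging step stays the same.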
We can see very clearly whether increasing a variable has a positive or negative effect on the response. In this case, higher average income has a negative effect on the percent of the vote for the Democratic candidate. We can also see where the model is picking up on local trends or non-linearities.
You can see that our MARS® model fit basis functions at around $22,000 and $52,000 to capture changing trends. Interestingly, the tax brackets for 1992 changed at $21,450 and $51,900. So non-linear regression splines automatically identified the tax brackets and different voting trends for each bracket without using any information about tax laws or the Democratic Party’s economic platform! The Salford Predictive Modeler® software suite automatically generates standardized partial dependency plots for all our regression models, so you can visualize and compare these trends using other regression methods including decision tree ensembles, like Random Forests® models.
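For readers curious how a spline model can change trend at a particular income, here is a rough sketch of the hinge-shaped basis functions that regression splines are built from. The knot locations are the approximate values quoted above; the intercept and all coefficients are hypothetical, and a real MARS model typically also uses mirrored hinges and interaction terms that are left out here.

```python
def hinge(x, knot):
    """Hinge basis function: zero below the knot, linear above it."""
    return max(0.0, x - knot)


KNOTS = (22_000.0, 52_000.0)    # approximate knots mentioned in the post
INTERCEPT = 50.0                # hypothetical
BASE_SLOPE = -0.0002            # hypothetical slope below the first knot
HINGE_COEFS = (-0.0004, 0.0003) # hypothetical change in slope at each knot


def response(income):
    """Piecewise-linear response built from a linear term plus hinges."""
    total = INTERCEPT + BASE_SLOPE * income
    for knot, coef in zip(KNOTS, HINGE_COEFS):
        total += coef * hinge(income, knot)
    return total


def local_slope(income, step=1.0):
    """Finite-difference slope of the response at a given income."""
    return (response(income + step) - response(income)) / step


# One income inside each of the three segments defined by the two knots.
for income in (10_000.0, 30_000.0, 60_000.0):
    print(income, round(local_slope(income), 6))
```

Each hinge switches on only past its knot, so the fitted line is free to bend there. That is the mechanism that lets the model settle on different trends on either side of a breakpoint it was never told about.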
Using a single result—a single plot—I can demonstrate what kind of data can be used and generally how we can use data for predictive analytics with a particular algorithm. From this results-point-of-view I can talk about how you can get similar insights for the percentage of people that respond to a treatment or a product advertising campaign in an analogous way to a political mailer campaign. I can also pivot to talk about other regression and machine learning methods that can leverage partial dependency plots for interpretation. Most of all, I can use this relatable example of an election and this one plot to spark conversations after instilling the potential impact and value of using data science. All this comes from a relatively old and small dataset for the sake of example—but the results are real and are right in front of us. With an already complete and concrete example I can show real value that data science already uncovered and still easily talk about different datasets, models, and scenarios.
Many technologically skilled professionals who are actively interested in data science had a lot of questions about what my job is like or about getting started with data science for themselves or their business. And yet, almost every one of them has interacted intimately with data science through analytics, advertising, or data products and services. Data science is everywhere, but it is easy to forget that data science is not a household term yet. Blog posts like “Reservoir Sampling and Neural Networks for Streaming Data and the IoT” or “Up-Lift Modelling for Cancer Treatments” are interesting and useful to data scientists in particular fields or to particularly creative and savvy readers. However, starting out with esoteric use cases or technical descriptions of specific technologies severely limits the “what,” “why,” and “how” of analysis, if not leaving it out altogether. Outside of the data science hubs like San Francisco, New York City, Los Angeles, and a side-eyed acknowledgement of Austin (you know, the capital of Texas? No, not Dallas) many people have not even heard the term “data science” despite being surrounded by it. What data scientists’ bosses, clients, and consumers want to see are results. The general public and potential users of data science are no different.
Starting from current results and building value first is a boon to everyone. People will be more inclined to invest their data and their business in data science rather than fear overhyped claims of black box automation and robot overlords. Giving a realistic picture of what we are already doing, instead of jumping ahead to potentials like streaming patient genomic data, will help our clients, the public, and policy makers to better understand the realistic next steps for advancement and the actual data governance concerns of today. We reside in a new and growing industry dominated by terminology and technology on the uptick of the hype cycle. Better practices in communication and public outreach will help data scientists set more realistic expectations and will help the general public understand the real value and impact of data science.