Rapid (and Cost-Effective) Data Science Proof of Concept

Hiring top data science talent is extremely expensive for companies these days. Companies like Google, Amazon, and Netflix have very well-defined data sets, can afford heavily PhD’d computer scientists, and can survive if expensive projects don’t work out. Smaller businesses that have realized the value of data and modeling need an inexpensive methodology for testing new ideas with their data. I’ve laid out here a step-by-step process for going from concept to proof of concept in only a few days or weeks with a small team.
[Image: xkcd comic, “Machine Learning”. Credit: xkcd.com]

 

Step 1: Identify the data science problem

What question are you trying to answer, and can it be answered by analyzing data? The best questions are narrow enough to be answered with a limited data set but still leave room to measure error. Bonus points for framing the question as a hypothesis test, with a null hypothesis, an alternative hypothesis, and a p-value.
            Bad data science question: What will the price of my house be next year?
            OK data science question: Can I predict home values in my neighborhood?
            Great data science question: With what confidence can I predict home values in my neighborhood?
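
To make the bonus point concrete, here is a minimal sketch of a p-value test, assuming a permutation test on the difference in mean sale price between two groups of homes. The prices, the group names, and the `permutation_p_value` helper are all made up for illustration. The null hypothesis is that the grouping makes no difference to price.

```python
import random


def permutation_p_value(group_a, group_b, n_permutations=10_000, seed=0):
    """Two-sided permutation test for a difference in group means.

    Null hypothesis: the group labels are interchangeable.
    """
    rng = random.Random(seed)
    n_a = len(group_a)
    n_b = len(group_b)
    observed = abs(sum(group_a) / n_a - sum(group_b) / n_b)

    pooled = list(group_a) + list(group_b)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        # Reassign the labels at random and recompute the difference.
        rng.shuffle(pooled)
        diff = abs(sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / n_b)
        if diff >= observed:
            at_least_as_extreme += 1

    # Add-one correction so the estimated p-value is never exactly zero.
    return (at_least_as_extreme + 1) / (n_permutations + 1)


# Hypothetical sale prices in thousands of dollars.
renovated = [310, 325, 298, 340, 332, 318, 327, 345]
not_renovated = [289, 301, 295, 310, 284, 299, 305, 292]

p_value = permutation_p_value(renovated, not_renovated)
print(f"p-value: {p_value:.4f}")
```

A small p-value here would suggest the price gap between the two groups is unlikely to be an accident of how the homes were labeled. It says nothing about why the gap exists.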
[Image: xkcd comic, “Frequentists vs. Bayesians”. Credit: xkcd.com]

Step 2: Data Selection

Real-world data sets are messy and incomplete. Acknowledge any limitations of the data, and consider bringing in outside data that may improve your analysis. The US government publishes hundreds of thousands of free datasets, including census data, consumer complaints, student loan numbers, and much more. Many sites offer APIs for sourcing data; eBay, Zillow, and Reddit are just a few examples. And if all else fails, a web scraper built with a library like Beautiful Soup can brute-force your way into a useful data set.
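
As a rough illustration of the scraping fallback, the sketch below pulls rows out of an HTML table. It uses Python's built-in `html.parser` rather than Beautiful Soup, so it runs without installing anything; Beautiful Soup offers a friendlier interface for the same job. The HTML snippet, the addresses, and the `TableScraper` class are invented for the example, and a real page would first have to be downloaded.

```python
from html.parser import HTMLParser


class TableScraper(HTMLParser):
    """Collect the text of every table cell, grouped by row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._in_cell = False

    def handle_data(self, data):
        # Only keep text that appears inside a cell.
        if self._in_cell:
            self._row[-1] += data.strip()


# Made-up page content standing in for a downloaded listing page.
SAMPLE_HTML = """
<table>
  <tr><th>Address</th><th>Price</th></tr>
  <tr><td>12 Elm St</td><td>$310,000</td></tr>
  <tr><td>48 Oak Ave</td><td>$289,500</td></tr>
</table>
"""

scraper = TableScraper()
scraper.feed(SAMPLE_HTML)
scraper.close()

header, *records = scraper.rows
listings = [dict(zip(header, record)) for record in records]
# Turn "$310,000" into the integer 310000 so it can be modeled.
prices = [int(item["Price"].replace("$", "").replace(",", "")) for item in listings]
print(listings)
print(prices)
```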

Step 3: Explore the data

Find the irregularities, and decide how to handle missing data. Understand the range of values in every feature, and whether each feature is continuous or discrete. If the values are discrete, can they be treated as categories? If the values are continuous, is a one-unit increase meaningful? What can be gained from creating new features?
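
A first pass at these questions can be as simple as the sketch below, which counts missing values, counts distinct values, and reports the range of each numeric column. The `homes` records and the `summarize` helper are hypothetical, and the code assumes missing values are stored as `None`. A library like pandas would do this in fewer lines; plain Python is used here so that every step is visible.

```python
def summarize(rows):
    """Per-column missing count, distinct count, and numeric range."""
    summary = {}
    for column in rows[0]:  # assumes every row has the same keys
        values = [row[column] for row in rows]
        present = [v for v in values if v is not None]
        info = {
            "missing": len(values) - len(present),
            "distinct": len(set(present)),
        }
        # Only numeric columns get a min/max range.
        if present and all(isinstance(v, (int, float)) for v in present):
            info["min"] = min(present)
            info["max"] = max(present)
        summary[column] = info
    return summary


# Hypothetical records; None marks a missing value.
homes = [
    {"sqft": 1400, "bedrooms": 3, "style": "ranch", "price": 310_000},
    {"sqft": 2100, "bedrooms": 4, "style": "colonial", "price": 425_000},
    {"sqft": None, "bedrooms": 2, "style": "ranch", "price": 265_000},
    {"sqft": 1750, "bedrooms": 3, "style": None, "price": 342_500},
]

report = summarize(homes)
for column, info in report.items():
    print(column, info)

# A derived feature: price per square foot, skipping rows with missing sqft.
price_per_sqft = [h["price"] / h["sqft"] for h in homes if h["sqft"]]
```

A column with only a handful of distinct values, like `style` here, is a candidate for treatment as a category.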
Credit: xkcd.com

Step 4: Model(s) Selection

Decide on your modeling technique(s) and set parameters for the success of the models you decide to use. Understand the cost and benefit of each model. Model the data with as many modeling techniques as is feasible. Below is a general outline of options.
Supervised models: We know what the correct answer should be.
  1. Regression: The output variable is a real, continuous value, such as a weight or a dollar amount.
  2. Classification: The output variable is a category, such as a color or true/false.

Unsupervised models: There is no target variable; the goal is to find underlying relationships.
  1. Clustering: Finding groupings in the data.
  2. Association: Finding rules that describe large portions of the data, for example, people who buy X tend to buy Y.
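
As an example of the simplest supervised option, here is a sketch of a one-feature regression fitted by ordinary least squares. The square footage and price figures are invented, and `fit_line` is a hypothetical helper written out by hand; in practice a library such as scikit-learn would usually do the fitting.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical training data: square feet and sale price in thousands of dollars.
sqft = [1100, 1400, 1750, 2100, 2400]
price = [255, 310, 342, 425, 455]

slope, intercept = fit_line(sqft, price)
predicted = slope * 1800 + intercept
print(f"price = {slope:.3f} * sqft + {intercept:.1f}")
print(f"predicted price for 1800 sqft: {predicted:.0f}k")
```

A model this small makes a useful baseline: anything more elaborate should have to beat it to justify its cost.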
Credit: xkcd.com

Step 5: Evaluate the model

Was it successful? What did you learn? No model is perfect, so you need a metric for how well the model performed before you can improve it. With regression problems, getting close to the target is usually the goal, so an R² score is usually a good metric. But in classification models, some errors can be worse than others, depending on your subject matter. This is a topic for another post, but understanding the relationship between accuracy, precision, recall, and the ROC curve is the most important part of evaluating how good a model is. What conclusions can you draw? Make sure to test on unseen data to find out whether your model is overfitting.
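
The metrics named above are short enough to write out by hand, which can help when explaining them to stakeholders. The sketch below assumes binary labels encoded as 0 and 1. The helper names (`holdout_split`, `r_squared`, `classification_metrics`) and the example labels are made up for illustration.

```python
import random


def holdout_split(rows, test_fraction=0.25, seed=0):
    """Shuffle the rows and set aside a fraction as unseen test data."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = len(shuffled) - int(len(shuffled) * test_fraction)
    return shuffled[:cut], shuffled[cut:]


def r_squared(y_true, y_pred):
    """Share of the variance in y_true explained by the predictions."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


def classification_metrics(y_true, y_pred):
    """Accuracy, precision, and recall for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        # Report 0.0 when the ratio is undefined (no predicted or actual positives).
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# A lazy model that always predicts the majority class (hypothetical labels).
actual = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
always_negative = [0] * 10
metrics = classification_metrics(actual, always_negative)
print(metrics)
```

In this example the lazy model scores 80% accuracy while finding none of the positive cases, which is why accuracy alone can mislead on imbalanced data.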
Credit: xkcd.com

Step 6: Visualization for Presentation

Data visualization should happen at several steps along the way, but formatting the data in a way that is accessible to the stakeholders is just as important as developing a useful model. Tools like Tableau, Plotly, or even Excel can make or break a project. Visualization is more than just making graphs that tell stories; sometimes you have to create images or diagrams that simply explain your thinking or workflow.
Credit: xkcd.com

Step 7: Explain your results

Along the way, take notes on what could be improved or done better. Make sure to document everything so that someone else can easily pick it up and improve on it. The last line of every project should be the next steps for future work.
[Image: xkcd comic, “Self-Description”. Credit: xkcd.com]

Step 0: Keep it Simple

I’m an engineer and I like to optimize, but most of the time, you just need a quick and dirty solution. Get it done quickly, and keep your goals small. Once you get an answer, you can begin optimizing, but optimizing first will get your team stuck in a data hole. Finally, verify at every opportunity that your code and your data are in the form you need and are behaving how you expect.
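
One cheap way to do that verification is a small check that runs every time the data is loaded. The sketch below assumes home listings with `price` and `sqft` fields; the field names, the plausibility thresholds, and the `validate_listings` helper are all assumptions to be replaced with whatever fits your data.

```python
def validate_listings(rows):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    for i, row in enumerate(rows):
        price = row.get("price")
        sqft = row.get("sqft")
        if not isinstance(price, (int, float)) or price <= 0:
            problems.append(f"row {i}: bad price {price!r}")
        # The 100 to 20,000 sqft range is an assumed sanity bound, not a rule.
        if not isinstance(sqft, (int, float)) or not 100 <= sqft <= 20_000:
            problems.append(f"row {i}: implausible sqft {sqft!r}")
    return problems


# Hypothetical rows: the second and third each contain one problem.
listings = [
    {"price": 310_000, "sqft": 1400},
    {"price": -5, "sqft": 1500},
    {"price": 289_500, "sqft": None},
]

issues = validate_listings(listings)
for issue in issues:
    print(issue)
```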

Credit: xkcd.com

Conclusion:

The idea that machine learning and artificial intelligence require extensive specialized education is now a myth, largely because the field of data science has been democratized. For business leaders looking to harness the massive ROI that can come from taking a machine learning approach to their data: find a technologist who can ask the right questions and get the information as fast as possible.
Credit: xkcd.com

All the images in this post are borrowed from xkcd.com
