Bootstrapping a Modern Data Science Education

September 4th, 2018

We are entering a new age in which we will need to learn continuously. Everybody follows the journey in their own way. Each of us has to be a good student as well as a good teacher.

Many of the world's universities, such as Stanford, MIT, and Tsinghua, are releasing their courses online as Massive Open Online Courses, or MOOCs for short. It's an exciting time to enter the data science field, and universities need to think through the value they can add as content becomes commoditized (though this is a larger subject worthy of its own piece).

DataRobot is known as a product company but what few realize is that DataRobot is also an education company. Below are some thoughts and resources to help you accelerate your data science educational journey, as well as an introduction to the game changer that is automation — making modeling accessible to all.

First, let's decompose what it means to be a data scientist. There are many definitions, but it is best represented visually in this Venn diagram popularized by Drew Conway:

That is to say, one needs to have some degree of knowledge of the underlying techniques to be applied, the programming ability to implement the vision into a reality, and the domain knowledge to understand how an analysis or model will add value. If you are sharpening your skill set in any of these three areas you are systematically improving yourself as a data scientist.

 

Let’s look into some references within each skill set:

Math & Statistics:

  • Prediction is at the heart of data science, and some would say it’s the core of science itself. The Elements of Statistical Learning by Hastie, Tibshirani, and Friedman is often considered the gold standard for introducing the theory behind many common predictive modeling algorithms. An Introduction to Statistical Learning is a more accessible reference written by several of the same authors and their colleagues. The authors also offer a Statistical Learning MOOC through Stanford, so why not hear the algorithms explained straight from the horse's mouth?

    • We recommend starting with the linear or logistic regression algorithm as a base. Logistic regression is the building block of fancier algorithms, such as the deep learning algorithms that you often read about today.
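To make the "building block" point concrete, here is a minimal sketch of logistic regression in plain Python, fitted by batch gradient descent. It is an illustration rather than a reference implementation: the function names, the learning rate, the epoch count, and the toy data are all assumptions chosen for this example. Note that `predict_proba` is a weighted sum passed through a sigmoid, which is the same computation performed by a single sigmoid neuron in a neural network.

```python
import math
import random


def sigmoid(z):
    """Logistic function, written to stay numerically stable for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict_proba(weights, bias, x):
    """Weighted sum of the inputs passed through a sigmoid (one 'neuron')."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def fit_logistic_regression(X, y, lr=0.1, epochs=500):
    """Fit weights and bias by batch gradient descent on the average log loss."""
    n_features = len(X[0])
    weights = [0.0] * n_features
    bias = 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for xi, yi in zip(X, y):
            # Gradient of the log loss w.r.t. the linear score is (p - y).
            error = predict_proba(weights, bias, xi) - yi
            for j in range(n_features):
                grad_w[j] += error * xi[j]
            grad_b += error
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


# Toy, made-up data: the label is 1 when the two features sum to more than 1.
random.seed(0)
X = [[random.random(), random.random()] for _ in range(200)]
y = [1 if a + b > 1 else 0 for a, b in X]

weights, bias = fit_logistic_regression(X, y, lr=0.5, epochs=2000)
accuracy = sum(
    (predict_proba(weights, bias, xi) >= 0.5) == bool(yi) for xi, yi in zip(X, y)
) / len(X)
print(f"weights={weights}, bias={bias:.2f}, training accuracy={accuracy:.2f}")
```

Stacking many such units in layers, and replacing the sigmoid with other activation functions, is roughly how one gets from this model to the deep learning architectures mentioned above.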

Programming Ability:

  • Data science is a team sport and, coincidentally, every software engineering project you undertake is also a collaboration. At a minimum, it is a collaboration between you and your future self. Therefore, you want to strive for code that is readable, well-documented, and tested, and thus easier to maintain. These principles are universal, whether you write the analysis in R, Python, Julia, or another language. Remember to ‘be kind to your future self’.

    • R and Python are the modern lingua francas of data science. Within DataRobot, a question we are often asked is, “Which one should I use?” Ultimately, we see this question as counterproductive. Instead, we recommend focusing on the problem at hand and reaching for the tool that you believe presents the lowest friction on the way to a solution. One can try both and then build a deeper foundation in the approach that comes more naturally. Within DataRobot, we have made it a point to offer our DataRobot API in both languages so that the user is not faced with an ultimatum.
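As a small illustration of the "be kind to your future self" advice, the sketch below shows one way a readable, documented, and tested helper might look in Python. The function `mean_absolute_error` and its checks are hypothetical examples picked for this post, not part of any particular library or of the DataRobot API.

```python
def mean_absolute_error(actual, predicted):
    """Return the average absolute difference between two equal-length sequences.

    Raises ValueError if the inputs are empty or differ in length, so that a
    silent mismatch cannot produce a misleading score.
    """
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("inputs must not be empty")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def test_mean_absolute_error():
    """Small checks that your future self can rerun after any change."""
    assert mean_absolute_error([1, 2, 3], [1, 2, 3]) == 0
    assert mean_absolute_error([1, 2, 3], [2, 2, 5]) == 1.0
    try:
        mean_absolute_error([1, 2], [1])
    except ValueError:
        pass
    else:
        raise AssertionError("expected ValueError for mismatched lengths")


test_mean_absolute_error()
```

The same habits (a descriptive name, a docstring that states the contract, and a few quick checks) carry over to R or Julia with only syntactic changes.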

Domain Knowledge:

  • When presenting your work, the last thing you want to convey is the image of an out-of-touch data scientist who is disconnected from the real world. Therefore, for a first project, our only recommendation is to apply your knowledge to a field you are intrinsically passionate about. The storytelling and communication aspects of the project will also be easier as a result.

There are many great educational platforms, such as Coursera, edX, Udemy, Udacity, DataCamp, Kaggle Learn, and fast.ai. No one platform has a monopoly on great content, and within the rapidly changing space of data science, you should seek truth and knowledge wherever you find it.

However, as unbiased as we try to be within DataRobot, we can’t hide our fondness for DataCamp. Its focus on data science and its learning-by-doing approach make it a company worth checking out as a budding data scientist.

“The measure of how well you learned something is the degree to which you can build with it” - Rachel Thomas, Co-Founder fast.ai

The economy is built upon products and services. To make a living as a data scientist, you need to build something or teach others. Therefore, we strongly recommend that you get into the trenches and start applying your knowledge as soon as possible. This can be done through a Kaggle competition, a reproducible analysis in a blog post, an open source contribution, a real-world application within your community, and so on.

The Game Changer in Automation 


Acquiring the quantitative skill set is essential in the modern economy, but there is no getting around the significant upfront time investment required. It can take months or years to acquire these skills, only to find out later that much of what you learned is not valued by the business. (For example, your boss probably will not care how many hidden layers you used in your neural network.)


DataRobot was one of the first to build a platform that automates the mathematical and programming aspects of the workflow, thus enabling the user to operate at a higher level of abstraction and focus on the core business problem.

 

For more information, visit: https://blog.datarobot.com/booststrapping-a-modern-datascience-education