Data Science From Scratch

I visit Quora regularly and am always surprised by the number of people asking how to become a data scientist. It’s a fascinating field, and one I was able to (mostly) “bootstrap” into, out of a quantitative PhD (bioengineering). This is my simple guide on how to become a data scientist.

Get in touch if you have comments about this guide, or if you have questions about data science or careers.

Understand Programming

For true beginners, tutorials like W3Schools will get you off the ground.

For practical skills, Cracking the Coding Interview is a highly regarded guide. I’d also recommend coding challenges like leetcode and codewars.

It’s useful to know about data structures and algorithms. I learned this topic mostly ad hoc. CLRS has been recommended to me, though I haven’t read it myself.

For background on the history and culture of programming, Paul Ford’s book-length essay What is Code? is truly indispensable. It’s the best essay I have ever read on programming and maybe the best ever written.

Python and SQL are the most important languages for data science.

Understand Machine Learning

The absolute best place to begin is Andrew Ng’s Machine Learning Course. Regression, classification, gradient descent, supervised, unsupervised. Even a brief piece on neural networks. Essential knowledge, good presentation, good depth, good assignments.

“Introduction to Statistical Learning” and “Elements of Statistical Learning” are a great resource, expanding on Ng in some ways. ESL goes into considerable depth, and the PDFs are available online from the authors.

Machine learning has moved on and it’s good to be familiar with the new techniques, namely XGBoost and deep learning.

For XGBoost, reading through the documentation is a good place to start. There’s also a scholarly paper about it.

For neural networks, I recommend Stanford 231n Winter 2016. I think there is a newer version up, but I have not studied it yet.

Understand Statistics

In some ways, this may be the least important piece of it, at least for starting out. It’s rarer to get an interviewer who really understands statistics. Still, I very much recommend understanding statistics, and I value the statistics I have learned. In the long run, it will help you justify and be confident in your results.

The point of entry here is Brian Caffo’s Biostatistics Boot Camp. It’s a little bit dry, but I truly valued the precision and rigor in this course. There is also a part 2.

I read the textbook “Statistical Rethinking” on Bayesian modeling and thought it was excellent. There’s also a free PDF online from the author of “Doing Bayesian Data Analysis”, which has been recommended to me.

“Mathematical Statistics and Data Analysis” is an excellent, comprehensive reference, but maybe not worth reading all the way through for most.

Other areas of statistics I want to learn more about: information theory, latent variable modeling, factor analysis, linear modeling.

Keeping Up

That’s it. If you understand programming, machine learning and statistics, you are well on your way to landing a data science job.

It’s important to stay on top of the game. Podcasts, meetups and forums are good for this. For podcasts, the Google Cloud podcast is good, and Data Skeptic is also not bad. The best forum I know of is probably Hacker News.

It’s also important to explore your own interests as well. I’ve been finding myself more towards the ETL/data engineering/infrastructure end of things rather than pure analysis.

Leave a comment