Data Science From Scratch

I visit Quora regularly and am always astonished by the number of people asking how to become a data scientist. It’s a fascinating field, and one I was able to (mostly) “bootstrap” into, out of a quantitative PhD (bioengineering). There are three main components to making it in data science.

Understand Programming

For true beginners, tutorials like W3Schools will get you off the ground.

For practical skills, Cracking the Coding Interview is a highly regarded guide. I’d also recommend coding challenges like leetcode and codewars.

It’s useful to know about data structures and algorithms. I learned this topic mostly ad hoc. CLRS has been recommended to me, though I haven’t read it myself.

For background on the history and culture of programming, Paul Ford’s book-length essay What is Code? is truly indispensable. It’s the best essay I have ever read on programming and maybe the best ever written.

Python and SQL are the most important languages for data science.

Understand Machine Learning

The absolute best place to begin is Andrew Ng’s Machine Learning Course. Regression, classification, gradient descent, supervised, unsupervised. Even a brief piece on neural networks. Essential knowledge, good presentation, good depth, good assignments.

“Introduction to Statistical Learning” and “Elements of Statistical Learning” are a great resource, expanding on Ng in some ways. ESL goes into considerable depth, and the PDFs are available online from the authors.

Machine learning has moved on and it’s good to be familiar with the new techniques, namely XGBoost and deep learning.

For XGBoost, reading through the documentation is a good place to start. There’s also a scholarly paper about it.

For neural networks, I recommend Stanford 231n Winter 2016. I think there is a newer version up, but I have not studied it yet.

Understand Statistics

In some ways, this may be the least important piece of it, at least for starting out. It’s rarer to get an interviewer who really understands statistics. Still, I very much recommend understanding statistics, and I value the statistics I have learned. In the long run, it will help you justify and be confident in your results.

The point of entry here is Brian Caffo’s Biostatistics Boot Camp. It’s a little bit dry, but I truly valued the precision and rigor in this course. There is also a part 2.

I read the textbook “Statistical Rethinking” on Bayesian modeling and thought it was excellent. There’s also a free PDF online from the author of “Doing Bayesian Data Analysis”, which has been recommended to me.

“Mathematical Statistics and Data Analysis” is an excellent, comprehensive reference, but maybe not worth reading all the way through for most.

Other areas of statistics I want to learn more about: information theory, latent variable modeling, factor analysis, linear modeling.

Keeping Up

That’s it. If you understand programming, machine learning and statistics, you are well on your way to landing a data science job.

It’s important to stay on top of the game. Podcasts, meetups and forums are good for this. For podcasts, the Google Cloud podcast is good, and Data Skeptic is also not bad. The best forum I know of is probably Hacker News.

It’s also important to explore your own interests as well. I’ve been finding myself more towards the ETL/data engineering/infrastructure end of things rather than pure analysis.


Deep Neural Networks: CS231n & Transfer Learning

Deep learning (also known as neural networks) has become a very powerful technique for dealing with very high dimensional data, i.e. images, audio, and video. As one example, automated image classification has become highly effective. This task consists of putting an image into one of a certain number of classes.

Look at the results of the ImageNet Large Scale Visual Recognition Challenge. This is an annual challenge for image classification, recognition and detection algorithms, for images from 1,000 categories. The top-5 error rate has decreased from 28% in 2010, to 26% in 2011, to 15% in 2012, then 11%, 7%, 3.5%, to 3% this year. (Classification error). Other tasks continue to improve with the use of neural networks, especially convolutional and recurrent neural networks.

The number of resources for learning about neural networks has also multiplied dramatically. One resource I can recommend firsthand is the CS231n Stanford course about image recognition. The course syllabus, including slides and lecture notes (and Jupyter notebook assignments) are available online. Additionally, the lectures are available on YouTube. I’m up to lecture 5 and I can highly recommend both the slides and YouTube lectures. There is really excellent theoretical background on topics like the history of image processing, loss functions, backpropagation, but also practical advice on weight initialization, transfer learning, learning rate, regularization and more.

Another resource that has been linked recently on HackerNews is the Yerevann guide to deep learning, which seems to be a good, deep source of information.


The folks at TensorFlow just keep improving their offerings. I recently implemented their transfer learning using Inception v3 using 8 categories I chose from ImageNet. It was surprisingly easy to train my own classifier, which was quite effective (top 1 error rate <10%)! The only real issue I had was some bazel errors which were resolved by upgrading my version of bazel. Training on roughly 8,000 images took about 12 hours on a MacBook Pro using CPU only. To be more specific, the bottleneck phase took about 12 hours, while the actual training about 20 minutes. Using ~1,000 images per category, is probably not totally necessary for an effective classifier, so you can likely cut down on this time dramatically.


Collecting Neighborhood Data

Since deploying my latest web application, a Los Angeles Neighborhood Ranker, I’ve wanted to explain the process of gathering the data. The first step was to decide which neighborhoods to use, what they’re called, and how they’re defined, geographically.

The first stop I made is the LA Times Mapping LA project. It has a mapping feature for many Los Angeles neighborhoods. Each neighborhood has an individual page where the neighborhood is plotted on a map. After inspection of the page’s source code, the coordinates of the neighborhood’s boundaries can be discovered:


"geometry": { "type": "MultiPolygon", "coordinates": 
[ [ [ [ -118.483981, 34.041635 ], [ -118.483766, 34.041430 ]...

This information can be collected and stored for further use via web scraping using a Python library like urllib2 to open the page and regular expression to find the coordinates. The following shows the essential steps:

import urllib2
import re
### Code here to define fullurl
webpage = urllib2.urlopen(fullurl)
for lines in webpage:
  d ="geometry.+(\[ \[ \[ \[.+\] \] \] \])",lines)
  if d:
    ### store

Storing these coordinates to text allows them to be used in mapping and data gathering. Mapping is accomplished with leaflet.js and Mapbox, and I can describe that later. For now I will talk about how the nieghborhood coordinates help in the data collection process.

For the neighborhood ranker, I needed neighborhood-level data in categories like house price and others. Unfortunately, the existing real estate APIs (Zillow e.g.) don’t support neighborhood-level data, nor would it be likely they have exactly the same neighborhood definition.

The Zillow API, which I used for real estate data, only supports single house queries based on street address. So, how to go from single Zillow queries to average house price in a neighborhood? A first natural instinct may be to collect information on every house in a neighborhood and compute the average.

However, perhaps it isn’t necessary to get every value. Consider surveys like the census and employment situation surveys. These gather important information about populations without getting information from every member of the population. Instead they use knowledge of statistics and error to estimate the values and quantify uncertainty.

Thus, getting every house in the neighborhood is unnecessary. But how many are necessary? We estimate the standard error of the mean using sample information:

SE,mean = s/sqrt(n)

Where s is the sample standard deviation and n is the number of samples.

Thus, if the distribution of prices for each neighborhood is relatively tight, you wouldn’t need many samples at all to get a good estimate of population mean.

In general what I found is that only about 5 samples were needed per neighborhood to get a good estimate, with estimated error about 10-20% of sample average. This is sufficient to rank the neighborhoods as they vary in average price from $300,000 to $5,000,000 or about 1500%.

But how do you generate random samples within neighborhoods? Zillow doesn’t support it. One thing that may help is to first generate a random coordinate-pair within the neighborhood. I use the following:

def get_random_point_in_polygon(poly):
  (minx, miny, maxx, maxy) = poly.bounds
  while True:
    p = Point(random.uniform(minx, maxx), random.uniform(miny, maxy))
    if poly.contains(p):
      return p

The poly argument is a Polygon object from the Shapely library, which defines the neighborhood boundry. The function returns a randomly generated Point object which is inside the Polygon object. We can actually plot the neighborhood and look at the coverage we get with let’s say, 50 points.

xy = coords['santa-monica']

xylist = zip(*xy)
p = Polygon(xylist)
points = [get_random_point_in_polygon(p) for i in range(50)]

ax = plt.subplot()
for point in points: ax.plot(point.x,point.y,marker = 'o')

This is the result:


50 random points in the Santa Monica neighborhood.

There is one more step. Zillow accepts API queries using addresses, not coordinates. You can get addresses from coordinates by using a ‘geocoding’ API. I used Google’s but there are others available. Using the address information, the Zillow API may be called, and the price information extracted. A similar method can be used for Yelp API and restaurant density.

On a technical note, this method may not give a completely random sample from a neighborhood. A random point will be assigned to the nearest address during the geocoding step. Thus, addresses in high density areas are less likely to be chosen. If price correlates with density within a neighborhood, this could yield biased samples.

Ranking Neighborhoods in Los Angeles

My latest web application is an interactive, personalized map-based ranking of neighborhoods in Los Angeles. I’ve spent the last few weeks gathering data points for each of 155 neighborhoods in the Los Angeles area. That on its own could (and will soon) be the topic of its own post.

For now, I wanted to explain the ranking algorithm I use to come up with the rank score.

First, the features I have gathered on each of the neighborhoods:

  • Property Crime
  • Violent Crime
  • Air Quality
  • School Quality
  • Population Density
  • Restaurant Density
  • Price to Buy
  • Price to Rent

How to rank neighborhoods based on these characteristics? How to do so in a way that allows users to choose which features they care about and their relative importance?

First, assign each neighborhood a score within each category. For purposes of simplicity, I made each score from 0 to 100. The best neighborhood in each category would rank 100, the worst 0. But how exactly to determine this? Some features are clearly better when low (price, crime) and some are better when higher (walkability, school quality). Furthermore, there is the choice of whether to apply any other transformations to the data such as log-scaling etc.

So I defined functions that transform the data in desired ways.

import math
def standard(n):
	return n
def neglog(n):
	return -math.log(n)
def neg(n):
	return -n

Now I can define a dictionary pairing each dataset with the desired transformation function.

 fundic = {
     'aqf':neg, #air quality
     'hpf':neglog, #house price
     'rpf':neg, #rent pruce
 'pcf':neg, #property crime
 'pdf':standard, #population density
 'rdf':standard, #restaurant density
 'sqf':standard, #school quality
 'vcf':neg, #violent crime
 'wsf':standard, #walkscore

Python treats functions as first-class citizens (basically as variables) so this allows us to pass the function from the dictionary basically as any other variable. Within a loop that iterates over each data set, we can select the function corresponding to the dataset stem of choice.

 for stem in list_of_stems:
      fun = fundic[stem]

Now, while loading the data set, we apply the function

 featdic = {}
 for line in datafile:
     feats = line.split(',')
     featdic[ feats[0] ] = fun( float( feats[1] ) )

Here we load the raw value into a dictionary keyed by the neighborhood name. These appear together in the data storage file separated by a comma. Now find the max and min.

 vals = featdic.values()
 minval = min(vals)
 maxval = max(vals)

Now to put the data in order and normalize it.

 orderfile = open('name_order.txt','r')
 for line in orderfile:
     name = line
     rankscore = ( featdic[name] - minval)/( maxval - minval ) * 100.0

So this gives us a rank score now normalized to 0 to 100 and transformed by negative, log or any other transformation we’d like.

Now to address another question: given the user’s rank order of features, how to use that to generate a composite ranking? There is no perfect answer to this question. Of course, you could give the user control over the exact weighting, but this seems like unecssary feature bloat. Instead I decided to weight harmonically. The harmonic series goes like: 1, 1/2, 1/3, 1/4 … This weighting is arbitrary but seems quite natural. First place is twice as important as second, three times as important as 3rd and so on.

On the server side, I have Flask waiting for the names of columns to include in the ranking. First I have a copy of the order of the names in the database. (In this case just a list of lists in a .py file, but could be in a SQL database or other configuration)

names = ["prop-crime", 

Now I can define a function to get the composite ranking. First get the number of columns.

def getrank(cols):
    n = len(cols)

Load in the names and the list of scores.

    scoremat = scores()
    nameslist = names()

Now calculate the overall weight from n columns (i.e. the sum from 1 + 1/2 + … 1/n)

     norm = sum( [ 1/float(i+1) for i in range(n)] )

And now get the rank and index of each feature in the list.

    rankinf = [(i+1, nameslist.index(j) ) for i,j in enumerate(cols)]

And finally calculate the composite rank for each neighborhood and return it as a list of strings.

    return [ '{0:.1f}'.format( sum( [ 1/float(i)*sl[j] for i,j in rankinf ] )...
        /norm ) for sl in scoremat ]

SQL: Combining tables without JOIN

SQL is a versatile language–there are many ways to get the job done.

One way to compare information from multiple tables is the JOIN command, but there are sometimes alternatives.

Let’s say I’ve got two tables, office and person:


Let’s say I need to know which people have printers. I could do a JOIN ON office_room = office_num. On the other hand, if you just select FROM multiple tables at the same time, you can filter with WHERE.

SELECT name, has_printer FROM office, person WHERE office_room=office_num;


It turns out that SELECT FROM tbl1,tbl2 is actually the same as a CROSS JOIN, also known as the cartesian product of two tables. This joins every row of the first table with every row of the second table. Use WHERE to filter out only the correct rows you want.

Now let’s say I want to know the number of chairs left in each office. One way would be to JOIN, something like:

SELECT office_num,
num_chairs – count(name)
AS chairs_left
FROM person
JOIN office
ON office_room = office_num
GROUP BY office_room;

But another option is a correlated subquery, where the subquery refers to a column in the outer query, and doesn’t use JOIN at all:

SELECT office_num,
num_chairs –
(SELECT count(name)
FROM person
WHERE office_room = office.office_num)
AS chairs_left
FROM office


Correlated subqueries definitely have a performance hit, since each row is evaluated separately, but they can be quite handy.

Note: These recommendations aren’t best practices, but handy syntax to be aware of and are sometimes convenient.

Information Criteria & Cross Validation

A problem of predictive models is overfitting. This happens when the model is too responsive and picks up trends that are due to quirks in a specific sample, and not reflective of general, underlying trends in the data process. The result is a model that doesn’t predict very well. A model can be made less responsive by regularization–i.e. penalizing the model for complexity–and/or by reducing the number of parameters.

Of course, a completely unresponsive model is not correct either, so a balance must be struck. But how much regularization is the right amount? How many parameters is the correct amount? There are (at least) two main ways to test: cross validation & information criteria.

Cross validation

Cross validation is the practice of leaving out some of the data from the training set (20-40% of the data) and using it to select between different models. The notion here is that the leave-out data is ‘fresh’ to the model, and is thus representative of new data that the model would face in production. The various models being selected between can all be tested against the leave-out data, and the one that scores the best is selected.

Information criteria

Information criteria are a slightly different approach. In general, they rely on calculating the joint likelihood of observing the data, under the model, and taking the negative log, a quantity known as ‘deviance’. The information criteria is some function of the deviance. The Akaike Information Criterion (AIC) is two times the deviance plus two times the number of parameters. The information criteria, just like cross validation, can be used for model selection.

Calculating deviance requires a model that includes error, not just a point estimate. (Otherwise: the likelihood of any data point is just zero). In some methods of model generation (i.e. normal equation for linear regression), the error isn’t explicitly calculated. Thus, information criteria can’t be used for these ‘off-the-shelf’ types of models.

When they are used

Information criteria are often used in the context of Bayesian modeling, when the model explicitly includes error, and determines uncertainty in all parameters. The information criteria are somewhat abstract but seem more soundly based, theoretically speaking.

In contrast, cross validation can be used even when error & uncertainty is not modeled. Additionally, cross validation is highly applied and the principle makes sense and appeals even to machine learning novices.

Overall, both methods are highly useful and informative. Information criteria may be more sound, theoretically speaking, and may appeal to academic types for this reason. In general, however, more people are likely to be familiar with cross-validation, and it’s probably easier to explain and sell this technique to a non-technical audience.

A brief aside: Video Productions

On a slightly different note, I’d like to showcase my award-winning video production here.

These two won a total of $2500 in a competition sponsored by UCLA Department of Chemical and Biomolecular Engineering:

These were produced for UCLA California NanoSystems Institute.

These two bike safety videos each won $200 in a contest sponsored by UCLA Bicycle Coalition.

Jupyter Notebook Explorations

I’ve recently been getting to know the Jupyter Notebook environment. It’s a very convenient development environment for many languages, but is well-known as the evolution of the iPython Notebook.

Jupyter works by starting a server on your own computer to execute python (or another language) code. The notebook is rendered in a web browser and consists of cells of code. Cells can be run individually, which sends the code to the Jupyter server on your computer. The results (text, plots or error messages) are sent back to the browser and displayed directly below the code cell. Additionally, the Python instance on the server stores the variables created, so they can be used again in another cell.

Alternatively, you can access a remote server which is set up for Jupyter Notebook use, such as at

The benefits of Jupyter over, let’s say, Sublime/cmd, are convenience and presentation style. The Jupyter Notebook, because of the cell architecture, makes it easy to run one small piece of code at a time. This makes debugging much faster & more streamlined, as it’s easy to check intermediate variable values, types and so on. Further, it’s very easy to share your work, for example on, and it looks good too, with all plots rendered on page.

On the other hand, Jupyter Notebook is not really in the same category with ‘fully featured’ Python IDEs like PyDev or PyCharm. For one thing, the output is not a .py file, but a .ipynb file, which which is actually a text file holding the code and output in a JSON array. Development of python packages or applications is thus not really possible or expected in Jupyter. On the other hand, demonstration of existing packages and data analysis is the main point and expected use of Jupyter.

Check out my Jupyter Notebooks:

Statistical Diversions: Collinear Predictors

Collinear predictors present a challenge in model construction and interpretation. This topic is covered in intuitive and engaging style in Chapter 5 of the excellent book Statistical Rethinking, by Richard McElreath.

Collinear predictors refers to when predictor variables correlate strongly with one another. To see why this can be a problem, consider the following example.

Let’s say we have:

Independent variables: x,  y

Dependent variable: z

We can use R to generate sample data sets in the following way:

x = rnorm(50)
y = rnorm(50,mean = -x,sd = 1)

This generates a set of normally distributed samples, x, with mean 0 and sd 1. It then generates a sample of y, using -x as the mean for each sample. This produces correlated variables:

[1] -0.7490608

Correlation varies from -1 to +1, with 0 being uncorrelated and +/- 1 indicating perfect (linear) correlation.

We can plot this data in 2 dimensions to get a better picture:


As we can see, x & y are clearly correlated.

We can now generate a sample for z:

z = rnorm(50, mean = x+y)

Let’s take a look at the x-z and y-z plots:


Here x and z appear completely uncorrelated.


While y and z look to have a slight positive correlation. Is that all we can do?

We can also use built-in R functions to look at the point estimates for the slopes of a linear regression on these data sets:

LMxz = lm(z ~ x)


LMyz = lm(z ~ y)


As we suspected, the slope for x is close to 0, while for y it is larger and positive.

(Note that we have only looked at the point-estimates for the slopes without quantifying uncertainty. In general, it’s better to have an estimate for the uncertainty in model parameters, but we’ll leave that for another time.)

Is this the whole story? Let’s try the following:

LMxyz = lm(z ~ x + y)

x                      y
0.9391245     0.9768274

Here we can see that when the linear model includes both x & y, both slopes are positive and near 1. What happened to our other coefficients, -0.17 and 0.51?

Picture a three dimensional space, x-axis coming towards you, y-axis going left and right & z-axis going up and down.


Now remember, most of our data is actually scattered roughly around the line y = -x. Combine that with the plane z = x+y (roughly what we got from the multivariate linear model), and this is what it looks like:


The plane cuts through the origin, coming down, towards us and to the left, while going away, up and to the right. The line y =-x is indicated by a thin gray line. The intersection of the line and the plane occur at z=0 (!).

This is how the single predictor models don’t show an effect, while the multivariate does. The error in the data averages roughly around the plane as it cuts through the origin, so it is only after controlling for y that the effect of x can be determined (and vice versa).

The predictive power of an x-only or y-only model is not awful (though it is worse), so long as the new data still clusters around y=-x. However, the interpretation of the model is significantly clouded by the collinearity between x & y until the multivariate model is inspected.

Fun side note: the result we get nearly recovers the statistical process used to generate z ( z = rnorm(50, mean = x+y) ).


The py_emra package

I’m pleased to announce the py_emra package for Ensemble Modeling Robustness Analysis (EMRA). EMRA is a tool for modeling metabolic systems which I have worked on through my PhD.

The crux of EMRA is to use dynamic stability as a criteria for model selection and simulation in metabolic systems (e.g. bacterial metabolism). This depends on using what’s called a kinetic model of metabolism, one which uses a rate law to represent the amount of flux through any given reaction. Stability can be analyzed mathematically based on the specifics of these rate laws and how the reaction network is connected.

Understanding the stability of metabolic systems is important for redirecting them towards products of interest like biofuels. In the longer term, stability could be an interesting concept to explore in more complex cases like human diseases, particularly those with a metabolic component like diabetes or cancer.

As an exercise in Python coding and version control with Git, I decided to translate the existing EMRA code base from MATLAB into Python. Luckily, most of the features, including symbolic math and numerical integration, are available.

The Python package and readme is available on Github.

A paper about the package is available on biorXiv.

For more about EMRA see my Google Scholar page.