Category: Uncategorized

Information Criteria & Cross Validation

A problem of predictive models is overfitting. This happens when the model is too responsive and picks up trends that are due to quirks in a specific sample, and not reflective of general, underlying trends in the data process. The result is a model that doesn’t predict very well. A model can be made less responsive by regularization–i.e. penalizing the model for complexity–and/or by reducing the number of parameters.

Of course, a completely unresponsive model is not correct either, so a balance must be struck. But how much regularization is the right amount? How many parameters is the correct amount? There are (at least) two main ways to test: cross validation & information criteria.

Cross validation

Cross validation is the practice of leaving out some of the data from the training set (20-40% of the data) and using it to select between different models. The notion here is that the leave-out data is ‘fresh’ to the model, and is thus representative of new data that the model would face in production. The various models being selected between can all be tested against the leave-out data, and the one that scores the best is selected.

Information criteria

Information criteria are a slightly different approach. In general, they rely on calculating the joint likelihood of observing the data, under the model, and taking the negative log, a quantity known as ‘deviance’. The information criteria is some function of the deviance. The Akaike Information Criterion (AIC) is two times the deviance plus two times the number of parameters. The information criteria, just like cross validation, can be used for model selection.

Calculating deviance requires a model that includes error, not just a point estimate. (Otherwise: the likelihood of any data point is just zero). In some methods of model generation (i.e. normal equation for linear regression), the error isn’t explicitly calculated. Thus, information criteria can’t be used for these ‘off-the-shelf’ types of models.

When they are used

Information criteria are often used in the context of Bayesian modeling, when the model explicitly includes error, and determines uncertainty in all parameters. The information criteria are somewhat abstract but seem more soundly based, theoretically speaking.

In contrast, cross validation can be used even when error & uncertainty is not modeled. Additionally, cross validation is highly applied and the principle makes sense and appeals even to machine learning novices.

Overall, both methods are highly useful and informative. Information criteria may be more sound, theoretically speaking, and may appeal to academic types for this reason. In general, however, more people are likely to be familiar with cross-validation, and it’s probably easier to explain and sell this technique to a non-technical audience.


A brief aside: Video Productions

On a slightly different note, I’d like to showcase my award-winning video production here.

These two won a total of $2500 in a competition sponsored by UCLA Department of Chemical and Biomolecular Engineering:

These were produced for UCLA California NanoSystems Institute.

These two bike safety videos each won $200 in a contest sponsored by UCLA Bicycle Coalition.

Jupyter Notebook Explorations

I’ve recently been getting to know the Jupyter Notebook environment. It’s a very convenient development environment for many languages, but is well-known as the evolution of the iPython Notebook.

Jupyter works by starting a server on your own computer to execute python (or another language) code. The notebook is rendered in a web browser and consists of cells of code. Cells can be run individually, which sends the code to the Jupyter server on your computer. The results (text, plots or error messages) are sent back to the browser and displayed directly below the code cell. Additionally, the Python instance on the server stores the variables created, so they can be used again in another cell.

Alternatively, you can access a remote server which is set up for Jupyter Notebook use, such as at

The benefits of Jupyter over, let’s say, Sublime/cmd, are convenience and presentation style. The Jupyter Notebook, because of the cell architecture, makes it easy to run one small piece of code at a time. This makes debugging much faster & more streamlined, as it’s easy to check intermediate variable values, types and so on. Further, it’s very easy to share your work, for example on, and it looks good too, with all plots rendered on page.

On the other hand, Jupyter Notebook is not really in the same category with ‘fully featured’ Python IDEs like PyDev or PyCharm. For one thing, the output is not a .py file, but a .ipynb file, which which is actually a text file holding the code and output in a JSON array. Development of python packages or applications is thus not really possible or expected in Jupyter. On the other hand, demonstration of existing packages and data analysis is the main point and expected use of Jupyter.

Check out my Jupyter Notebooks:

Statistical Diversions: Collinear Predictors

Collinear predictors present a challenge in model construction and interpretation. This topic is covered in intuitive and engaging style in Chapter 5 of the excellent book Statistical Rethinking, by Richard McElreath.

Collinear predictors refers to when predictor variables correlate strongly with one another. To see why this can be a problem, consider the following example.

Let’s say we have:

Independent variables: x,  y

Dependent variable: z

We can use R to generate sample data sets in the following way:

x = rnorm(50)
y = rnorm(50,mean = -x,sd = 1)

This generates a set of normally distributed samples, x, with mean 0 and sd 1. It then generates a sample of y, using -x as the mean for each sample. This produces correlated variables:

[1] -0.7490608

Correlation varies from -1 to +1, with 0 being uncorrelated and +/- 1 indicating perfect (linear) correlation.

We can plot this data in 2 dimensions to get a better picture:


As we can see, x & y are clearly correlated.

We can now generate a sample for z:

z = rnorm(50, mean = x+y)

Let’s take a look at the x-z and y-z plots:


Here x and z appear completely uncorrelated.


While y and z look to have a slight positive correlation. Is that all we can do?

We can also use built-in R functions to look at the point estimates for the slopes of a linear regression on these data sets:

LMxz = lm(z ~ x)


LMyz = lm(z ~ y)


As we suspected, the slope for x is close to 0, while for y it is larger and positive.

(Note that we have only looked at the point-estimates for the slopes without quantifying uncertainty. In general, it’s better to have an estimate for the uncertainty in model parameters, but we’ll leave that for another time.)

Is this the whole story? Let’s try the following:

LMxyz = lm(z ~ x + y)

x                      y
0.9391245     0.9768274

Here we can see that when the linear model includes both x & y, both slopes are positive and near 1. What happened to our other coefficients, -0.17 and 0.51?

Picture a three dimensional space, x-axis coming towards you, y-axis going left and right & z-axis going up and down.


Now remember, most of our data is actually scattered roughly around the line y = -x. Combine that with the plane z = x+y (roughly what we got from the multivariate linear model), and this is what it looks like:


The plane cuts through the origin, coming down, towards us and to the left, while going away, up and to the right. The line y =-x is indicated by a thin gray line. The intersection of the line and the plane occur at z=0 (!).

This is how the single predictor models don’t show an effect, while the multivariate does. The error in the data averages roughly around the plane as it cuts through the origin, so it is only after controlling for y that the effect of x can be determined (and vice versa).

The predictive power of an x-only or y-only model is not awful (though it is worse), so long as the new data still clusters around y=-x. However, the interpretation of the model is significantly clouded by the collinearity between x & y until the multivariate model is inspected.

Fun side note: the result we get nearly recovers the statistical process used to generate z ( z = rnorm(50, mean = x+y) ).


The py_emra package

I’m pleased to announce the py_emra package for Ensemble Modeling Robustness Analysis (EMRA). EMRA is a tool for modeling metabolic systems which I have worked on through my PhD.

The crux of EMRA is to use dynamic stability as a criteria for model selection and simulation in metabolic systems (e.g. bacterial metabolism). This depends on using what’s called a kinetic model of metabolism, one which uses a rate law to represent the amount of flux through any given reaction. Stability can be analyzed mathematically based on the specifics of these rate laws and how the reaction network is connected.

Understanding the stability of metabolic systems is important for redirecting them towards products of interest like biofuels. In the longer term, stability could be an interesting concept to explore in more complex cases like human diseases, particularly those with a metabolic component like diabetes or cancer.

As an exercise in Python coding and version control with Git, I decided to translate the existing EMRA code base from MATLAB into Python. Luckily, most of the features, including symbolic math and numerical integration, are available.

The Python package and readme is available on Github.

A paper about the package is available on biorXiv.

For more about EMRA see my Google Scholar page.

Distributed Computing: Motivation & File Systems

Why Distributed Computing?

Distributed computing is the practice of computing using a cluster of computers rather than a single one. Distributed computing comes with many advantages.

Work With Very Big Data

Work for today’s data scientists very often involves data sets which are too large to feasibly work with from one local computer. For simple installations of programming languages like R or Python, data larger than the available memory will trigger a memory error if attempted to load into memory.

There are ways to bypass these memory limits on a local system, by shuffling bits of a data set between memory and a single local hard disk (so-called incremental learning), but these methods are inconvenient, slow and not very scalable.

However, the availability of memory in a cluster of multiple computers is higher than a single commodity desktop, allowing for seamless use of massive data sets.

Analyze Data Much Faster

Not only is the amount of data that can be processed much higher, but the analysis can be done much more quickly, due to the availability of multiple processors across the cluster. This allows even computationally expensive tasks to be completed easily.

Convenience & Cost

In the past, high intensity calculations and large data sets were often handled by supercomputers. Supercomputers are highly specialized and expensive, compared to the cost and ease of creating or expanding a distributed cluster of commodity desktops.

Work on Massively Parallel Clusters of Commodity Computers

Distributed computing allows all of these issues to be resolved by using a coordinated group of normal, commodity desktops which communicate through a network. The different computers on a cluster need to be coordinated by a program known as a distributed computing framework. Hadoop MapReduce and Apache Spark are well-known examples.

Making the most efficient use of resources on a cluster is a significant problem, which these software projects have sought to solve.

Distributed File Systems

In this post, I’ll mention just one of the features that allows distributed computing to work, distributed file systems. A file system is a method for arranging data on a hard drive so that it’s clear where one file ends and the next one begins.

Local computers use well-developed file systems like FAT32 or NTFS to organize their hard drives.

Distributed clusters of computers use a distributed file system to keep track of the data stored on the cluster. A significant example is Hadoop Distributed File System (HDFS). HDFS is installed on all the computers across a cluster and works on top of each of the local file systems in a cluster.

HDFS uses network communication to make storage of very large files on a cluster convenient and fault-tolerant. It keeps at least 3 copies of each file on a cluster, in case one or more of the nodes (computers) in a cluster goes down.

Many distributed computing frameworks, including Hadoop MapReduce and Apache Spark work with HDFS.


Some of my favorite links

This is a post which I will update periodically with some of my favorite Quora (& Stackexchange) answers.

From Clay Ford’s blog:


Singular Value Decomposition:


Bayesian Updating, Part 2

In Part 1, I explored how Bayesian updating operates when there are two discrete possibilities. I now investigate how Bayesian updating operates with one continuous parameter.

This example is from Chapter 2 of ‘Statistical Rethinking’ by Richard McElreath. The premise can be paraphrased as follows:

Suppose you have a globe representing the Earth. It is small enough to hold. You are curious how much of the globe is covered in water. You will toss the globe in the air. When you catch it, you will record whether your right index finger is on land or water. Repeat and record your results.

The data story we use assumes that each toss is independent of the last and that our prior is a uniform distribution, according to the principle of indifference. In other words, all fractions of water, p, are considered equally likely. In reality, a globe that represents our earth must have some water and some land, based on our experience, so p can’t exactly equal 1 or 0. But for now, let us assume uniformity:


The chart here represents probability density, as opposed to the previous case which showed probability mass. Still, the total probability, represented as the area under the blue line, sums to one.

We make our first observation and it is water or W. This allows us to calculate a posterior distribution, but how?

Bayesian updating uses the following:

Posterior = Likelihood * Prior / Average Likelihood

In this case, the Posterior, Likelihood and Prior represent continuously varying quantities rather than discrete categories as in Part 1. I’ll touch on Average Likelihood later.

The Prior here is the uniform distribution, valued at 1. The Likelihood is the likelihood of observing W under any value of p. This is in fact the definition of p, so Likelihood is just p. If the observation were L for land, Likelihood would be (1-p), since there are only two possibilities.

Average likelihood is a single value which is used to rescale the posterior so that probabilities sum to one. It is the integral of the Likelihood*Prior over all possible values of p. In this case, it would be:

Integral(from p=0, to p=1)[p * 1]dp = 0.5

Thus, our posterior distribution after observing W is:

Posterior = p * 1 / 0.5 = 2p

This looks like the following:


Notice that the probability density exceeds 1. This is allowed, since it is the integral, not the value of the curve that must equal 1.

We’re not done yet! Let’s say that we spin the globe again but this time observe land, L. What do we do? With Bayesian updating, the posterior becomes the prior for continued observations.

Now, Prior is 2p, Likelihood is (1-p) and Average Likelihood is:

Integral(from p=0, to p=1)[(1-p) * 2p]dp = 0.33

Thus, our updated Posterior after observing W L is:

Posterior = (1-p) * 2p / 0.33 = 6p*(1-p)


This analysis can continue until, say, 9 observations are made:


In all, 6 W’s and 3 L’s are observed. Following previous trends, each W will add a factor of p and each L will add a factor of (1-p). The normalization factor can be calculated through integral and is 840.

Posterior = 840 * p^6 * (1-p)^3


The distribution has gotten much pointier and is looking very similar to Gaussian.

In principle, this method scales up to multiple parameters and much more complicated models. In practice, however, analytical solutions like we have here quickly become impractical, and numerical methods are used.

Grid approximation is one such method where each parameter is represented as a sufficient number (100-200 is usually enough) points rather than strictly continuous. However, with multiple parameters, the total size of the grid-space scales exponentially, so other solutions become important.

Bayesian Updating, Part 1

I’m reading the new statistics textbook ‘Statistical Rethinking’ by Richard McElreath. Chapter 2 is heavily about the topic of Bayesian updating. That is, producing a posterior distribution from a prior distribution and a new piece of data.

I’ll paraphrase one of the exercises here:

Suppose there are two species of panda bear. Both are equally common in the wild and live in the same places. They are exactly alike but differ in family size. Species A gives birth to twins 10% of the time, and one infant the remainder of the time. Species B has twins 20% of the time. Suppose a female panda of unknown species gives birth to twins. What is the probability her next birth will also be twins.

-Paraphrased from ‘Statistical Rethinking’ by Richard McElreath

Without explicit knowledge of Bayesian updating, I don’t immediately know how to proceed with this question. I have some vague notion that the probability should be around 15%, or possibly higher because the first birth was twins. But an exact numerical answer is out of reach.

Having read the chapter, such an exercise is straightforward. It’s a pure example of Bayesian updating.

The first step is to declare a data story, or method of framing the problem mathematically. We will assume that either panda variety is equally likely without further information. This is called the principle of indifference, in which possible outcomes are considered equally likely in the absence of data. This is our original prior distribution. We will also assume that our knowledge about panda variants is correct, without uncertainty, which is probably not a safe assumption in general, but is made explicit here as an example.


The first birth of twins is a data point, Twins1, that we can use to update our model. McElreath phrases Bayes Theorem as follows:

Posterior = Likelihood * Prior / Average Likelihood

Posterior is the distribution after including new data, Likelihood is the likelihood of observing the data under any of the real world states under consideration, Prior is the prior distribution, and Average Likelihood is the expected value of the data under the prior distribution.

We will first phrase Bayes as follows, for our case.

Pr(p|Twins1) = Pr(Twins1|p) * Pr(p) / Pr(Twins1)

Here, p is a slightly abstracted variable which includes likelihoods of both species A and species B. Twins1 is the datum of obeserving the first birth to be twins.

Pr(p|Twins1) is the posterior distribution, in other words, the distribution p given that we observe Twins1. This is what we want to calculate.

Pr(Twins1|p) is the likelihood of observing Twins1 under any of the known conditions, in other words, 10% likelihood of twins for species A, or 20% likelihood for species B.

Pr(p) is the prior distribution, which we decided based on the principle of indifference to be 50% species A, 50% species B.

Pr(Twins1) is the overal probability of observing Twins1, under the prior distribution. In other words:

Pr(Twins1) = Pr (Twins1|p) * Pr(p)

We can break this down into explicit terms for the different scenarios represented by p. SpA and SpB represent the probabilities of species A & B respectively.

Pr(Twins1) = Pr(Twins1|SpA) * Pr(SpA) + Pr (Twins1|SpB) * Pr(SpB)

Under our current prior.

Pr(Twins1) = 0.1 * 0.5 + 0.2 * 0.5 = 0.15

If we break down for the overall equation, we get two separate ones:

Pr(SpA|Twins1) = Pr(Twins1|SpA) * Pr(SpA) / Pr(Twins1)

Pr(SpB|Twins1) = Pr(Twins1|SpB) * Pr(SpB) / Pr(Twins1)

Our posterior distribution is thus:

Pr(SpA|Twins1) = 0.1 * 0.5 / 0.15 = 0.33

Pr(SpA|Twins1) = 0.2 * 0.5 / 0.15 = 0.67

Thus, the posterior distribution looks like this:


The probability of observing another set of twins, Twins2, is thus:

Pr(Twins2) = Pr(Twins2|SpA) * Pr(SpA) + Pr(Twins2|SpB) * Pr(SpB)

Pr(Twins2) = 0.1 * 0.33 + 0.2 * 0.67 = 0.167 = 16.7%

This is a relatively simple example. In other cases, we are not dealing with discrete categories (SpA & SpB) but a continuous parameter. For example, we may like to determine the actual likelihood of twin birth, call it q, without having known categories. I’ll deal with this case in an upcoming post.

For a more complete exploration of Bayes, see Arbital’s treatment, the LessWrong treatment or Wikipedia. Or even better, get ‘Statistical Rethinking’.

Learning Git

One of the biggest differences between writing academic code and writing production code is the emphasis on readability, maintainability and traceability. While academic code is often only needed once, and used by one or a few people with a low turnover rate, production code is used over and over again, must be adaptable to change and intelligible to anyone who needs to use it.

In my academic work, I wrote MATLAB code for myself, and maybe a few close collaborators. I didn’t worry about version control or documentation, any questions were easily dealt with, and my familiarity with the codebase made it easy for me to navigate. When a paper was published, I posted the complete code and brief instructions on Github using Git, but that was the extent of the documentation. In this case, my use of Git and Github is exclusively as a code storage facility and was not integrated with my development process.

Version control, and Git in particular, are much more powerful than this. I recently read through Atlassian’s Git Tutorial. It was one of the better online tutorials I’ve read, especially because of the excellent context provided throughout.

Just enough detail, just enough context.

The tutorial wasn’t just a rundown of git commands. Instead, it provides just enough history of version control and sample workflows that I feel I understand how to use Git in the context of an actual project, and why Git is usually preferred to centralized systems like SVN. It also mixes code snippets, but doesn’t overload on technical details I’m not going to remember for very long.

The tutorial makes technical points when necessary for comprehension and points out their most common uses in typical workflows. It makes the distinction between commits and the branches which point to them, for example, and how this leads to a difference between “git checkout <branch>” and “git checkout <commit>”. It also talks about the common practice of cleaning up the development history of a new branch into logical commits before merging with master. This is a practice which is no doubt critical to long-term project maintenance, but wouldn’t be obvious from a simple description of “git rebase -i”.

The tutorial doesn’t have enough information on its own to serve as a viable Git documentation, but then it doesn’t have to. The official Git documentation, and the excellent book Pro Git are always only a google search away. The importance of the tutorial is the ideas, structures and workflows it imparts to readers, who should then know what to google later when working on a project.

For a data scientist, Git and version control workflows are important skillsets. The development of models and data cleaning code should be traceable through Git, but knowledge of Git for application development is also key, as data scientists are inevitably involved in greater or lesser degree with that process as well.