
Jupyter Notebook Explorations

I’ve recently been getting to know the Jupyter Notebook environment. It’s a very convenient development environment for many languages, and is best known as the evolution of the IPython Notebook.

Jupyter works by starting a server on your own computer to execute python (or another language) code. The notebook is rendered in a web browser and consists of cells of code. Cells can be run individually, which sends the code to the Jupyter server on your computer. The results (text, plots or error messages) are sent back to the browser and displayed directly below the code cell. Additionally, the Python instance on the server stores the variables created, so they can be used again in another cell.
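
For example, here is a minimal sketch of two hypothetical cells; because the kernel keeps the session state, the second cell can reuse a variable defined in the first:

# Cell 1: load numpy, generate some data and compute a summary statistic
import numpy as np
data = np.random.randn(1000)
mean_value = data.mean()

# Cell 2 (run separately): the kernel still remembers mean_value
print(f"Mean of the sample: {mean_value:.3f}")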

Alternatively, you can access a remote server which is set up for Jupyter Notebook use, such as at try.jupyter.org.

The benefits of Jupyter over, say, Sublime plus a command line are convenience and presentation style. Because of its cell architecture, the Jupyter Notebook makes it easy to run one small piece of code at a time. This makes debugging much faster and more streamlined, as it’s easy to check intermediate variable values, types and so on. Further, it’s very easy to share your work, for example on nbviewer.jupyter.org, and it looks good too, with all plots rendered inline on the page.

On the other hand, Jupyter Notebook is not really in the same category as ‘fully featured’ Python IDEs like PyDev or PyCharm. For one thing, the output is not a .py file but an .ipynb file, which is actually a text file holding the code and output as JSON. Developing Python packages or applications is thus not really possible or expected in Jupyter; its main purpose and expected use is demonstrating existing packages and presenting data analysis.

Check out my Jupyter Notebooks:


Statistical Diversions: Collinear Predictors

Collinear predictors present a challenge in model construction and interpretation. This topic is covered in an intuitive and engaging style in Chapter 5 of the excellent book Statistical Rethinking, by Richard McElreath.

Collinear predictors are predictor variables that correlate strongly with one another. To see why this can be a problem, consider the following example.

Let’s say we have:

Independent variables: x,  y

Dependent variable: z

We can use R to generate sample data sets in the following way:

x = rnorm(50)
y = rnorm(50,mean = -x,sd = 1)

This generates a set of normally distributed samples, x, with mean 0 and sd 1. It then generates a sample of y, using -x as the mean for each sample. This produces correlated variables:

cor(x,y)
[1] -0.7490608

Correlation varies from -1 to +1, with 0 being uncorrelated and +/- 1 indicating perfect (linear) correlation.

We can plot this data in 2 dimensions to get a better picture:

[Figure: scatter plot of x vs y]

As we can see, x & y are clearly correlated.

We can now generate a sample for z:

z = rnorm(50, mean = x+y)

Let’s take a look at the x-z and y-z plots:

[Figure: scatter plot of x vs z]

Here x and z appear completely uncorrelated.

[Figure: scatter plot of y vs z]

Meanwhile, y and z look to have a slight positive correlation. Is that all we can do?

We can also use built-in R functions to look at the point estimates for the slopes of a linear regression on these data sets:

LMxz = lm(z ~ x)
LMxz$coefficients[2]

x
-0.1678911

LMyz = lm(z ~ y)
LMyz$coefficients[2]

y
0.5118612

As we suspected, the slope for x is close to 0, while for y it is larger and positive.

(Note that we have only looked at the point-estimates for the slopes without quantifying uncertainty. In general, it’s better to have an estimate for the uncertainty in model parameters, but we’ll leave that for another time.)

Is this the whole story? Let’s try the following:

LMxyz = lm(z ~ x + y)
LMxyz$coefficients[2:3]

x                      y
0.9391245     0.9768274

Here we can see that when the linear model includes both x & y, both slopes are positive and near 1. What happened to our other coefficients, -0.17 and 0.51?

Picture a three-dimensional space, with the x-axis coming towards you, the y-axis going left and right, and the z-axis going up and down.

[Figure: empty 3D axes for x, y and z]

Now remember, most of our data is actually scattered roughly around the line y = -x. Combine that with the plane z = x+y (roughly what we got from the multivariate linear model), and this is what it looks like:

[Figure: the plane z = x + y and the line y = -x in 3D]

The plane cuts through the origin, coming down, towards us and to the left, while going away, up and to the right. The line y = -x is indicated by a thin gray line. The intersection of the line and the plane occurs at z = 0: substituting y = -x into z = x + y gives z = x - x = 0 (!).

This is why the single-predictor models don’t show an effect while the multivariate model does. The error in the data averages out roughly around the plane as it cuts through the origin, so it is only after controlling for y that the effect of x can be determined (and vice versa).

The predictive power of an x-only or y-only model is not awful (though it is worse), so long as the new data still clusters around y=-x. However, the interpretation of the model is significantly clouded by the collinearity between x & y until the multivariate model is inspected.

Fun side note: the result we get nearly recovers the statistical process used to generate z ( z = rnorm(50, mean = x+y) ).


The py_emra package

I’m pleased to announce the py_emra package for Ensemble Modeling Robustness Analysis (EMRA). EMRA is a tool for modeling metabolic systems that I have worked on throughout my PhD.

The crux of EMRA is to use dynamic stability as a criterion for model selection and simulation in metabolic systems (e.g. bacterial metabolism). This depends on using what’s called a kinetic model of metabolism, one which uses a rate law to represent the amount of flux through any given reaction. Stability can be analyzed mathematically based on the specifics of these rate laws and how the reaction network is connected.
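
As a rough illustration of the idea (this is a sketch with a made-up toy system, not the py_emra code), local stability can be checked from the sign of the eigenvalues of the Jacobian at a steady state:

import numpy as np

# Toy two-metabolite system: dA/dt = k1 - k2*A, dB/dt = k2*A - k3*B, with made-up rate constants
k1, k2, k3 = 1.0, 0.5, 0.3

def f(x):
    a, b = x
    return np.array([k1 - k2 * a,        # production of A minus its consumption
                     k2 * a - k3 * b])   # A converted to B, B drained away

def jacobian(x):
    # Partial derivatives of f with respect to (A, B); constant here because f is linear
    return np.array([[-k2, 0.0],
                     [ k2, -k3]])

x_ss = np.array([k1 / k2, k1 / k3])      # steady state, where f(x_ss) = 0
assert np.allclose(f(x_ss), 0.0)

eigenvalues = np.linalg.eigvals(jacobian(x_ss))
print("Stable" if np.all(eigenvalues.real < 0) else "Unstable", eigenvalues)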

Understanding the stability of metabolic systems is important for redirecting them towards products of interest like biofuels. In the longer term, stability could be an interesting concept to explore in more complex cases like human diseases, particularly those with a metabolic component like diabetes or cancer.

As an exercise in Python coding and version control with Git, I decided to translate the existing EMRA code base from MATLAB into Python. Luckily, most of the features I needed, including symbolic math and numerical integration, are available in Python.

The Python package and README are available on GitHub.

A paper about the package is available on bioRxiv.

For more about EMRA see my Google Scholar page.

Distributed Computing: Motivation & File Systems

Why Distributed Computing?

Distributed computing is the practice of computing using a cluster of computers rather than a single one. Distributed computing comes with many advantages.

Work With Very Big Data

Work for today’s data scientists very often involves data sets which are too large to feasibly work with on one local computer. For simple installations of programming languages like R or Python, data larger than the available memory will trigger a memory error when you attempt to load it.

There are ways to bypass these memory limits on a local system, by shuffling bits of a data set between memory and a single local hard disk (so-called incremental learning), but these methods are inconvenient, slow and not very scalable.
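
As a rough sketch of that kind of approach (the file and column names here are hypothetical), a large CSV can be processed in chunks with pandas so that only one piece is ever in memory:

import pandas as pd

total = 0.0
count = 0
# Read a (hypothetical) large CSV 100,000 rows at a time instead of all at once
for chunk in pd.read_csv("big_dataset.csv", chunksize=100_000):
    total += chunk["value"].sum()   # 'value' is a made-up column name
    count += len(chunk)

print("Mean of 'value':", total / count)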

However, the memory available in a cluster of multiple computers is much higher than in a single commodity desktop, allowing for seamless use of massive data sets.

Analyze Data Much Faster

Not only is the amount of data that can be processed much higher, but the analysis can be done much more quickly, due to the availability of multiple processors across the cluster. This allows even computationally expensive tasks to be completed easily.

Convenience & Cost

In the past, high-intensity calculations and large data sets were often handled by supercomputers. Supercomputers are highly specialized and expensive compared to the cost and ease of creating or expanding a distributed cluster of commodity desktops.

Work on Massively Parallel Clusters of Commodity Computers

Distributed computing allows all of these issues to be resolved by using a coordinated group of normal, commodity desktops which communicate through a network. The different computers on a cluster need to be coordinated by a program known as a distributed computing framework. Hadoop MapReduce and Apache Spark are well-known examples.

Making the most efficient use of resources on a cluster is a significant problem, which these software projects have sought to solve.

Distributed File Systems

In this post, I’ll mention just one of the features that allow distributed computing to work: distributed file systems. A file system is a method for arranging data on a hard drive so that it’s clear where one file ends and the next one begins.

Local computers use well-developed file systems like FAT32 or NTFS to organize their hard drives.

Distributed clusters of computers use a distributed file system to keep track of the data stored on the cluster. A significant example is the Hadoop Distributed File System (HDFS). HDFS is installed on all the computers across a cluster and works on top of each computer’s local file system.

HDFS uses network communication to make storage of very large files on a cluster convenient and fault-tolerant. By default, it keeps three copies of each block of a file, spread across the cluster, in case one or more of the nodes (computers) goes down.

Many distributed computing frameworks, including Hadoop MapReduce and Apache Spark, work with HDFS.


Some of my favorite links

This is a post which I will update periodically with some of my favorite Quora (& Stackexchange) answers.

From Clay Ford’s blog:

HackerNews:

Singular Value Decomposition:

Other:

Bayesian Updating, Part 2

In Part 1, I explored how Bayesian updating operates when there are two discrete possibilities. I now investigate how Bayesian updating operates with one continuous parameter.

This example is from Chapter 2 of ‘Statistical Rethinking’ by Richard McElreath. The premise can be paraphrased as follows:

Suppose you have a globe representing the Earth. It is small enough to hold. You are curious how much of the globe is covered in water. You will toss the globe in the air. When you catch it, you will record whether your right index finger is on land or water. Repeat and record your results.

The data story we use assumes that each toss is independent of the last and that our prior is a uniform distribution, according to the principle of indifference. In other words, all fractions of water, p, are considered equally likely. In reality, a globe that represents our earth must have some water and some land, based on our experience, so p can’t exactly equal 1 or 0. But for now, let us assume uniformity:

[Figure: uniform prior density over p]

The chart here represents probability density, as opposed to the previous case which showed probability mass. Still, the total probability, represented as the area under the blue line, sums to one.

We make our first observation and it is water or W. This allows us to calculate a posterior distribution, but how?

Bayesian updating uses the following:

Posterior = Likelihood * Prior / Average Likelihood

In this case, the Posterior, Likelihood and Prior represent continuously varying quantities rather than discrete categories as in Part 1. I’ll touch on Average Likelihood later.

The Prior here is the uniform distribution, valued at 1. The Likelihood is the likelihood of observing W under any value of p. This is in fact the definition of p, so Likelihood is just p. If the observation were L for land, Likelihood would be (1-p), since there are only two possibilities.

Average likelihood is a single value which is used to rescale the posterior so that probabilities sum to one. It is the integral of the Likelihood*Prior over all possible values of p. In this case, it would be:

Integral(from p=0, to p=1)[p * 1]dp = 0.5

Thus, our posterior distribution after observing W is:

Posterior = p * 1 / 0.5 = 2p

This looks like the following:

[Figure: posterior density after observing W, proportional to p]

Notice that the probability density exceeds 1. This is allowed, since it is the integral, not the value of the curve that must equal 1.

We’re not done yet! Let’s say that we toss the globe again but this time observe land, L. What do we do? With Bayesian updating, the posterior becomes the prior for continued observations.

Now, Prior is 2p, Likelihood is (1-p) and Average Likelihood is:

Integral(from p=0, to p=1)[(1-p) * 2p]dp = 0.33

Thus, our updated Posterior after observing W L is:

Posterior = (1-p) * 2p / 0.33 = 6p*(1-p)

[Figure: posterior density after observing W L, proportional to p(1-p)]

This analysis can continue until, say, 9 observations are made:

W L W W W L W L W

In all, 6 W’s and 3 L’s are observed. Following the previous pattern, each W adds a factor of p and each L adds a factor of (1-p). The normalization factor can be calculated through integration and is 840.

Posterior = 840 * p^6 * (1-p)^3

[Figure: posterior density after 9 observations]

The distribution has gotten much pointier and is looking very similar to a Gaussian.

In principle, this method scales up to multiple parameters and much more complicated models. In practice, however, analytical solutions like we have here quickly become impractical, and numerical methods are used.

Grid approximation is one such method, where each parameter is represented by a sufficient number of points (100-200 is usually enough) rather than treated as strictly continuous. However, with multiple parameters, the total size of the grid space scales exponentially, so other solutions become important.
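
To make that concrete, here is a minimal sketch of grid approximation for the globe example above, written in Python (the book itself works in R, but the idea is the same):

import numpy as np

# Grid approximation for the globe example: 6 W and 3 L with a uniform prior
p_grid = np.linspace(0, 1, 200)                       # 200 candidate values of p
prior = np.ones_like(p_grid)                          # uniform prior
likelihood = p_grid**6 * (1 - p_grid)**3              # 6 waters and 3 lands observed
unstd_posterior = likelihood * prior
posterior = unstd_posterior / unstd_posterior.sum()   # normalize so it sums to 1

print(p_grid[np.argmax(posterior)])                   # peaks near p = 6/9 ≈ 0.67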

Bayesian Updating, Part 1

I’m reading the new statistics textbook ‘Statistical Rethinking’ by Richard McElreath. Chapter 2 is heavily about the topic of Bayesian updating. That is, producing a posterior distribution from a prior distribution and a new piece of data.

I’ll paraphrase one of the exercises here:

Suppose there are two species of panda bear. Both are equally common in the wild and live in the same places. They are exactly alike but differ in family size. Species A gives birth to twins 10% of the time, and one infant the remainder of the time. Species B has twins 20% of the time. Suppose a female panda of unknown species gives birth to twins. What is the probability her next birth will also be twins?

-Paraphrased from ‘Statistical Rethinking’ by Richard McElreath

Without explicit knowledge of Bayesian updating, I don’t immediately know how to proceed with this question. I have some vague notion that the probability should be around 15%, or possibly higher because the first birth was twins. But an exact numerical answer is out of reach.

Having read the chapter, such an exercise is straightforward. It’s a pure example of Bayesian updating.

The first step is to declare a data story, or method of framing the problem mathematically. We will assume that either panda variety is equally likely without further information. This is called the principle of indifference, in which possible outcomes are considered equally likely in the absence of data. This is our original prior distribution. We will also assume that our knowledge about panda variants is correct, without uncertainty, which is probably not a safe assumption in general, but is made explicit here as an example.

[Figure: prior distribution, 50% species A and 50% species B]

The first birth of twins is a data point, Twins1, that we can use to update our model. McElreath phrases Bayes Theorem as follows:

Posterior = Likelihood * Prior / Average Likelihood

Posterior is the distribution after including new data, Likelihood is the likelihood of observing the data under any of the real world states under consideration, Prior is the prior distribution, and Average Likelihood is the expected value of the data under the prior distribution.

We will first phrase Bayes as follows, for our case.

Pr(p|Twins1) = Pr(Twins1|p) * Pr(p) / Pr(Twins1)

Here, p is a slightly abstracted variable which includes likelihoods of both species A and species B. Twins1 is the datum of observing the first birth to be twins.

Pr(p|Twins1) is the posterior distribution, in other words, the distribution p given that we observe Twins1. This is what we want to calculate.

Pr(Twins1|p) is the likelihood of observing Twins1 under any of the known conditions, in other words, 10% likelihood of twins for species A, or 20% likelihood for species B.

Pr(p) is the prior distribution, which we decided based on the principle of indifference to be 50% species A, 50% species B.

Pr(Twins1) is the overall probability of observing Twins1, under the prior distribution. In other words:

Pr(Twins1) = Pr(Twins1|p) * Pr(p), averaged over the possible states of p

We can break this down into explicit terms for the different scenarios represented by p. SpA and SpB represent the probabilities of species A & B respectively.

Pr(Twins1) = Pr(Twins1|SpA) * Pr(SpA) + Pr(Twins1|SpB) * Pr(SpB)

Under our current prior:

Pr(Twins1) = 0.1 * 0.5 + 0.2 * 0.5 = 0.15

If we break down the overall equation, we get two separate ones:

Pr(SpA|Twins1) = Pr(Twins1|SpA) * Pr(SpA) / Pr(Twins1)

Pr(SpB|Twins1) = Pr(Twins1|SpB) * Pr(SpB) / Pr(Twins1)

Our posterior distribution is thus:

Pr(SpA|Twins1) = 0.1 * 0.5 / 0.15 = 0.33

Pr(SpB|Twins1) = 0.2 * 0.5 / 0.15 = 0.67

Thus, the posterior distribution looks like this:

[Figure: posterior distribution, 33% species A and 67% species B]

The probability of observing another set of twins, Twins2, is thus:

Pr(Twins2) = Pr(Twins2|SpA) * Pr(SpA) + Pr(Twins2|SpB) * Pr(SpB)

Pr(Twins2) = 0.1 * 0.33 + 0.2 * 0.67 = 0.167 = 16.7%
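
The same arithmetic can be checked with a few lines of code (sketched here in Python):

# Quick numerical check of the updating arithmetic above
prior = {"SpA": 0.5, "SpB": 0.5}
p_twins = {"SpA": 0.1, "SpB": 0.2}

# Average likelihood of the first twin birth under the prior
pr_twins1 = sum(p_twins[s] * prior[s] for s in prior)

# Posterior after observing the first set of twins
posterior = {s: p_twins[s] * prior[s] / pr_twins1 for s in prior}

# Probability that the next birth is also twins, under the posterior
pr_twins2 = sum(p_twins[s] * posterior[s] for s in posterior)
print(posterior, pr_twins2)   # roughly {'SpA': 0.33, 'SpB': 0.67} and 0.167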

This is a relatively simple example. In other cases, we are not dealing with discrete categories (SpA & SpB) but a continuous parameter. For example, we may like to determine the actual likelihood of twin birth, call it q, without having known categories. I’ll deal with this case in an upcoming post.

For a more complete exploration of Bayes, see Arbital’s treatment, the LessWrong treatment or Wikipedia. Or even better, get ‘Statistical Rethinking’.

Learning Git

One of the biggest differences between writing academic code and writing production code is the emphasis on readability, maintainability and traceability. While academic code is often only needed once, and used by one or a few people with a low turnover rate, production code is used over and over again, must be adaptable to change and intelligible to anyone who needs to use it.

In my academic work, I wrote MATLAB code for myself, and maybe a few close collaborators. I didn’t worry about version control or documentation; any questions were easily dealt with, and my familiarity with the codebase made it easy for me to navigate. When a paper was published, I posted the complete code and brief instructions on GitHub using Git, but that was the extent of the documentation. In this case, my use of Git and GitHub was exclusively as a code storage facility and was not integrated with my development process.

Version control, and Git in particular, is much more powerful than this. I recently read through Atlassian’s Git Tutorial. It was one of the better online tutorials I’ve read, especially because of the excellent context provided throughout.

[Screenshot from the Atlassian Git Tutorial]
Just enough detail, just enough context.

The tutorial isn’t just a rundown of git commands. Instead, it provides just enough history of version control and enough sample workflows that I feel I understand how to use Git in the context of an actual project, and why Git is usually preferred to centralized systems like SVN. It also mixes in code snippets, but doesn’t overload on technical details I’m not going to remember for very long.

The tutorial makes technical points when necessary for comprehension and points out their most common uses in typical workflows. It explains the distinction between commits and the branches which point to them, for example, and how this leads to the difference between “git checkout <branch>” and “git checkout <commit>”. It also covers the common practice of cleaning up the development history of a new branch into logical commits before merging with master. This practice is no doubt critical to long-term project maintenance, but it wouldn’t be obvious from a bare description of “git rebase -i”.

The tutorial doesn’t have enough information on its own to serve as viable Git documentation, but then it doesn’t have to. The official Git documentation, and the excellent book Pro Git, are always only a Google search away. The value of the tutorial is the ideas, structures and workflows it imparts to readers, who should then know what to google later when working on a project.

For a data scientist, Git and version control workflows are important skills. The development of models and data cleaning code should be traceable through Git, and knowledge of Git for application development is also key, as data scientists are inevitably involved, to a greater or lesser degree, in that process as well.

Traversing TCGA: Implementing an Interactive Web Application

This is the 6th and (probably?) final post in my series Traversing TCGA.

So far, I have downloaded, extracted, cleaned and analyzed data from The Cancer Genome Atlas for acute myeloid leukemia (AML) patients. I have used Python to perform linear regressions of patient survival against various clinical data.

Now I have made it available as an interactive web application.

[Screenshot: the interactive web application]

This application allows users to select which features to include in the model, then click Submit. A predictive model is generated with linear regression and the predicted survival is plotted against the actual survival. If the prediction is good, there is a positive correlation. If not, there is little or no correlation.

The web tool uses a PostgreSQL database to store data. In this case, the data set is small and likely won’t be updated regularly, so a relational database (RDB) like Postgres is not really necessary; the data could simply be stored as a .csv or .xml file on the server instead. However, I decided to use Postgres to teach myself how it works and how it can interact with a web application.

The web page itself uses checkboxes to allow users to select which features to include in the model. Each checkbox has an attribute (‘value’) which uniquely corresponds to the name of a column in the database. Upon clicking submit, the user’s browser sends a list of all these values to the server. The server queries the database for the relevant feature columns and the target variable (survival). Then the server performs a linear regression fit and sends the predicted and actual survivals to the browser.

The communication between browser and server occurs via AJAX, handled by a jQuery.getJSON() request, with Python Flask listening on the server. The Flask app runs Python code (scikit-learn for statistics and psycopg2 for querying the Postgres database) to generate the linear regression, then sends the data back to the browser, serialized as JSON. The user’s browser receives the data and plots it using Plotly.
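
For illustration, a stripped-down version of such an endpoint might look like the sketch below; the route, table, column names and database credentials are all hypothetical, and a real app should validate the requested columns against a whitelist before building the query:

from flask import Flask, request, jsonify
import psycopg2
from sklearn.linear_model import LinearRegression

app = Flask(__name__)

@app.route("/fit")
def fit():
    # Feature columns chosen by the user, e.g. /fit?features=age_diagnosed,year_diagnosed
    features = request.args.get("features", "").split(",")
    cols = ", ".join(features)   # hypothetical column names; validate these in real code

    conn = psycopg2.connect(dbname="tcga", user="postgres")   # hypothetical credentials
    cur = conn.cursor()
    cur.execute(f"SELECT {cols}, days_to_death FROM patients")   # hypothetical table
    rows = cur.fetchall()
    conn.close()

    X = [[float(v) for v in row[:-1]] for row in rows]
    y = [float(row[-1]) for row in rows]

    model = LinearRegression().fit(X, y)
    predicted = [float(v) for v in model.predict(X)]

    # Send actual and predicted survival back to the browser for plotting
    return jsonify({"actual": y, "predicted": predicted})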

Please give the app a try. Comments/questions are very welcome.

Traversing TCGA: Building a Predictive Model

This is a post in the ongoing series Traversing TCGA.

So far, we’ve used data from The Cancer Genome Atlas to try to determine a correlation between cancer survival and one lab test result. However, we can make more use of the data, for example, by including many more attributes or ‘features’ in the predictive model. The data from TCGA does include many lab test results and other important patient data such as age and year diagnosed. We can use these to develop a predictive model for survival.

In previous posts, I’ve described how to extract data from the TCGA data file downloads. In my particular case, I’m using regex queries to extract relevant data from the XML files provided by TCGA. (Aside: it would probably be more efficient, computationally speaking, to use the xml package, but the dataset I’m working with is relatively small and it never takes more than 5 seconds to run on my ordinary desktop.) Here is a sample of the regex queries and category names:

[Figure: sample regex queries and category names]

These are then used to extract information from the XML document. The results come as numbers (integers), as the words ‘Negative’ or ‘Positive’ for mutation data, or as strings describing the tumor morphology. The text strings are converted to numerical values with the following:

[Figure: code for converting text strings to numerical values]
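
Putting the two steps together, here is a hedged sketch of what this kind of extraction and conversion could look like; the XML snippet, tag names, patterns and mapping below are made up for illustration and are not the actual TCGA field names:

import re

# Hypothetical fragment of a TCGA-style clinical XML file
xml_text = ("<clin:days_to_death>385</clin:days_to_death>"
            "<clin:flt3_mutation>Positive</clin:flt3_mutation>")

# Regex query and category name pairs (patterns made up for this example)
queries = {
    "Days to Death": r"days_to_death>([^<]+)<",
    "FLT3 Mutation": r"flt3_mutation>([^<]+)<",
}

# Map categorical strings onto numbers so they can be used in a regression
string_to_number = {"Negative": 0, "Positive": 1}

record = {}
for name, pattern in queries.items():
    match = re.search(pattern, xml_text)
    value = match.group(1) if match else None
    if value in string_to_number:
        value = string_to_number[value]
    elif value is not None:
        value = int(value)
    record[name] = value

print(record)   # {'Days to Death': 385, 'FLT3 Mutation': 1}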

Now we can import all the data into Python and clean it up! One key decision is what kind of analysis to do. There are many entries for patients who have died, and many who have not. We can either make a numerical predictor for days until death, or a category predictor for DEATH or NO DEATH. I decided to first build a numerical predictor. So what to do with the no-death instances? I decided to discard them to make this as clean a data problem as possible. (You could also set the survival time to an arbitrarily high value; I decided against that for now.)

Interestingly, surviving patients are coded in the ‘days to death’ variable with a zero. Here is the function and function call I used to remove ‘zero’ values from the days to death variable and the corresponding instances from the feature variables:

[Figures: the cleaning function and the corresponding function call]
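
Since the original function is shown only as an image above, here is a hedged sketch of what such a cleaning step could look like (the variable names and data are illustrative):

import numpy as np

def remove_zero_survival(features, days_to_death):
    # Drop patients coded with 0 days to death (i.e. still alive) from both arrays
    mask = days_to_death != 0
    return features[mask], days_to_death[mask]

# Made-up example: 4 patients, 2 of them still alive (coded as 0)
features = np.array([[61, 2005], [45, 2011], [70, 2008], [52, 2010]])
days_to_death = np.array([385, 0, 120, 0])
features, days_to_death = remove_zero_survival(features, days_to_death)
print(days_to_death)   # [385 120]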

The stats package from scipy provides a tidy function for bivariate linear regression. We can use it to find individual correlations between individual features and the outcome of interest:

[Figure: code for individual linear regressions with scipy.stats]
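
For reference, a loop over features with scipy.stats.linregress might look roughly like this (the arrays and feature names below are made-up stand-ins, not the TCGA data):

import numpy as np
from scipy import stats

# Stand-ins for the real data: X is (patients x features), y is days to death
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = rng.normal(size=100)
feature_names = ["Age Diagnosed", "Year Diagnosed", "Monocyte Percent"]

for name, column in zip(feature_names, X.T):
    result = stats.linregress(column, y)
    print(f"{name}: slope = {result.slope:.3f}, p = {result.pvalue:.3f}")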

We can now investigate which features were most predictive of survival, and the associated slopes and p-values:

[Figure: slopes and p-values for each feature]

It is clear that this is a very messy problem, statistically speaking, and we don’t get many (or any) clean predictors. Since we are using 24 different features, this is now a multiple hypothesis testing problem, and if we apply the classical MHT corrections, such as Bonferroni, we end up with no statistically significant relations at the p < 0.05/24 ≈ 0.002 level.

However, in the absence of a crystal ball… these features are the best we’ve got. We can probably say, for instance, that ‘IDH R140’ (p=0.08) is a better predictor than ‘Monocyte Percent’ (p=0.83).

So what are we looking at? ‘Year Diagnosed’ and ‘Age Diagnosed’ are our strongest predictors. Year diagnosed corresponds to the calendar year in which the patient was diagnosed (e.g. 2005). It appears that for every year forward we go in time, the patient is predicted to survive about 38 fewer days. This is relatively unexpected, but is probably related to how we decided to clean the data. We got rid of all the surviving patients, so patients diagnosed in later years (e.g. 2011) that are included must have a relatively shorter survival time than those from earlier years. ‘Age Diagnosed’ indicates the patient’s age when diagnosed with AML (e.g. 45). Older patients are expected to survive for less time: a 10 year increase in age is associated with a 45 day decrease in expected survival.

So what does the overall predictive model look like? It is a little tough to visualize the exact effect of each predictor, so instead, for each instance (patient), we can plot the predicted survival vs. actual, along with the r-squared value for the overall model. For a perfect predictor, every point would lie on the line y=x:

[Figures: predicted vs. actual survival for the initial model, and its r-squared value]

Right now, this is not a great predictor, but maybe we can remove some outliers. The five points to the right of x=1000 appear to be outliers. What happens if we drop them? We can use the following code:

[Figure: code for removing outliers]
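
Since that code is also shown as an image, here is a hedged sketch of the idea, assuming the x-axis of the plot is actual survival (the arrays are made up):

import numpy as np

def remove_outliers(X, y, threshold=1000):
    # Keep only the instances whose actual survival is at or below the threshold (in days)
    mask = y <= threshold
    return X[mask], y[mask]

# Made-up example arrays
X = np.random.randn(5, 3)
y = np.array([120, 2400, 310, 95, 1500])
X_clean, y_clean = remove_outliers(X, y)
print(len(y), "->", len(y_clean))   # 5 -> 3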

Now how does our model look?

[Figures: predicted vs. actual survival after removing outliers, and the improved r-squared value]

Not too shabby! We need to be conscious of overfitting, since the number of features (24) is a relatively large fraction of the number of instances (~100), but this is a promising start!

As usual, all the code is on GitHub.