Traversing TCGA: Building a Predictive Model

This is a post in the ongoing series Traversing TCGA.

So far, we’ve used data from The Cancer Genome Atlas to try to determine a correlation between cancer survival and one lab test result. However, we can make more use of the data, for example, by including many more attributes or ‘features’ in the predictive model. The data from TCGA does include many lab test results and other important patient data such as age and year diagnosed. We can use these to develop a predictive model for survival.

In previous posts, I’ve described how to extract data from the TCGA data file downloads. In my particular case, I’m using regex queries to extract relevant data from the XML files provided by TCGA. (Aside: it would probably be more efficient, computationally speaking, to use the xml package, but the dataset I’m working with is relatively small and it never takes more than 5 seconds to run on my ordinary desktop.) Here is a sample of the regex queries and category names:


These are then used to extract information from the XML document, which contains results in the form of numbers (integers) or containing the words ‘Negative’ or ‘Positive’ for mutation data, or strings describing the tumor morphology. These text strings are transduced with the following:


Now we can import all the data into python and clean it up!  One key decision is what kind of analysis to do. There are many entries for patients who have died, and many who have not. We can either make a numerical predictor for days until death, or a category predictor for DEATH or NO DEATH. I decided on first doing a numerical predictor. So what to do with the no-death instances? I decided to discard them to make this as clean a data problem as possible. (You could also decide to set the survival time to an arbitrary high value…I decided against that for now)

Interestingly, surviving patients are coded in the ‘days to death’ variable with a zero. Here is the function and function call I used to remove ‘zero’ values from the days to death variable and the corresponding instances from the feature variables:



The stats package from scipy provides a tidy function for bivariate linear regression. We can use it to find individual correlations between individual features and the outcome of interest:


We can now investigate which features were most predictive of survival, and the associated slopes and p-values:


It is clear that this is a very messy problem, statistically speaking, and we don’t get many (or any) clean predictors. Since we are using 24 different features, this is now a multiple hypothesis testing problem, and if we apply the classical MHT filters, such as Bonferroni, we end up with no statistically significant relations at the p<0.05/24 level.

However, in the absence of a crystal ball… these features are the best we’ve got. We can say probably, for instance, that ‘IDH R140’ p=0.08 is a better predictor than ‘Monocyte Percent’ p=0.83.

So what are we looking at? ‘Year Diagnosed’ and ‘Age Diagnosed’ are our strongest predictors. Year diagnosed corresponds to the calendar year in which the patient was diagnosed (i.e. 2005). It appears that for every year forward we go in time, the patient is predicted to survive about 38 less days. This is relatively unexpected, but is probably related to how we decided to clean the data. We got rid of all the surviving patients, so patients diagnosed in later years (i.e. 2011) that are included must have a relatively shorter survival time than those from earlier years. ‘Age Diagnosed’ indicates the patient’s age when diagnosed with AML (i.e. 45). Older patients are expected to survive for less time. A 10 year increase in age is associated with a 45 day decrease in expected survival.

So what does the overall predictive model look like? Well, it is a little tough to visualize the exact effect of each predictor, so instead, for each instance (patient), we can plot the predicted survival vs. actual. For a perfect predictor, every point would lie on the line y=x, along with the r-squared value for the overall model:


Rsquare Init

Right now, this is not a great predictor, but maybe we can get rid of some outliers. The five points to the right of x=1000 appear to be outliers. What happens if we get rid of them? We can use the following code:


Now how does our model look?



Not too shabby! We need to be conscious of overfitting, since we the number of features (24) is a relatively large fraction of the number of instances (~100) but this is a promising start!

As usual, all code is in the github.


Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s