This is an edition of Traversing TCGA, an ongoing project analyzing The Cancer Genome Atlas.
Update: I’ve added a GitHub repository containing the code for this project here! Check it out.
In the last post, I started extracting data from the XML files downloaded from TCGA. Now, I’ll begin to look for trends in the data. I start by identifying some potentially interesting XML labels. Here are a couple:
‘lab_procedure_blast_cell…’ is the result of a lab test that quantifies the percentage of blood cells that are abnormal cells called blasts. ‘days_to_death’ seems to be the number of days from the day the test was performed until the patient died. It varies widely, with some entries over 1000. Some entries are 0, indicating that the patient is still alive. The appropriate RegEx queries are then:
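As a rough sketch of what such queries could look like in Python: the function below pulls the numeric text out of every element whose tag name starts with a given label. The function name `extract_values`, the optional namespace prefix, and the sample XML string (including its made-up tag suffix and attribute) are assumptions for illustration, not the actual TCGA layout or the code from the repository.

```python
import re


def extract_values(xml_text, label_prefix):
    """Return the numeric text of every element whose tag name starts with label_prefix."""
    pattern = re.compile(
        # Opening tag: optional namespace, tag name (captured), optional attributes.
        r"<(?:\w+:)?(" + re.escape(label_prefix) + r"[\w.-]*)(?:\s[^>]*)?>"
        # Element text: an integer or decimal number (captured).
        r"\s*(-?\d+(?:\.\d+)?)\s*"
        # Closing tag must repeat the same tag name.
        r"</(?:\w+:)?\1>"
    )
    return [float(m.group(2)) for m in pattern.finditer(xml_text)]


# Hypothetical sample; real TCGA files are longer and the tag names may differ.
sample = (
    '<laml:lab_procedure_blast_cell_demo units="%">52</laml:lab_procedure_blast_cell_demo>'
    "<clin:days_to_death>365</clin:days_to_death>"
)

blast_pct = extract_values(sample, "lab_procedure_blast_cell")
days_to_death = extract_values(sample, "days_to_death")
```

Matching on a label prefix avoids having to spell out the full tag name, and empty or self-closing elements are skipped because the pattern requires a number between the tags.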
Now we can investigate how these quantities might be related. First, we ‘clean’ the data by eliminating entries for which either value equals zero, since a zero in ‘days_to_death’ marks a patient who is still alive rather than a survival time.
We can now perform a linear regression and draw a scatter plot.
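The cleaning step might be sketched like this, assuming the two lists are already aligned so that position i in each refers to the same patient record. The function name `drop_zero_pairs` and the numbers in the example are hypothetical.

```python
def drop_zero_pairs(blast_pct, days_to_death):
    """Keep only the records where both values are nonzero."""
    if len(blast_pct) != len(days_to_death):
        raise ValueError("both lists must have one entry per record")
    # Filter the pairs together so the two lists stay aligned.
    kept = [(b, d) for b, d in zip(blast_pct, days_to_death) if b != 0 and d != 0]
    return [b for b, _ in kept], [d for _, d in kept]


# Made-up values, for illustration only.
clean_blast, clean_days = drop_zero_pairs([10, 0, 30, 45], [200, 150, 0, 900])
```

Filtering the values as pairs matters here: dropping zeros from each list separately would shift the entries and pair one patient's blast percentage with another patient's survival time.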
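One way to do this, as a minimal sketch: an ordinary least-squares fit written in plain Python, plus a plotting helper that assumes matplotlib is installed. The names `fit_line` and `plot_fit`, the axis labels, and the output file name are all assumptions. The fit returns a t statistic for the slope; turning it into a p-value needs the t distribution with n - 2 degrees of freedom, which a library routine such as `scipy.stats.linregress` reports directly.

```python
import math


def fit_line(x, y):
    """Least-squares fit of y = slope * x + intercept.

    Returns (slope, intercept, r, t_stat), where t_stat tests slope == 0
    with len(x) - 2 degrees of freedom.
    """
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("need at least three paired points")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    if sxx == 0 or syy == 0:
        raise ValueError("x and y must each vary")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)
    residual = 1 - r * r
    # A perfect fit leaves no residual variance, so the statistic is unbounded.
    t_stat = r * math.sqrt((n - 2) / residual) if residual > 0 else math.copysign(math.inf, r)
    return slope, intercept, r, t_stat


def plot_fit(x, y, slope, intercept, path="blast_vs_days.png"):
    """Save a scatter plot with the fitted line drawn over it (requires matplotlib)."""
    import matplotlib
    matplotlib.use("Agg")  # render to a file without needing a display
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.scatter(x, y)
    ends = [min(x), max(x)]
    ax.plot(ends, [slope * v + intercept for v in ends])
    ax.set_xlabel("Blast cells (%)")
    ax.set_ylabel("Days to death")
    fig.savefig(path)
    plt.close(fig)
```

Calling `fit_line` on the cleaned blast percentages and days-to-death values, then passing the resulting slope and intercept to `plot_fit`, would produce the kind of fit and figure discussed next.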
The slope is slightly positive, but not statistically significant:
And the scatter plot looks like this:
The trend is not statistically significant, but this shows how we might analyze the data and look for other important trends.