Traversing TCGA: Trying to Find Trends in the Data

This is an edition of Traversing TCGA, an ongoing project analyzing The Cancer Genome Atlas.

Update: I’ve added a GitHub repository containing the code for this project here! Check it out.

In the last post, I started extracting data from the XML files downloaded from TCGA. Now I’ll begin looking for trends in the data. I start by identifying some potentially interesting XML labels. Here are a couple:

Blastlabel

Daytodeathlabel

‘lab_procedure_blast_cell…’ is the result of a lab test that measures the percentage of blood cells that are abnormal cells called blasts. ‘days_to_death’ seems to be the number of days from the day the test was performed until the patient died. It varies widely, with some entries over 1000. Some entries are 0, indicating that the patient is still alive. The appropriate regex queries are then:

Regex
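For readers who can’t see the image, here is a minimal sketch of what such a regex query might look like. It assumes Python and is not necessarily the code used in this project. The helper name `extract_tag_values`, the namespace prefixes, and the sample XML are all made up for illustration, and `blast_percentage_placeholder` is a stand-in for the full blast label, which is abbreviated above.

```python
import re


def extract_tag_values(xml_text, tag_name):
    """Return the text inside every element whose local name is tag_name.

    Hypothetical helper: it allows an optional namespace prefix and optional
    attributes, and captures whatever sits between the opening and closing tags.
    """
    name = re.escape(tag_name)
    pattern = re.compile(
        r"<(?:[\w.-]+:)?" + name + r"(?:\s[^>]*)?>"  # opening tag
        r"([^<]*)"                                    # captured value
        r"</(?:[\w.-]+:)?" + name + r">"              # closing tag
    )
    return [value.strip() for value in pattern.findall(xml_text)]


# Made-up sample in the general shape of a clinical XML file.
sample = """
<patient>
  <clin:days_to_death units="days">425</clin:days_to_death>
  <lab:blast_percentage_placeholder>34</lab:blast_percentage_placeholder>
</patient>
"""

days = extract_tag_values(sample, "days_to_death")
blasts = extract_tag_values(sample, "blast_percentage_placeholder")
print(days, blasts)
```

A regex like this is fine for pulling a couple of flat values out of many files quickly. For nested or irregular XML, a real parser such as the standard library’s `xml.etree.ElementTree` would be more robust.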

Now, we can investigate how these quantities might be related. First, we can ‘clean’ the data, eliminating entries for which either value equals zero.

Datacleaning
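The cleaning step could be sketched roughly like this, again assuming Python. The function name `clean_pairs` and the toy values are hypothetical. Skipping entries that are empty or non-numeric is an extra assumption on my part; the step described above only removes zeros.

```python
def clean_pairs(blast_values, death_values):
    """Pair up the two lists and keep only the usable, non-zero entries."""
    cleaned = []
    for blast_raw, death_raw in zip(blast_values, death_values):
        try:
            blast = float(blast_raw)
            death = float(death_raw)
        except (TypeError, ValueError):
            # Assumption: empty or non-numeric entries are dropped as well.
            continue
        if blast == 0 or death == 0:
            # Drop the pair if either value is zero, as described in the post.
            continue
        cleaned.append((blast, death))
    return cleaned


# Toy values, not real TCGA data.
pairs = clean_pairs(["34", "0", "12", ""], ["425", "100", "0", "50"])
print(pairs)
```

Only the first pair survives here: the second and third each contain a zero, and the fourth has an empty blast value.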

We can now run a linear regression and draw a scatter plot.

Plotcode
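As a rough illustration of the regression step, here is an ordinary least-squares fit written in plain Python. This is a sketch under my own assumptions, not necessarily the code in the image: the function name `fit_line` and the toy numbers are invented, and it stops at the t statistic for the slope. Turning that into a p-value needs a t-distribution, which a library routine such as SciPy’s `scipy.stats.linregress` handles, and the scatter plot itself would need a plotting library such as Matplotlib.

```python
import math


def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept.

    Returns (slope, intercept, r, t_stat), where r is the Pearson correlation
    and t_stat is the t statistic for the slope with n - 2 degrees of freedom.
    """
    n = len(xs)
    if n < 3 or n != len(ys):
        raise ValueError("need at least three (x, y) pairs of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        raise ValueError("x and y must each vary")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)
    if abs(r) >= 1:
        # Perfectly linear data: the t statistic is unbounded.
        t_stat = math.copysign(math.inf, r)
    else:
        t_stat = r * math.sqrt((n - 2) / (1 - r * r))
    return slope, intercept, r, t_stat


# Toy numbers, not real TCGA data: x = blast percentage, y = days to death.
slope, intercept, r, t_stat = fit_line([1.0, 2.0, 3.0], [1.0, 2.0, 4.0])
print(slope, intercept, r, t_stat)
```

A t statistic close to zero corresponds to a slope that cannot be distinguished from a flat line.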

The slope is slightly positive, but not statistically significant:

Pvalnslope

And the scatter plot looks like this:

Plot

The trend here is too weak to be statistically significant, but the exercise shows how we might analyze the data to find other, more important trends.
