Traversing TCGA: Storing Data in PostgreSQL

This is a post in the ongoing series, Traversing TCGA.

So far we’ve been able to download, extract, organize and analyze The Cancer Genome Atlas data in Python. However, storing data as an internal variable in Python isn’t ideal for all scenarios. Storing data over the long term in a database, such as PostgreSQL, is a better solution. PostgreSQL and other database software allows for stable, structured data storage which can be accessed from anywhere and offers a host of powerful tools for analysis.

PostgreSQL (or just Postgres) installers are available here. During installation, port (default=5432) and password are set.

After installation, the database can be accessed from the psql program. The prompt looks like this:


Using the \dt command lists all tables in the database.


SQL queries like CREATE TABLE, and DROP TABLE are used to make and delete tables.

Postgres databases can also be accessed in python using the psycopg2 library as follows:


The default dbname and username at install is “postgres”. The host “localhost” accesses databases on the same computer, but could be replaced by an IP address to access a database remotely. The getData function refers to code I wrote earlier which extracts patient data.

The following command can be used to create a table:


This creates a table named “cancer_data” with a column called “patient_id” of type text and a column called “gender” of type text. The table can then be populated with the following:


The table is now populated with entries from the patient data extracted from XML data. These can be displayed with the following commands:


Which results in the data to display:


Showing we have successfully populated the table.

Full code is on github.


Traversing TCGA: Trying to Find Trends in the Data

This is an edition of Traversing TCGA, an ongoing project analyzing The Cancer Genome Atlas.

Update: I’ve added a Github repository containing the code for this project here! Check it out.

In the last post, I started extracting data from the XML files downloaded from TCGA. Now, I’ll begin to find trends in the data. I start with identifying some potentially interesting XML labels in the data. Here are a couple:



‘lab_procedute_blast_cell…’ is the result of a lab test which quantifies the percentage of blood cells which are abnormal cells called blasts. ‘days_to_death’ is a number which seems to quantify the number of days from the day the test was performed until the patient died. This number varies significantly with some entries over 1000. Some of the entries are 0, indicating the patient is still alive. The appropriate RegEx queries are then:


Now, we can investigate how these quantities might be related. First, we can ‘clean’ the data, eliminating entries for which either value equals zero.


We can now perform a linear regression and scatter plot.


The slope is slightly positive, but not statistically significant:


And the scatter plot looks like this:


This trend is not necessarily strong enough to be significant, but this shows how we might analyze data and find other important trends.

Traversing TCGA: Making Sense of the Data Files

This post is part of an ongoing series, Traversing TCGA, in which I analyze data from The Cancer Genome Atlas using Python.

Once the download of the data is complete, we end up with a folder full of .xml files containing the clinical data.

XML Files

How do we go from relatively free-from data in hundreds of separate .xml files to a table-like object we can use for data analysis? This process is sometimes called data munging or data wrangling. In this case, a good first step would be to get a list of all the file names in the directory. In python, this is accomplished by the command os.listdir:


We can now open each individual file and extract the relevant information that we want. For example, with the following for statement:


Now we have a for-loop which iterates through every file in the directory. But what exactly do these files look like? After opening in Notepad, here is what it looks like:


Each line is a data entry which uses HTML-like tags to label the data. The first line visible here is for ‘bcr_patient_barcode’ which has the record ‘TCGA-AB-2802’, apparently a unique identifier for this patient. If we use an appropriate regular expressions and functions, we can extract this or any other bit of relevant data we’d like. Here I’ve defined regex to get the patient barcode and patient gender:



Now, we can run the program and see what we get!


Success! The zero entries in the table are because some of the .xml files do not have an entry for gender. However, based on my observations, each patient’s barcode has a gender in at least one .xml file, so we can determine the gender for each entry in a later step.

The main thing we wanted to do was enter the data into a single table, rather than scattered throughout hundreds of separate files, which we achieved. I will continue with further data collection, cleaning and analysis in a later post.