Traversing TCGA: Making Sense of the Data Files

This post is part of an ongoing series, Traversing TCGA, in which I analyze data from The Cancer Genome Atlas using Python.

Once the download of the data is complete, we end up with a folder full of .xml files containing the clinical data.

XML Files

How do we go from relatively free-form data in hundreds of separate .xml files to a table-like object we can use for data analysis? This process is sometimes called data munging or data wrangling. In this case, a good first step is to get a list of all the file names in the directory. In Python, this is done with the function os.listdir:
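The original listing did not survive as text, so here is a minimal sketch of the os.listdir step. The folder name clinical_xml and the helper list_xml_files are placeholders of mine, not names from the original program.

```python
import os

# Hypothetical folder name: point this at wherever the clinical .xml files live.
data_dir = "clinical_xml"


def list_xml_files(directory):
    """Return the sorted names of all .xml files found directly in `directory`."""
    return sorted(
        name for name in os.listdir(directory) if name.lower().endswith(".xml")
    )


# Only run the listing if the placeholder folder actually exists.
if os.path.isdir(data_dir):
    file_names = list_xml_files(data_dir)
    print(len(file_names), "XML files found")
```

Filtering on the extension keeps any stray non-XML files in the folder out of the later steps.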


We can now open each individual file and extract the relevant information that we want. For example, with the following for statement:
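A loop of this kind might look like the sketch below. The helper name read_files, the folder name and the choice of UTF-8 encoding are my assumptions; the original code may have differed.

```python
import os


def read_files(directory):
    """Yield (file_name, text) for every .xml file in `directory`."""
    for name in sorted(os.listdir(directory)):
        if not name.lower().endswith(".xml"):
            continue  # skip anything that is not a clinical XML file
        path = os.path.join(directory, name)
        # errors="replace" keeps the loop going if a file has odd bytes in it.
        with open(path, encoding="utf-8", errors="replace") as handle:
            yield name, handle.read()


# Hypothetical folder name; the loop body is where extraction would happen.
if os.path.isdir("clinical_xml"):
    for name, text in read_files("clinical_xml"):
        print(name, len(text), "characters")
```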


Now we have a for loop that iterates through every file in the directory. But what exactly do these files look like? Here is one of them opened in Notepad:


Each line is a data entry which uses HTML-like tags to label the data. The first line visible here is for ‘bcr_patient_barcode’, which has the record ‘TCGA-AB-2802’, apparently a unique identifier for this patient. If we use appropriate regular expressions and functions, we can extract this or any other bit of relevant data we’d like. Here I’ve defined regular expressions to get the patient barcode and patient gender:
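As a rough sketch of what such regular expressions could look like: the tag name bcr_patient_barcode comes from the file shown above, but the tag name gender, the optional namespace prefix and the sample line below are my assumptions, so check them against a real file before relying on this.

```python
import re


def tag_pattern(tag):
    """Build a regex capturing the text between <tag ...> and </tag>.

    The optional (?:\\w+:)? part allows for a namespace prefix such as
    <shared:gender ...>, which is an assumption about the file layout.
    """
    name = re.escape(tag)
    return re.compile(
        r"<(?:\w+:)?" + name + r"\b[^>]*>([^<]*)</(?:\w+:)?" + name + r">"
    )


BARCODE_RE = tag_pattern("bcr_patient_barcode")
GENDER_RE = tag_pattern("gender")  # assumed tag name


def extract_field(pattern, text):
    """Return the first matching value, stripped, or None if absent or empty."""
    match = pattern.search(text)
    if match is None:
        return None
    value = match.group(1).strip()
    return value or None


# Made-up sample line for illustration only, not copied from a TCGA file.
sample = (
    '<shared:bcr_patient_barcode procurement_status="Completed">'
    "TCGA-AB-2802</shared:bcr_patient_barcode>"
)
print(extract_field(BARCODE_RE, sample))  # prints TCGA-AB-2802
```

Returning None for a missing or empty tag makes it easy to spot the files that have no gender entry.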



Now, we can run the program and see what we get!


Success! The zero entries in the table are because some of the .xml files do not have an entry for gender. However, based on my observations, each patient’s barcode has a gender in at least one .xml file, so we can determine the gender for each entry in a later step.
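One possible way to do that later step, sketched in plain Python: it assumes each table row is a dict with 'barcode' and 'gender' keys (my naming) and treats 0, None and the empty string as missing.

```python
def fill_missing_gender(rows):
    """Return new rows in which a missing gender is copied from another
    row that shares the same barcode, when such a row exists."""
    known = {}
    for row in rows:
        if row["gender"]:
            # Remember the first gender seen for each barcode.
            known.setdefault(row["barcode"], row["gender"])
    return [
        {**row, "gender": row["gender"] or known.get(row["barcode"])}
        for row in rows
    ]


# Made-up rows for illustration; 0 stands for a missing gender entry.
table = [
    {"barcode": "TCGA-AB-2802", "gender": 0},
    {"barcode": "TCGA-AB-2802", "gender": "MALE"},
]
for row in fill_missing_gender(table):
    print(row["barcode"], row["gender"])
```

Barcodes with no gender in any file end up as None rather than a guessed value.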

The main goal was to gather the data into a single table rather than leave it scattered across hundreds of separate files, and we have achieved that. I will continue with further data collection, cleaning and analysis in a later post.


Traversing TCGA: Downloading the Data

The Cancer Genome Atlas (TCGA) is an amazing source of data about cancer. It contains clinical observations, gene panel data and genome sequence data for thousands of cancer patients.

I have started a series called Traversing TCGA, which I will update periodically with my insights about the data contained in TCGA. First, I chose to investigate the acute myeloid leukemia (LAML) data set. Acute myeloid leukemia is a cancer of the myeloid (blood cell progenitor) cells of the bone marrow, marked by uncontrolled proliferation of mutated cells.

I have accessed the LAML portion of TCGA, pictured below:


and have decided to download the clinical XML data (selected), as well as the BI SNP data, column 3. The clinical XML data should include information about the patient, including age, treatments undertaken, and responses to treatment. Analysis of this information alone may prove interesting, if only for evaluating the different treatments, and the associated patient outcomes. However, the inclusion of an SNP (single nucleotide polymorphism) panel for each patient provides yet another way to analyze the data. For example, I can attempt to determine if certain combinations of SNP and treatment are more likely to result in positive outcomes than others.

After selecting the desired columns, I downloaded the data as a 160 MB .tar file:
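To get from the .tar file to a folder of .xml files in Python, something like the sketch below should work. The archive name laml_download.tar, the output folder and the helper extract_xml are placeholders of mine; the real download will have its own name and internal layout.

```python
import os
import tarfile


def extract_xml(tar_path, out_dir):
    """Copy every .xml file in the archive into `out_dir`, flattening folders.

    Returns the list of paths written. Files that share a base name would
    overwrite each other, which this sketch does not guard against.
    """
    os.makedirs(out_dir, exist_ok=True)
    written = []
    with tarfile.open(tar_path) as archive:
        for member in archive:
            if not member.isfile() or not member.name.lower().endswith(".xml"):
                continue
            # Using only the base name avoids writing outside out_dir.
            target = os.path.join(out_dir, os.path.basename(member.name))
            with archive.extractfile(member) as source, open(target, "wb") as dest:
                dest.write(source.read())
            written.append(target)
    return written


# Hypothetical archive name; only runs if that file is present.
if os.path.exists("laml_download.tar"):
    paths = extract_xml("laml_download.tar", "clinical_xml")
    print(len(paths), "XML files extracted")
```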


I plan to analyze the data in Python, possibly using the Pandas library, which is optimized to work with larger data sets. I will update the blog as I proceed…