This post is part of an ongoing series, Traversing TCGA, in which I analyze data from The Cancer Genome Atlas using Python.
Once the download of the data is complete, we end up with a folder full of .xml files containing the clinical data.
How do we go from relatively free-from data in hundreds of separate .xml files to a table-like object we can use for data analysis? This process is sometimes called data munging or data wrangling. In this case, a good first step would be to get a list of all the file names in the directory. In python, this is accomplished by the command os.listdir:
We can now open each individual file and extract the relevant information that we want. For example, with the following for statement:
Now we have a for-loop which iterates through every file in the directory. But what exactly do these files look like? After opening in Notepad, here is what it looks like:
Each line is a data entry which uses HTML-like tags to label the data. The first line visible here is for ‘bcr_patient_barcode’ which has the record ‘TCGA-AB-2802’, apparently a unique identifier for this patient. If we use an appropriate regular expressions and functions, we can extract this or any other bit of relevant data we’d like. Here I’ve defined regex to get the patient barcode and patient gender:
Now, we can run the program and see what we get!
Success! The zero entries in the table are because some of the .xml files do not have an entry for gender. However, based on my observations, each patient’s barcode has a gender in at least one .xml file, so we can determine the gender for each entry in a later step.
The main thing we wanted to do was enter the data into a single table, rather than scattered throughout hundreds of separate files, which we achieved. I will continue with further data collection, cleaning and analysis in a later post.