Traversing TCGA: Downloading the Data

The Cancer Genome Atlas (TCGA) is an amazing source of data about cancer. It contains clinical observations, gene panel data and genome sequence data for thousands of cancer patients.

I have started a series called Traversing TCGA, which I will update periodically with my insights about the data contained in TCGA. First, I chose to investigate the acute myeloid leukemia (LAML) data set. Acute myeloid leukemia is a cancer of the myeloid progenitor cells of the bone marrow, the cells that normally give rise to blood cells. It is characterized by uncontrolled proliferation of mutated cells.

I have accessed the LAML portion of TCGA, pictured below:


and have decided to download the clinical XML data (selected), as well as the BI SNP data (column 3). The clinical XML data should include information about each patient, including age, treatments undertaken, and responses to treatment. Analyzing this information alone may prove interesting, if only to compare the different treatments and their associated patient outcomes. However, the inclusion of an SNP (single nucleotide polymorphism) panel for each patient provides yet another way to analyze the data. For example, I can try to determine whether certain combinations of SNP and treatment are more likely than others to result in positive outcomes.
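To get a feel for how a clinical XML record might be turned into something analyzable, here is a minimal sketch that flattens the leaf elements of an XML document into a dictionary. I have not inspected the real files yet, so the tag names and namespace in the sample are placeholders I made up, not the TCGA schema, and the helper names are my own.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip an '{namespace}' prefix from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]


def flatten_leaf_elements(root):
    """Map each leaf element's local tag name to its stripped text.

    Leaves with no text are skipped; if a tag repeats, the first value wins.
    """
    record = {}
    for elem in root.iter():
        if len(elem) == 0 and elem.text and elem.text.strip():
            record.setdefault(local_name(elem.tag), elem.text.strip())
    return record


# Hypothetical miniature document; tag names are placeholders, not the TCGA schema.
sample = """<patient xmlns:c="http://example.org/clinical">
  <c:age>61</c:age>
  <c:treatment><c:name>drug_x</c:name></c:treatment>
  <c:response>complete</c:response>
</patient>"""

record = flatten_leaf_elements(ET.fromstring(sample))
print(record)
```

Keeping only the first value for a repeated tag is a simplification; a patient with several treatments would need a list per tag instead.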

After selecting the desired columns, I downloaded the data as a 160 MB .tar file:
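Unpacking an archive like this from Python could look roughly like the sketch below, which pulls only the .xml members out of a tar file using the standard tarfile module. The function name and the paths in the usage comment are hypothetical, and I am assuming the clinical files carry an .xml extension.

```python
import tarfile
from pathlib import Path


def extract_xml_members(archive_path, dest_dir):
    """Extract only the .xml files from a tar archive and return their paths."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    extracted = []
    # "r:*" lets tarfile detect the compression (none, gzip, bz2, xz) itself.
    with tarfile.open(archive_path, "r:*") as tar:
        for member in tar.getmembers():
            if not (member.isfile() and member.name.lower().endswith(".xml")):
                continue
            # Write under the base file name only, so nothing lands outside dest.
            target = dest / Path(member.name).name
            with tar.extractfile(member) as src, open(target, "wb") as out:
                out.write(src.read())
            extracted.append(target)
    return sorted(extracted)


# Hypothetical usage (the file and directory names are placeholders):
# xml_files = extract_xml_members("laml_download.tar", "laml_clinical_xml")
```

Flattening every member to its base name keeps the sketch short, but two files with the same name in different folders would overwrite each other.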


I plan to analyze the data in Python, possibly using the pandas library, which is designed for working with tabular data sets. I will update the blog as I proceed…
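As a rough idea of where this might go, here is a small pandas sketch of the SNP-and-treatment comparison: it groups patients by genotype and treatment and reports the fraction who responded. The column names are my own assumptions about how the data could be laid out, and the rows are invented placeholders, not values from TCGA.

```python
import pandas as pd


def response_rate_by_group(df, group_cols, outcome_col="responded"):
    """Return patient count and responder fraction for each combination of group_cols."""
    grouped = df.groupby(group_cols)[outcome_col]
    summary = grouped.agg(n_patients="count", response_rate="mean")
    return summary.reset_index()


# Invented placeholder rows for illustration only; these are NOT TCGA values.
toy = pd.DataFrame(
    {
        "snp_genotype": ["AA", "AA", "AG", "AG", "AG"],
        "treatment": ["drug_x", "drug_x", "drug_x", "drug_y", "drug_y"],
        "responded": [True, False, True, False, False],
    }
)

print(response_rate_by_group(toy, ["snp_genotype", "treatment"]))
```

With real data, small groups would make these rates very noisy, so the patient count column matters as much as the rate itself.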