This is a post in the ongoing series, Traversing TCGA.
So far we’ve been able to download, extract, organize and analyze The Cancer Genome Atlas data in Python. However, storing data as an internal variable in Python isn’t ideal for all scenarios. Storing data over the long term in a database, such as PostgreSQL, is a better solution. PostgreSQL and other database software allows for stable, structured data storage which can be accessed from anywhere and offers a host of powerful tools for analysis.
PostgreSQL (or just Postgres) installers are available here. During installation, port (default=5432) and password are set.
After installation, the database can be accessed from the psql program. The prompt looks like this:
Using the \dt command lists all tables in the database.
SQL queries like CREATE TABLE, and DROP TABLE are used to make and delete tables.
Postgres databases can also be accessed in python using the psycopg2 library as follows:
The default dbname and username at install is “postgres”. The host “localhost” accesses databases on the same computer, but could be replaced by an IP address to access a database remotely. The getData function refers to code I wrote earlier which extracts patient data.
The following command can be used to create a table:
This creates a table named “cancer_data” with a column called “patient_id” of type text and a column called “gender” of type text. The table can then be populated with the following:
The table is now populated with entries from the patient data extracted from XML data. These can be displayed with the following commands:
Which results in the data to display:
Showing we have successfully populated the table.
Full code is on github.