Distributed Computing: Motivation & File Systems

Why Distributed Computing?

Distributed computing is the practice of computing using a cluster of computers rather than a single one. Distributed computing comes with many advantages.

Work With Very Big Data

Work for today’s data scientists very often involves data sets which are too large to feasibly work with from one local computer. For simple installations of programming languages like R or Python, data larger than the available memory will trigger a memory error if attempted to load into memory.

There are ways to bypass these memory limits on a local system, by shuffling bits of a data set between memory and a single local hard disk (so-called incremental learning), but these methods are inconvenient, slow and not very scalable.

However, the availability of memory in a cluster of multiple computers is higher than a single commodity desktop, allowing for seamless use of massive data sets.

Analyze Data Much Faster

Not only is the amount of data that can be processed much higher, but the analysis can be done much more quickly, due to the availability of multiple processors across the cluster. This allows even computationally expensive tasks to be completed easily.

Convenience & Cost

In the past, high intensity calculations and large data sets were often handled by supercomputers. Supercomputers are highly specialized and expensive, compared to the cost and ease of creating or expanding a distributed cluster of commodity desktops.

Work on Massively Parallel Clusters of Commodity Computers

Distributed computing allows all of these issues to be resolved by using a coordinated group of normal, commodity desktops which communicate through a network. The different computers on a cluster need to be coordinated by a program known as a distributed computing framework. Hadoop MapReduce and Apache Spark are well-known examples.

Making the most efficient use of resources on a cluster is a significant problem, which these software projects have sought to solve.

Distributed File Systems

In this post, I’ll mention just one of the features that allows distributed computing to work, distributed file systems. A file system is a method for arranging data on a hard drive so that it’s clear where one file ends and the next one begins.

Local computers use well-developed file systems like FAT32 or NTFS to organize their hard drives.

Distributed clusters of computers use a distributed file system to keep track of the data stored on the cluster. A significant example is Hadoop Distributed File System (HDFS). HDFS is installed on all the computers across a cluster and works on top of each of the local file systems in a cluster.

HDFS uses network communication to make storage of very large files on a cluster convenient and fault-tolerant. It keeps at least 3 copies of each file on a cluster, in case one or more of the nodes (computers) in a cluster goes down.

Many distributed computing frameworks, including Hadoop MapReduce and Apache Spark work with HDFS.



Some of my favorite links

This is a post which I will update periodically with some of my favorite Quora (& Stackexchange) answers.

From Clay Ford’s blog:


Singular Value Decomposition:


Bayesian Updating, Part 2

In Part 1, I explored how Bayesian updating operates when there are two discrete possibilities. I now investigate how Bayesian updating operates with one continuous parameter.

This example is from Chapter 2 of ‘Statistical Rethinking’ by Richard McElreath. The premise can be paraphrased as follows:

Suppose you have a globe representing the Earth. It is small enough to hold. You are curious how much of the globe is covered in water. You will toss the globe in the air. When you catch it, you will record whether your right index finger is on land or water. Repeat and record your results.

The data story we use assumes that each toss is independent of the last and that our prior is a uniform distribution, according to the principle of indifference. In other words, all fractions of water, p, are considered equally likely. In reality, a globe that represents our earth must have some water and some land, based on our experience, so p can’t exactly equal 1 or 0. But for now, let us assume uniformity:


The chart here represents probability density, as opposed to the previous case which showed probability mass. Still, the total probability, represented as the area under the blue line, sums to one.

We make our first observation and it is water or W. This allows us to calculate a posterior distribution, but how?

Bayesian updating uses the following:

Posterior = Likelihood * Prior / Average Likelihood

In this case, the Posterior, Likelihood and Prior represent continuously varying quantities rather than discrete categories as in Part 1. I’ll touch on Average Likelihood later.

The Prior here is the uniform distribution, valued at 1. The Likelihood is the likelihood of observing W under any value of p. This is in fact the definition of p, so Likelihood is just p. If the observation were L for land, Likelihood would be (1-p), since there are only two possibilities.

Average likelihood is a single value which is used to rescale the posterior so that probabilities sum to one. It is the integral of the Likelihood*Prior over all possible values of p. In this case, it would be:

Integral(from p=0, to p=1)[p * 1]dp = 0.5

Thus, our posterior distribution after observing W is:

Posterior = p * 1 / 0.5 = 2p

This looks like the following:


Notice that the probability density exceeds 1. This is allowed, since it is the integral, not the value of the curve that must equal 1.

We’re not done yet! Let’s say that we spin the globe again but this time observe land, L. What do we do? With Bayesian updating, the posterior becomes the prior for continued observations.

Now, Prior is 2p, Likelihood is (1-p) and Average Likelihood is:

Integral(from p=0, to p=1)[(1-p) * 2p]dp = 0.33

Thus, our updated Posterior after observing W L is:

Posterior = (1-p) * 2p / 0.33 = 6p*(1-p)


This analysis can continue until, say, 9 observations are made:


In all, 6 W’s and 3 L’s are observed. Following previous trends, each W will add a factor of p and each L will add a factor of (1-p). The normalization factor can be calculated through integral and is 840.

Posterior = 840 * p^6 * (1-p)^3


The distribution has gotten much pointier and is looking very similar to Gaussian.

In principle, this method scales up to multiple parameters and much more complicated models. In practice, however, analytical solutions like we have here quickly become impractical, and numerical methods are used.

Grid approximation is one such method where each parameter is represented as a sufficient number (100-200 is usually enough) points rather than strictly continuous. However, with multiple parameters, the total size of the grid-space scales exponentially, so other solutions become important.

Bayesian Updating, Part 1

I’m reading the new statistics textbook ‘Statistical Rethinking’ by Richard McElreath. Chapter 2 is heavily about the topic of Bayesian updating. That is, producing a posterior distribution from a prior distribution and a new piece of data.

I’ll paraphrase one of the exercises here:

Suppose there are two species of panda bear. Both are equally common in the wild and live in the same places. They are exactly alike but differ in family size. Species A gives birth to twins 10% of the time, and one infant the remainder of the time. Species B has twins 20% of the time. Suppose a female panda of unknown species gives birth to twins. What is the probability her next birth will also be twins.

-Paraphrased from ‘Statistical Rethinking’ by Richard McElreath

Without explicit knowledge of Bayesian updating, I don’t immediately know how to proceed with this question. I have some vague notion that the probability should be around 15%, or possibly higher because the first birth was twins. But an exact numerical answer is out of reach.

Having read the chapter, such an exercise is straightforward. It’s a pure example of Bayesian updating.

The first step is to declare a data story, or method of framing the problem mathematically. We will assume that either panda variety is equally likely without further information. This is called the principle of indifference, in which possible outcomes are considered equally likely in the absence of data. This is our original prior distribution. We will also assume that our knowledge about panda variants is correct, without uncertainty, which is probably not a safe assumption in general, but is made explicit here as an example.


The first birth of twins is a data point, Twins1, that we can use to update our model. McElreath phrases Bayes Theorem as follows:

Posterior = Likelihood * Prior / Average Likelihood

Posterior is the distribution after including new data, Likelihood is the likelihood of observing the data under any of the real world states under consideration, Prior is the prior distribution, and Average Likelihood is the expected value of the data under the prior distribution.

We will first phrase Bayes as follows, for our case.

Pr(p|Twins1) = Pr(Twins1|p) * Pr(p) / Pr(Twins1)

Here, p is a slightly abstracted variable which includes likelihoods of both species A and species B. Twins1 is the datum of obeserving the first birth to be twins.

Pr(p|Twins1) is the posterior distribution, in other words, the distribution p given that we observe Twins1. This is what we want to calculate.

Pr(Twins1|p) is the likelihood of observing Twins1 under any of the known conditions, in other words, 10% likelihood of twins for species A, or 20% likelihood for species B.

Pr(p) is the prior distribution, which we decided based on the principle of indifference to be 50% species A, 50% species B.

Pr(Twins1) is the overal probability of observing Twins1, under the prior distribution. In other words:

Pr(Twins1) = Pr (Twins1|p) * Pr(p)

We can break this down into explicit terms for the different scenarios represented by p. SpA and SpB represent the probabilities of species A & B respectively.

Pr(Twins1) = Pr(Twins1|SpA) * Pr(SpA) + Pr (Twins1|SpB) * Pr(SpB)

Under our current prior.

Pr(Twins1) = 0.1 * 0.5 + 0.2 * 0.5 = 0.15

If we break down for the overall equation, we get two separate ones:

Pr(SpA|Twins1) = Pr(Twins1|SpA) * Pr(SpA) / Pr(Twins1)

Pr(SpB|Twins1) = Pr(Twins1|SpB) * Pr(SpB) / Pr(Twins1)

Our posterior distribution is thus:

Pr(SpA|Twins1) = 0.1 * 0.5 / 0.15 = 0.33

Pr(SpA|Twins1) = 0.2 * 0.5 / 0.15 = 0.67

Thus, the posterior distribution looks like this:


The probability of observing another set of twins, Twins2, is thus:

Pr(Twins2) = Pr(Twins2|SpA) * Pr(SpA) + Pr(Twins2|SpB) * Pr(SpB)

Pr(Twins2) = 0.1 * 0.33 + 0.2 * 0.67 = 0.167 = 16.7%

This is a relatively simple example. In other cases, we are not dealing with discrete categories (SpA & SpB) but a continuous parameter. For example, we may like to determine the actual likelihood of twin birth, call it q, without having known categories. I’ll deal with this case in an upcoming post.

For a more complete exploration of Bayes, see Arbital’s treatment, the LessWrong treatment or Wikipedia. Or even better, get ‘Statistical Rethinking’.

Learning Git

One of the biggest differences between writing academic code and writing production code is the emphasis on readability, maintainability and traceability. While academic code is often only needed once, and used by one or a few people with a low turnover rate, production code is used over and over again, must be adaptable to change and intelligible to anyone who needs to use it.

In my academic work, I wrote MATLAB code for myself, and maybe a few close collaborators. I didn’t worry about version control or documentation, any questions were easily dealt with, and my familiarity with the codebase made it easy for me to navigate. When a paper was published, I posted the complete code and brief instructions on Github using Git, but that was the extent of the documentation. In this case, my use of Git and Github is exclusively as a code storage facility and was not integrated with my development process.

Version control, and Git in particular, are much more powerful than this. I recently read through Atlassian’s Git Tutorial. It was one of the better online tutorials I’ve read, especially because of the excellent context provided throughout.

Just enough detail, just enough context.

The tutorial wasn’t just a rundown of git commands. Instead, it provides just enough history of version control and sample workflows that I feel I understand how to use Git in the context of an actual project, and why Git is usually preferred to centralized systems like SVN. It also mixes code snippets, but doesn’t overload on technical details I’m not going to remember for very long.

The tutorial makes technical points when necessary for comprehension and points out their most common uses in typical workflows. It makes the distinction between commits and the branches which point to them, for example, and how this leads to a difference between “git checkout <branch>” and “git checkout <commit>”. It also talks about the common practice of cleaning up the development history of a new branch into logical commits before merging with master. This is a practice which is no doubt critical to long-term project maintenance, but wouldn’t be obvious from a simple description of “git rebase -i”.

The tutorial doesn’t have enough information on its own to serve as a viable Git documentation, but then it doesn’t have to. The official Git documentation, and the excellent book Pro Git are always only a google search away. The importance of the tutorial is the ideas, structures and workflows it imparts to readers, who should then know what to google later when working on a project.

For a data scientist, Git and version control workflows are important skillsets. The development of models and data cleaning code should be traceable through Git, but knowledge of Git for application development is also key, as data scientists are inevitably involved in greater or lesser degree with that process as well.

Traversing TCGA: Implementing an Interactive Web Application

This is the 6th and (probably?) final post in my series Traversing TCGA.

So far, I have downloaded, extracted cleaned and analyzed data from The Cancer Genome Atlas for acute myeloid leukemia (AML) patients. I have used python to perform linear regressions of patient survival against various clinical data.

Now I have made it available as an interactive web application.


This application allows users to select which features to include in the model, then Submit. A predictive model with is generated with linear reagression and the predicted survival is plotted vs. actual. If the prediction is good, there is a positive correlation. If not, there is little or no correlation.

The web tool uses a PostgreSQL database to store data. In this case, the size of the data is small, and it likely won’t be updated regularly, so a relational database (RDB) like Postgres is not really necessary. The data could simply be stored as a .csv or .xml on the server instead. However, I decided to use Postgres to teach myself how it works and can interact with a web application.

The web page itself uses checkboxes to allow users to select which features to include in the model. Each checkbox has an attribute (‘value’) which uniquely corresponds to the name of a column in the database. Upon clicking submit, the user’s browser sends a list of all these values to the server. The server queries the database for the relevant feature columns and the target variable (survival). Then the server performs a linear regression fit and sends the predicted and actual survivals to the browser.

The communication between browser and server occurs via AJAX, handled by a jQuery.getJSON() request, and python Flask, listening on the server. The Flask app activates Python code to run–scikit learn for statistics and psycopg2 for querying the Postgres database–and generate the linear regression and send the data back to the browser, serialized via JSON. The user’s browser recieves the data and plots it using Plotly.

Please give the app a try. Comments/questions are very welcome.

Traversing TCGA: Building a Predictive Model

This is a post in the ongoing series Traversing TCGA.

So far, we’ve used data from The Cancer Genome Atlas to try to determine a correlation between cancer survival and one lab test result. However, we can make more use of the data, for example, by including many more attributes or ‘features’ in the predictive model. The data from TCGA does include many lab test results and other important patient data such as age and year diagnosed. We can use these to develop a predictive model for survival.

In previous posts, I’ve described how to extract data from the TCGA data file downloads. In my particular case, I’m using regex queries to extract relevant data from the XML files provided by TCGA. (Aside: it would probably be more efficient, computationally speaking, to use the xml package, but the dataset I’m working with is relatively small and it never takes more than 5 seconds to run on my ordinary desktop.) Here is a sample of the regex queries and category names:


These are then used to extract information from the XML document, which contains results in the form of numbers (integers) or containing the words ‘Negative’ or ‘Positive’ for mutation data, or strings describing the tumor morphology. These text strings are transduced with the following:


Now we can import all the data into python and clean it up!  One key decision is what kind of analysis to do. There are many entries for patients who have died, and many who have not. We can either make a numerical predictor for days until death, or a category predictor for DEATH or NO DEATH. I decided on first doing a numerical predictor. So what to do with the no-death instances? I decided to discard them to make this as clean a data problem as possible. (You could also decide to set the survival time to an arbitrary high value…I decided against that for now)

Interestingly, surviving patients are coded in the ‘days to death’ variable with a zero. Here is the function and function call I used to remove ‘zero’ values from the days to death variable and the corresponding instances from the feature variables:



The stats package from scipy provides a tidy function for bivariate linear regression. We can use it to find individual correlations between individual features and the outcome of interest:


We can now investigate which features were most predictive of survival, and the associated slopes and p-values:


It is clear that this is a very messy problem, statistically speaking, and we don’t get many (or any) clean predictors. Since we are using 24 different features, this is now a multiple hypothesis testing problem, and if we apply the classical MHT filters, such as Bonferroni, we end up with no statistically significant relations at the p<0.05/24 level.

However, in the absence of a crystal ball… these features are the best we’ve got. We can say probably, for instance, that ‘IDH R140’ p=0.08 is a better predictor than ‘Monocyte Percent’ p=0.83.

So what are we looking at? ‘Year Diagnosed’ and ‘Age Diagnosed’ are our strongest predictors. Year diagnosed corresponds to the calendar year in which the patient was diagnosed (i.e. 2005). It appears that for every year forward we go in time, the patient is predicted to survive about 38 less days. This is relatively unexpected, but is probably related to how we decided to clean the data. We got rid of all the surviving patients, so patients diagnosed in later years (i.e. 2011) that are included must have a relatively shorter survival time than those from earlier years. ‘Age Diagnosed’ indicates the patient’s age when diagnosed with AML (i.e. 45). Older patients are expected to survive for less time. A 10 year increase in age is associated with a 45 day decrease in expected survival.

So what does the overall predictive model look like? Well, it is a little tough to visualize the exact effect of each predictor, so instead, for each instance (patient), we can plot the predicted survival vs. actual. For a perfect predictor, every point would lie on the line y=x, along with the r-squared value for the overall model:


Rsquare Init

Right now, this is not a great predictor, but maybe we can get rid of some outliers. The five points to the right of x=1000 appear to be outliers. What happens if we get rid of them? We can use the following code:


Now how does our model look?



Not too shabby! We need to be conscious of overfitting, since we the number of features (24) is a relatively large fraction of the number of instances (~100) but this is a promising start!

As usual, all code is in the github.

Traversing TCGA: Storing Data in PostgreSQL

This is a post in the ongoing series, Traversing TCGA.

So far we’ve been able to download, extract, organize and analyze The Cancer Genome Atlas data in Python. However, storing data as an internal variable in Python isn’t ideal for all scenarios. Storing data over the long term in a database, such as PostgreSQL, is a better solution. PostgreSQL and other database software allows for stable, structured data storage which can be accessed from anywhere and offers a host of powerful tools for analysis.

PostgreSQL (or just Postgres) installers are available here. During installation, port (default=5432) and password are set.

After installation, the database can be accessed from the psql program. The prompt looks like this:


Using the \dt command lists all tables in the database.


SQL queries like CREATE TABLE, and DROP TABLE are used to make and delete tables.

Postgres databases can also be accessed in python using the psycopg2 library as follows:


The default dbname and username at install is “postgres”. The host “localhost” accesses databases on the same computer, but could be replaced by an IP address to access a database remotely. The getData function refers to code I wrote earlier which extracts patient data.

The following command can be used to create a table:


This creates a table named “cancer_data” with a column called “patient_id” of type text and a column called “gender” of type text. The table can then be populated with the following:


The table is now populated with entries from the patient data extracted from XML data. These can be displayed with the following commands:


Which results in the data to display:


Showing we have successfully populated the table.

Full code is on github.

Traversing TCGA: Trying to Find Trends in the Data

This is an edition of Traversing TCGA, an ongoing project analyzing The Cancer Genome Atlas.

Update: I’ve added a Github repository containing the code for this project here! Check it out.

In the last post, I started extracting data from the XML files downloaded from TCGA. Now, I’ll begin to find trends in the data. I start with identifying some potentially interesting XML labels in the data. Here are a couple:



‘lab_procedute_blast_cell…’ is the result of a lab test which quantifies the percentage of blood cells which are abnormal cells called blasts. ‘days_to_death’ is a number which seems to quantify the number of days from the day the test was performed until the patient died. This number varies significantly with some entries over 1000. Some of the entries are 0, indicating the patient is still alive. The appropriate RegEx queries are then:


Now, we can investigate how these quantities might be related. First, we can ‘clean’ the data, eliminating entries for which either value equals zero.


We can now perform a linear regression and scatter plot.


The slope is slightly positive, but not statistically significant:


And the scatter plot looks like this:


This trend is not necessarily strong enough to be significant, but this shows how we might analyze data and find other important trends.

Traversing TCGA: Making Sense of the Data Files

This post is part of an ongoing series, Traversing TCGA, in which I analyze data from The Cancer Genome Atlas using Python.

Once the download of the data is complete, we end up with a folder full of .xml files containing the clinical data.

XML Files

How do we go from relatively free-from data in hundreds of separate .xml files to a table-like object we can use for data analysis? This process is sometimes called data munging or data wrangling. In this case, a good first step would be to get a list of all the file names in the directory. In python, this is accomplished by the command os.listdir:


We can now open each individual file and extract the relevant information that we want. For example, with the following for statement:


Now we have a for-loop which iterates through every file in the directory. But what exactly do these files look like? After opening in Notepad, here is what it looks like:


Each line is a data entry which uses HTML-like tags to label the data. The first line visible here is for ‘bcr_patient_barcode’ which has the record ‘TCGA-AB-2802’, apparently a unique identifier for this patient. If we use an appropriate regular expressions and functions, we can extract this or any other bit of relevant data we’d like. Here I’ve defined regex to get the patient barcode and patient gender:



Now, we can run the program and see what we get!


Success! The zero entries in the table are because some of the .xml files do not have an entry for gender. However, based on my observations, each patient’s barcode has a gender in at least one .xml file, so we can determine the gender for each entry in a later step.

The main thing we wanted to do was enter the data into a single table, rather than scattered throughout hundreds of separate files, which we achieved. I will continue with further data collection, cleaning and analysis in a later post.