The py_emra package

I’m pleased to announce the py_emra package for Ensemble Modeling Robustness Analysis (EMRA). EMRA is a tool for modeling metabolic systems that I have worked on throughout my PhD.

The crux of EMRA is to use dynamic stability as a criterion for model selection and simulation in metabolic systems (e.g. bacterial metabolism). This depends on using what’s called a kinetic model of metabolism, one which uses a rate law to represent the flux through any given reaction. Stability can be analyzed mathematically based on the specifics of these rate laws and how the reaction network is connected.
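
To make that concrete, here is a minimal sketch of the idea (a toy illustration, not the py_emra API): write down the mass balances for a small pathway, take the Jacobian, and check the signs of its eigenvalues at a steady state.

```python
import sympy as sp

# Toy two-metabolite pathway, purely for illustration:
#   -> x1 -> x2 ->   with rate laws v1 = k1, v2 = k2*x1, v3 = k3*x2
x1, x2, k1, k2, k3 = sp.symbols('x1 x2 k1 k2 k3', positive=True)

# Mass balances: dx/dt = (flux in) - (flux out)
f = sp.Matrix([k1 - k2 * x1,        # dx1/dt
               k2 * x1 - k3 * x2])  # dx2/dt

# Steady state and the Jacobian evaluated there
steady_state = sp.solve(list(f), [x1, x2], dict=True)[0]  # {x1: k1/k2, x2: k1/k3}
J = f.jacobian([x1, x2]).subs(steady_state)

# The steady state is locally stable if every eigenvalue of the
# Jacobian has a negative real part.
print(J.eigenvals())  # {-k2: 1, -k3: 1} -> stable for positive rate constants
```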

Understanding the stability of metabolic systems is important for redirecting them towards products of interest like biofuels. In the longer term, stability could be an interesting concept to explore in more complex cases like human diseases, particularly those with a metabolic component like diabetes or cancer.

As an exercise in Python coding and version control with Git, I decided to translate the existing EMRA code base from MATLAB into Python. Luckily, most of the features, including symbolic math and numerical integration, are available.

The Python package and its readme are available on GitHub.

A paper about the package is available on bioRxiv.

For more about EMRA see my Google Scholar page.


Distributed Computing: Motivation & File Systems

Why Distributed Computing?

Distributed computing is the practice of computing with a cluster of computers rather than a single machine. It comes with many advantages.

Work With Very Big Data

Today’s data scientists very often work with data sets that are too large to handle feasibly on one local computer. With a standard installation of a language like R or Python, trying to load a data set larger than the available memory triggers a memory error.

There are ways to bypass these memory limits on a local system by shuffling chunks of a data set between memory and the local hard disk (so-called out-of-core or incremental learning), but these methods are inconvenient, slow, and not very scalable.
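
As a rough sketch of what that looks like in practice, the snippet below streams a CSV through pandas in chunks; the file name and column are made up for illustration.

```python
import pandas as pd

# Process a CSV that is too large to fit in memory by streaming it in chunks.
# 'transactions.csv' and the 'amount' column are hypothetical.
total, n_rows = 0.0, 0
for chunk in pd.read_csv('transactions.csv', chunksize=100_000):
    total += chunk['amount'].sum()
    n_rows += len(chunk)

print(f'mean amount over {n_rows} rows: {total / n_rows:.2f}')
```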

The total memory available across a cluster of multiple computers, however, is higher than that of a single commodity desktop, allowing for seamless use of massive data sets.

Analyze Data Much Faster

Not only is the amount of data that can be processed much higher, but the analysis can be done much more quickly, due to the availability of multiple processors across the cluster. This allows even computationally expensive tasks to be completed easily.

Convenience & Cost

In the past, high-intensity calculations and large data sets were often handled by supercomputers. Supercomputers are highly specialized and expensive compared with the cost and ease of building or expanding a distributed cluster of commodity desktops.

Work on Massively Parallel Clusters of Commodity Computers

Distributed computing allows all of these issues to be resolved by using a coordinated group of normal, commodity desktops which communicate through a network. The different computers on a cluster need to be coordinated by a program known as a distributed computing framework. Hadoop MapReduce and Apache Spark are well-known examples.

Making the most efficient use of resources on a cluster is a significant problem, which these software projects have sought to solve.

Distributed File Systems

In this post, I’ll mention just one of the features that allow distributed computing to work: distributed file systems. A file system is a method for arranging data on a hard drive so that it is clear where one file ends and the next begins.

Local computers use well-developed file systems like FAT32 or NTFS to organize their hard drives.

Distributed clusters of computers use a distributed file system to keep track of the data stored on the cluster. A prominent example is the Hadoop Distributed File System (HDFS). HDFS is installed on every computer in a cluster and works on top of each machine’s local file system.

HDFS uses network communication to make storing very large files on a cluster convenient and fault-tolerant. By default, it keeps three copies of each block of a file, spread across the cluster, in case one or more of the nodes (computers) goes down.

Many distributed computing frameworks, including Hadoop MapReduce and Apache Spark, work with HDFS.
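
As a small sketch of what that looks like from the user's side, the PySpark script below reads a file stored in HDFS and counts its lines; the namenode address and path are placeholders, not a real cluster.

```python
from pyspark.sql import SparkSession

# Start (or reuse) a Spark session; the cluster details come from the
# Spark configuration, not from this script.
spark = SparkSession.builder.appName('hdfs-example').getOrCreate()

# The HDFS URI is a placeholder: the namenode host, port and path
# depend on the cluster.
logs = spark.read.text('hdfs://namenode:8020/data/logs/*.log')

# Spark distributes the work across the cluster, and HDFS lets each
# node read the blocks it already stores locally where possible.
print(logs.count())

spark.stop()
```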


Some of my favorite links

This is a post that I will update periodically with some of my favorite Quora (and Stack Exchange) answers.

From Clay Ford’s blog:

HackerNews:

Singular Value Decomposition:

Other:

Bayesian Updating, Part 2

In Part 1, I explored how Bayesian updating operates when there are two discrete possibilities. I now investigate how Bayesian updating operates with one continuous parameter.

This example is from Chapter 2 of ‘Statistical Rethinking’ by Richard McElreath. The premise can be paraphrased as follows:

Suppose you have a globe representing the Earth. It is small enough to hold. You are curious how much of the globe is covered in water. You will toss the globe in the air. When you catch it, you will record whether your right index finger is on land or water. Repeat and record your results.

The data story we use assumes that each toss is independent of the last and that our prior is a uniform distribution, according to the principle of indifference. In other words, all fractions of water, p, are considered equally likely. In reality, a globe that represents our earth must have some water and some land, based on our experience, so p can’t exactly equal 1 or 0. But for now, let us assume uniformity:

[Figure OrigPrior: the uniform prior, a flat probability density over p from 0 to 1]

The chart here represents probability density, as opposed to the previous case, which showed probability mass. Still, the total probability, represented by the area under the blue line, is one.

We make our first observation, and it is water, or W. This allows us to calculate a posterior distribution, but how?

Bayesian updating uses the following:

Posterior = Likelihood * Prior / Average Likelihood

In this case, the Posterior, Likelihood and Prior represent continuously varying quantities rather than discrete categories as in Part 1. I’ll touch on Average Likelihood later.

The Prior here is the uniform distribution, valued at 1. The Likelihood is the likelihood of observing W under any value of p. This is in fact the definition of p, so Likelihood is just p. If the observation were L for land, Likelihood would be (1-p), since there are only two possibilities.

Average Likelihood is a single value used to rescale the posterior so that the total probability is one. It is the integral of Likelihood * Prior over all possible values of p. In this case, it is:

Integral(from p=0, to p=1)[p * 1]dp = 0.5

Thus, our posterior distribution after observing W is:

Posterior = p * 1 / 0.5 = 2p
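
The arithmetic can be checked symbolically, for instance with sympy:

```python
import sympy as sp

p = sp.symbols('p', nonnegative=True)
prior = sp.Integer(1)   # uniform prior on [0, 1]
likelihood = p          # probability of observing W given p

avg_likelihood = sp.integrate(likelihood * prior, (p, 0, 1))  # 1/2
posterior = likelihood * prior / avg_likelihood               # 2*p

print(avg_likelihood, posterior)           # 1/2  2*p
print(sp.integrate(posterior, (p, 0, 1)))  # 1, so it is a proper density
```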

This looks like the following:

[Figure Post1: the posterior density 2p after observing W]

Notice that the probability density exceeds 1. This is allowed, since it is the integral, not the value of the curve, that must equal 1.

We’re not done yet! Let’s say that we toss the globe again, but this time observe land, L. What do we do? With Bayesian updating, the posterior becomes the prior for continued observations.

Now, Prior is 2p, Likelihood is (1-p) and Average Likelihood is:

Integral(from p=0, to p=1)[(1-p) * 2p]dp = 1/3

Thus, our updated Posterior after observing W L is:

Posterior = (1-p) * 2p / (1/3) = 6p*(1-p)

[Figure Post2: the posterior density 6p(1-p) after observing W then L]

This analysis can continue until, say, 9 observations are made:

W L W W W L W L W

In all, 6 W’s and 3 L’s are observed. Following the pattern above, each W adds a factor of p and each L adds a factor of (1-p). The normalizing constant can be calculated by integration and is 840.

Posterior = 840 * p^6 * (1-p)^3

[Figure PostF: the posterior density 840 p^6 (1-p)^3 after all nine observations]
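
The constant 840 can be checked the same way:

```python
import sympy as sp

p = sp.symbols('p', nonnegative=True)

# Unnormalized posterior after 6 W's and 3 L's with a flat prior
unnormalized = p**6 * (1 - p)**3

# The average likelihood is the integral of this over p; its reciprocal
# is the normalizing constant.
avg_likelihood = sp.integrate(unnormalized, (p, 0, 1))
print(avg_likelihood, 1 / avg_likelihood)  # 1/840  840
```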

The distribution has become much more peaked and looks very similar to a Gaussian.

In principle, this method scales up to multiple parameters and much more complicated models. In practice, however, analytical solutions like the one here quickly become intractable, and numerical methods are used instead.

Grid approximation is one such method, in which each parameter is represented by a sufficient number of points (100-200 is usually enough) rather than treated as strictly continuous. With multiple parameters, however, the total size of the grid scales exponentially, so other methods become important.
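
As a small sketch, here is a grid approximation of the globe-tossing posterior above with numpy, using 200 grid points and a flat prior:

```python
import numpy as np

# Grid approximation of the posterior after 6 W's and 3 L's
grid = np.linspace(0, 1, 200)
prior = np.ones_like(grid)
likelihood = grid**6 * (1 - grid)**3  # each W contributes p, each L (1 - p)

posterior = likelihood * prior
posterior /= posterior.sum()          # normalize so the masses sum to 1

# The most probable grid point sits near p = 6/9, the observed fraction of W
print(grid[posterior.argmax()])
```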