Bayesian Updating, Part 1

I’m reading the new statistics textbook ‘Statistical Rethinking’ by Richard McElreath. Chapter 2 focuses heavily on Bayesian updating, that is, producing a posterior distribution from a prior distribution and a new piece of data.

I’ll paraphrase one of the exercises here:

Suppose there are two species of panda bear. Both are equally common in the wild and live in the same places. They are exactly alike except that they differ in family size. Species A gives birth to twins 10% of the time, and to a single infant the remainder of the time. Species B has twins 20% of the time. Suppose a female panda of unknown species gives birth to twins. What is the probability her next birth will also be twins?

-Paraphrased from ‘Statistical Rethinking’ by Richard McElreath

Without explicit knowledge of Bayesian updating, I don’t immediately know how to proceed with this question. I have some vague notion that the probability should be around 15%, or possibly higher because the first birth was twins. But an exact numerical answer is out of reach.

Having read the chapter, I find such an exercise straightforward. It’s a pure example of Bayesian updating.

The first step is to declare a data story, that is, a way of framing the problem mathematically. Without further information, we will assume that either species is equally likely. This is the principle of indifference, under which possible outcomes are treated as equally likely in the absence of data, and it gives us our original prior distribution. We will also assume that our knowledge about the two species’ twinning rates is correct, without uncertainty. That is probably not a safe assumption in general, but it is made explicit here as an example.

(Figure: the original prior distribution.)

The first birth of twins is a data point, Twins1, that we can use to update our model. McElreath phrases Bayes’ theorem as follows:

Posterior = Likelihood * Prior / Average Likelihood

Posterior is the distribution after including the new data, Likelihood is the probability of observing the data under each of the real-world states under consideration, Prior is the prior distribution, and Average Likelihood is the probability of the data averaged over the prior distribution.

For our case, we can first write Bayes’ theorem as follows:

Pr(p|Twins1) = Pr(Twins1|p) * Pr(p) / Pr(Twins1)

Here, p is a variable standing for the mother’s species, so it covers the twinning likelihoods of both species A and species B. Twins1 is the datum of observing that the first birth is twins.

Pr(p|Twins1) is the posterior distribution, in other words, the distribution p given that we observe Twins1. This is what we want to calculate.

Pr(Twins1|p) is the likelihood of observing Twins1 under any of the known conditions, in other words, 10% likelihood of twins for species A, or 20% likelihood for species B.

Pr(p) is the prior distribution, which we decided based on the principle of indifference to be 50% species A, 50% species B.

Pr(Twins1) is the overall probability of observing Twins1 under the prior distribution, in other words, the likelihood averaged over the possible values of p:

Pr(Twins1) = Σ Pr(Twins1|p) * Pr(p)

We can break this down into explicit terms for the two scenarios represented by p, where SpA and SpB denote the possibilities that the mother is species A or species B, respectively.

Pr(Twins1) = Pr(Twins1|SpA) * Pr(SpA) + Pr (Twins1|SpB) * Pr(SpB)

Under our current prior:

Pr(Twins1) = 0.1 * 0.5 + 0.2 * 0.5 = 0.15

Breaking the overall equation down by species, we get two separate equations:

Pr(SpA|Twins1) = Pr(Twins1|SpA) * Pr(SpA) / Pr(Twins1)

Pr(SpB|Twins1) = Pr(Twins1|SpB) * Pr(SpB) / Pr(Twins1)

Our posterior distribution is thus:

Pr(SpA|Twins1) = 0.1 * 0.5 / 0.15 = 0.33

Pr(SpB|Twins1) = 0.2 * 0.5 / 0.15 = 0.67

Thus, the posterior distribution looks like this:

(Figure: the posterior distribution after observing Twins1.)

The probability of observing another set of twins, Twins2, given that the first birth was twins, is thus:

Pr(Twins2|Twins1) = Pr(Twins2|SpA) * Pr(SpA|Twins1) + Pr(Twins2|SpB) * Pr(SpB|Twins1)

Pr(Twins2|Twins1) = 0.1 * 0.33 + 0.2 * 0.67 = 0.167 = 16.7%
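
To make the arithmetic concrete, here is a small Python sketch of the same calculation (the dictionary and variable names are my own, not from the book):

# A minimal sketch of the updating arithmetic above; names are illustrative.
prior = {"SpA": 0.5, "SpB": 0.5}          # principle of indifference
twin_rate = {"SpA": 0.1, "SpB": 0.2}      # likelihood of twins under each species

# Average likelihood of the first observation of twins under the prior.
avg_likelihood = sum(twin_rate[s] * prior[s] for s in prior)   # 0.15

# Posterior over species after observing the first set of twins.
posterior = {s: twin_rate[s] * prior[s] / avg_likelihood for s in prior}

# Probability that the next birth is also twins, under the posterior.
p_twins2 = sum(twin_rate[s] * posterior[s] for s in posterior)

print(posterior)   # {'SpA': 0.333..., 'SpB': 0.666...}
print(p_twins2)    # ~0.167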

This is a relatively simple example. In other cases, we are not dealing with discrete categories (SpA and SpB) but with a continuous parameter. For example, we might want to determine the actual likelihood of twin birth, call it q, without having known categories. I’ll deal with this case in an upcoming post.

For a more complete exploration of Bayes, see Arbital’s treatment, the LessWrong treatment or Wikipedia. Or even better, get ‘Statistical Rethinking’.


Learning Git

One of the biggest differences between writing academic code and writing production code is the emphasis on readability, maintainability, and traceability. While academic code is often needed only once, and used by one or a few people with low turnover, production code is used over and over again, must be adaptable to change, and must be intelligible to anyone who needs to use it.

In my academic work, I wrote MATLAB code for myself and maybe a few close collaborators. I didn’t worry about version control or documentation; any questions were easily dealt with, and my familiarity with the codebase made it easy for me to navigate. When a paper was published, I posted the complete code and brief instructions on GitHub using Git, but that was the extent of the documentation. In this case, my use of Git and GitHub was exclusively as a code storage facility and was not integrated into my development process.

Version control, and Git in particular, is much more powerful than this. I recently read through Atlassian’s Git Tutorial. It was one of the better online tutorials I’ve read, especially because of the excellent context provided throughout.

(Screenshot from the tutorial: just enough detail, just enough context.)

The tutorial isn’t just a rundown of Git commands. Instead, it provides just enough history of version control, and enough sample workflows, that I feel I understand how to use Git in the context of an actual project, and why Git is usually preferred to centralized systems like SVN. It also mixes in code snippets, but doesn’t overload on technical details I’m not going to remember for very long.

The tutorial makes technical points when necessary for comprehension and points out their most common uses in typical workflows. It makes the distinction between commits and the branches which point to them, for example, and how this leads to a difference between “git checkout <branch>” and “git checkout <commit>”. It also talks about the common practice of cleaning up the development history of a new branch into logical commits before merging with master. This is a practice which is no doubt critical to long-term project maintenance, but wouldn’t be obvious from a simple description of “git rebase -i”.

The tutorial doesn’t have enough information to serve as viable Git documentation on its own, but then it doesn’t have to. The official Git documentation and the excellent book Pro Git are only a Google search away. The importance of the tutorial is the ideas, structures, and workflows it imparts to readers, who should then know what to search for later when working on a project.

For a data scientist, Git and version control workflows are important skills. The development of models and data-cleaning code should be traceable through Git, and knowledge of Git for application development is also key, since data scientists are inevitably involved, to a greater or lesser degree, in that process as well.

Traversing TCGA: Implementing an Interactive Web Application

This is the 6th and (probably?) final post in my series Traversing TCGA.

So far, I have downloaded, extracted, cleaned, and analyzed data from The Cancer Genome Atlas for acute myeloid leukemia (AML) patients. I have used Python to perform linear regressions of patient survival against various clinical data.

Now I have made it available as an interactive web application.

(Screenshot of the web application.)

The application allows users to select which features to include in the model, then click Submit. A predictive model is generated with linear regression, and predicted survival is plotted against actual survival. If the prediction is good, there is a positive correlation; if not, there is little or no correlation.

The web tool uses a PostgreSQL database to store the data. In this case, the data are small and unlikely to be updated regularly, so a relational database (RDB) like Postgres is not really necessary; the data could simply be stored as a .csv or .xml file on the server instead. However, I decided to use Postgres to teach myself how it works and how it can interact with a web application.

The web page itself uses checkboxes to allow users to select which features to include in the model. Each checkbox has an attribute (‘value’) which uniquely corresponds to the name of a column in the database. Upon clicking submit, the user’s browser sends a list of all these values to the server. The server queries the database for the relevant feature columns and the target variable (survival). Then the server performs a linear regression fit and sends the predicted and actual survivals to the browser.

The communication between browser and server occurs via AJAX, handled by a jQuery.getJSON() request in the browser and by Python Flask listening on the server. The Flask app runs the Python code (scikit-learn for the regression, psycopg2 for querying the Postgres database), generates the linear regression, and sends the data back to the browser, serialized as JSON. The user’s browser receives the data and plots it using Plotly.
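
As a rough illustration of how these pieces fit together, here is a minimal sketch of what the server-side endpoint could look like. The route, table and column names, and the feature whitelist are hypothetical stand-ins, not the app’s actual code:

# A minimal, hypothetical sketch of the server-side endpoint; names are illustrative.
from flask import Flask, request, jsonify
import psycopg2
from sklearn.linear_model import LinearRegression

app = Flask(__name__)

# Hypothetical whitelist mapping checkbox 'value' attributes to numeric database columns.
ALLOWED_FEATURES = {"age", "white_blood_cell_count", "platelet_count"}

@app.route("/fit")
def fit():
    # Feature names sent by the browser as repeated 'features' query parameters.
    requested = request.args.getlist("features")
    features = [f for f in requested if f in ALLOWED_FEATURES]  # assume at least one is selected

    # Query the selected feature columns plus the target (survival) from Postgres.
    # Table and column names here are illustrative.
    conn = psycopg2.connect("dbname=tcga")
    cur = conn.cursor()
    cur.execute("SELECT {}, survival_days FROM clinical".format(", ".join(features)))
    rows = cur.fetchall()
    conn.close()

    X = [row[:-1] for row in rows]   # feature values
    y = [row[-1] for row in rows]    # actual survival

    # Fit the linear regression and return predicted vs. actual survival as JSON.
    model = LinearRegression().fit(X, y)
    return jsonify({"actual": y, "predicted": model.predict(X).tolist()})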

Please give the app a try. Comments/questions are very welcome.