A brief aside: Video Productions

On a slightly different note, I’d like to showcase my award-winning video productions here.

These two won a total of $2500 in a competition sponsored by the UCLA Department of Chemical and Biomolecular Engineering:

These were produced for the UCLA California NanoSystems Institute.

These two bike safety videos each won $200 in a contest sponsored by the UCLA Bicycle Coalition.


Jupyter Notebook Explorations

I’ve recently been getting to know the Jupyter Notebook environment. It’s a very convenient development environment that supports many languages, and it’s best known as the evolution of the IPython Notebook.

Jupyter works by starting a server on your own computer to execute Python (or another language’s) code. The notebook is rendered in a web browser and consists of cells of code. Cells can be run individually, which sends that cell’s code to the Jupyter server on your computer. The results (text, plots, or error messages) are sent back to the browser and displayed directly below the code cell. Additionally, the kernel on the server keeps the variables you create in memory, so they can be used again in another cell.
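As a tiny sketch of that cell-by-cell workflow (written in R here to match the code later in this post; Jupyter also runs R via the IRkernel, and Python behaves the same way):

# Cell 1: run this first; after it finishes, the kernel keeps `samples` in memory
samples = rnorm(1000)

# Cell 2: run separately, even much later; it still sees `samples`
hist(samples, breaks = 30)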

Alternatively, you can access a remote server which is set up for Jupyter Notebook use, such as at try.jupyter.org.

The benefits of Jupyter over, let’s say, Sublime/cmd are convenience and presentation style. Because of its cell architecture, the Jupyter Notebook makes it easy to run one small piece of code at a time. This makes debugging much faster and more streamlined, as it’s easy to check intermediate variable values, types, and so on. Further, it’s very easy to share your work, for example on nbviewer.jupyter.org, and it looks good too, with all plots rendered inline on the page.

On the other hand, Jupyter Notebook is not really in the same category as ‘fully featured’ Python IDEs like PyDev or PyCharm. For one thing, the output is not a .py file but an .ipynb file, which is actually a text file holding the code cells and their output as JSON. Developing Python packages or applications is thus not really possible or expected in Jupyter; instead, demonstrating existing packages and walking through data analyses is its main point and expected use.
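(If you’re curious what that file looks like, it can be opened like any other JSON document. Here’s a quick sketch in R using the jsonlite package; the filename is just a placeholder, and this assumes a notebook in the current format:)

library(jsonlite)

# read the notebook as plain JSON, keeping the nested list structure
nb = fromJSON("my_notebook.ipynb", simplifyVector = FALSE)

# notebooks store their cells in a top-level "cells" list
sapply(nb$cells, function(cell) cell$cell_type)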

Check out my Jupyter Notebooks:

Statistical Diversions: Collinear Predictors

Collinear predictors present a challenge in model construction and interpretation. This topic is covered in an intuitive and engaging style in Chapter 5 of the excellent book Statistical Rethinking, by Richard McElreath.

Collinearity refers to the situation where predictor variables correlate strongly with one another. To see why this can be a problem, consider the following example.

Let’s say we have:

Independent variables: x,  y

Dependent variable: z

We can use R to generate sample data sets in the following way:

x = rnorm(50)                      # 50 samples from a standard normal
y = rnorm(50, mean = -x, sd = 1)   # each y is centered on -x, so y tends to track -x

This generates a set of normally distributed samples, x, with mean 0 and sd 1. It then generates a sample of y, using -x as the mean for each sample. This produces correlated variables:

cor(x,y)
[1] -0.7490608

Correlation varies from -1 to +1, with 0 being uncorrelated and +/- 1 indicating perfect (linear) correlation.

We can plot this data in 2 dimensions to get a better picture:

[Figure: scatter plot of x vs. y]

As we can see, x & y are clearly correlated.
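(A plot like the one above takes a single line of base R; the original figure may have been styled differently:)

plot(x, y, xlab = "x", ylab = "y")   # scatter plot of the two correlated predictors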

We can now generate a sample for z:

z = rnorm(50, mean = x+y)

Let’s take a look at the x-z and y-z plots:

[Figure: scatter plot of x vs. z]

Here x and z appear completely uncorrelated.

[Figure: scatter plot of y vs. z]

Meanwhile, y and z look to have a slight positive correlation. Is that all we can do?

We can also use built-in R functions to look at the point estimates for the slopes of a linear regression on these data sets:

LMxz = lm(z ~ x)
LMxz$coefficients[2]

x
-0.1678911

LMyz = lm(z ~ y)
LMyz$coefficients[2]

y
0.5118612

As we suspected, the slope for x is close to 0, while for y it is larger and positive.

(Note that we have only looked at the point-estimates for the slopes without quantifying uncertainty. In general, it’s better to have an estimate for the uncertainty in model parameters, but we’ll leave that for another time.)
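(For the curious: R’s confint() gives confidence intervals for the fitted coefficients, which is one quick way to see that uncertainty:)

confint(LMxz)   # 95% confidence intervals for the intercept and the slope on x
confint(LMyz)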

Is this the whole story? Let’s try the following:

LMxyz = lm(z ~ x + y)
LMxyz$coefficients[2:3]

x                      y
0.9391245     0.9768274

Here we can see that when the linear model includes both x & y, both slopes are positive and near 1. What happened to our other coefficients, -0.17 and 0.51?

Picture a three-dimensional space, with the x-axis coming towards you, the y-axis going left and right, and the z-axis going up and down.

[Figure: three-dimensional x, y, z axes]

Now remember, most of our data is actually scattered roughly around the line y = -x. Combine that with the plane z = x+y (roughly what we got from the multivariate linear model), and this is what it looks like:

[Figure: the plane z = x + y and the line y = -x in three dimensions]

The plane cuts through the origin, coming down, towards us and to the left, while going away, up and to the right. The line y = -x is indicated by a thin gray line. The intersection of the line and the plane occurs at z = 0 (!): the whole line lies in the plane at height zero.
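(If you’d like to sketch this picture yourself, base R’s persp() is enough for a rough version; the figures here may have been made with a different tool:)

# a grid of points covering the plane z = x + y
gx = seq(-3, 3, length.out = 30)
gy = seq(-3, 3, length.out = 30)
gz = outer(gx, gy, "+")    # z = x + y at each grid point

# draw the plane and keep the perspective matrix for adding the line
pm = persp(gx, gy, gz, theta = 35, phi = 20, xlab = "x", ylab = "y", zlab = "z")

# the line y = -x (at z = 0) lies entirely within that plane
lines(trans3d(gx, -gx, rep(0, length(gx)), pm), col = "gray", lwd = 2)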

This is why the single-predictor models don’t show much of an effect while the multivariate model does. The data scatter along the line y = -x, which the plane crosses at z = 0, so as x rises, y tends to fall by the same amount and z barely changes; it is only after controlling for y that the effect of x can be determined (and vice versa).

The predictive power of an x-only or y-only model is not awful (though it is worse), so long as the new data still clusters around y=-x. However, the interpretation of the model is significantly clouded by the collinearity between x & y until the multivariate model is inspected.
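(That claim about prediction is easy to check: generate fresh data from the same process and compare out-of-sample errors. A rough sketch:)

# fresh data from the same generating process
new_x = rnorm(50)
new_y = rnorm(50, mean = -new_x, sd = 1)
new_z = rnorm(50, mean = new_x + new_y)
newdat = data.frame(x = new_x, y = new_y)

# root-mean-square prediction error for each of the three models
rmse = function(pred) sqrt(mean((new_z - pred)^2))
rmse(predict(LMxz, newdata = newdat))
rmse(predict(LMyz, newdata = newdat))
rmse(predict(LMxyz, newdata = newdat))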

Fun side note: the result we get nearly recovers the statistical process used to generate z ( z = rnorm(50, mean = x+y) ).
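One more back-of-the-envelope check ties the single-predictor slopes to that process. Since y = -x + noise and z = x + y + noise, substituting gives z = (noise) + (noise), with no x left in it at all, which is why the x-only slope came out near 0. Going the other way, knowing y alone suggests x is probably about -y/2 (the standard-normal prior on x and the noisy observation -y of x carry equal weight), so the expected z given only y is -y/2 + y = y/2, in line with the ≈0.51 slope from the y-only model.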