
Data vignette: men are worse drivers

This post is based on a notebook I wrote a couple years ago. I’d like to revisit and expand on it, as well as correct some errors. The original notebook is here.

In this post, I analyze traffic collision data from Los Angeles County in January 2012. The analysis is sound, but the conclusion in the title is meant to be taken with a grain of salt, due to the limitations of the data source, including the time span and specific location.

The source of the data is a California state traffic data system called SWITRS (Statewide Integrated Traffic Records System). The system allowed for the download of about 5,000 incidents at a time. The data are submitted by local law enforcement agencies based on traffic incidents they respond to. The system allows downloads in different categories: victims, collisions and parties. The parties data includes all parties in a collision, including parties at fault and not at fault. This becomes important in the contextualization of the data.

Male-female, Fault-no fault

The data contains many dimensions that allow for much more in-depth analysis. However, for our purposes, the analysis is quite simple.

The key to this analysis is to bundle up the data into a two-by-two contingency table and interpret it. Using pandas, the following result can be obtained:

                Male   Female
At fault        2231     1369
Not at fault    2577     1938
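
To reproduce this kind of table, a minimal pandas sketch might look like the following (the filename and the column names 'sex' and 'at_fault' are hypothetical; the real SWITRS export uses different field codes):

import pandas as pd

# Load the SWITRS "parties" download (hypothetical filename).
parties = pd.read_csv('switrs_parties_jan2012.csv')

# Two-by-two contingency table: at-fault status by sex.
table = pd.crosstab(parties['at_fault'], parties['sex'])
print(table)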

We can clearly see that for drivers at fault, males outnumber females 2231 to 1369, immediately suggesting that males are far worse drivers when it comes to causing collisions.

On the basis that males and females make up roughly equal portions of the population, we would expect that if they drive with equal safety, they should be equally represented among drivers at fault for collisions.

If you’re paying attention, you probably noticed a problem with this logic. While the population split is roughly equal, men may be over-represented as drivers since they are more likely to be in the driving and delivery professions. In fact, with this data we can make some estimates about that.

Correcting errors in reasoning

In my notebook, I made an error in summing the at-fault and not-at-fault parties as a representation of the overall driving population. That would have been a reasonable move if we had counted all the drivers who were not at fault, not just those who were in a collision. However, that is not feasible.

Instead, we take the drivers in a collision who were not at fault as a representation of the drivers on the road at any given time. As far as we know, there is nothing special about these drivers; they just happened to be in the wrong place at the wrong time. The total number of drivers in LA County (in a collision or not) who are not at fault is much higher, roughly 5 million [1,2]. Under our assumptions, the ratio of male to female drivers is thus 2577:1938, or 1.33:1. Scaling out, that is about 2.85 million male to 2.15 million female drivers who are not at fault.

In comparison, the number of drivers at fault is quite small and does not meaningfully change the overall driver totals.

Now calculate the overall difference

The rate of male drivers causing an accident is thus 2231/2.85/100 = 7.83 drivers at fault / 10,000 male drivers.

The rate of female drivers causing an accident is thus 1369/2.15/100 = 6.37 drivers at fault / 10,000 female drivers.

The ratio of risk would then be 7.83/6.37 = 1.23, or a 23% greater risk for male vs. female drivers.
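
The same arithmetic in a few lines of Python, using the counts above:

male_at_fault, female_at_fault = 2231, 1369
male_drivers, female_drivers = 2.85e6, 2.15e6   # estimated not-at-fault driver counts

male_rate = male_at_fault / male_drivers * 1e4      # at-fault drivers per 10,000 male drivers
female_rate = female_at_fault / female_drivers * 1e4
print(male_rate, female_rate, male_rate / female_rate)   # ~7.83, ~6.37, ~1.23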

Conclusion: men are, on average, worse (more dangerous) drivers

 

Notes

[1] The actual number used here is an estimate. Intermediate values are therefore accurate only to the accuracy of the estimate. The number is important to show scale relative to the numbers of drivers at fault. However, the actual number would cancel out in the calculation of ratio between male and female risk.

[2] Using only number of drivers here is a bit misleading, since drivers are not on the road for equal amounts of time. In fact, it is reasonable to expect that number of drivers is roughly equal between men and women, but that men spend more hours driving, as they are more likely to be in the driving and delivery professions. Instead, we could estimate 5 million * 30 hrs / month = 150 million driver hours / month. Scaled out it is about 85.6 million:64.4 million driver hours male:female. The accident rates would then be: 2231/85.6/100 = 0.261 drivers at fault / 10,000 male driver hours and 1369/64.4/100 = 0.213 drivers at fault / 10,000 female driver hours.


For presentations, focus on narrative

You’ve collected the data, you’ve run the analysis, now you have to decide how to present. You’ve considered it from every angle, and you’re preparing a slide deck to match–detailed, lengthy and technical. Is this the right approach? Probably not.

Rule of thumb: include no more than one figure per topic

When you’re the technical expert, people aren’t always trying to prove you wrong. They expect you to do the work correctly, until you give them a reason not to. You don’t have to take them through the entire process of analyzing the data. One figure is enough to make the point and move on. More than that becomes boring and forgettable.

This was an adjustment for me. Academic seminar talks are all technical, and audience questions take the presenter through the weeds.

There are exceptions to this rule, but for most audiences, one figure is the limit of attention. Choosing one figure to make the point will help to clarify your message. That said, be ready to answer questions from many angles.

If they don’t want data, what do they want? Narrative

Instead of spending most of your time thinking about and preparing figures, spend it thinking about narrative and storytelling. What are the keys?

A crucial element of narrative is an upward or progressive arc. We like companies that grow, societies that evolve, people that improve themselves and narratives that move from bad to better:

“We had some trouble early in Q2, but learned our lesson and fixed the problem.”

“We sustained initial losses but our heavy investments paid off in the long run.”

Another part of memorable narratives is the Goldilocks principle–not too much, not too little:

“We were over-invested in customer service. We were able to reduce costs without affecting our feedback scores.”

“This model achieves the right balance of simplicity and complexity and will maximize profit.”

Another type of narrative that appeals to me is the contrarian narrative: the experts say X, but here’s why they’re wrong. This one’s more of a personal taste and should be used carefully.

Narratives operate at different levels and in different contexts

Business executives craft a narrative about their business that must satisfy investors, motivate employees, generate positive PR, and be stable over time.

In a similar way, data scientists must craft narratives about data that are consistent with the facts, satisfying to leadership, palatable to stakeholders and memorable.

So next time you’re presenting your data, think about narrative, and see how much impact you can have.

Lead Scoring with Customer Data Using glmnet in R

Lead Scoring

Lead scoring is an important task for business: identifying which individuals in a population are likely to convert (purchase) if marketed to, assigning them a probability of converting, or estimating how much value each individual may have as a customer. Properly using data to support this task can greatly benefit your business, and you personally if it is a skill you can bring to bear on customer data sets.

Lead scoring is not a strictly defined concept; it can refer to many processes for identifying valuable customers. Depending on context, you may use customer purchase data, demographic data or web activity to inform your model. Whichever inputs you use, the mathematical concepts and models discussed here apply equally.

The R code for this post is available here: https://github.com/theis188/lead-scoring/blob/master/Lead%20Scoring.r

Classification Vs. Regression

Machine learning tasks mostly break down into either classification or regression. Regression tasks involve predicting a number (the price of a home, for example), while classification tasks involve labeling (will a customer convert or not). Lead scoring may be framed as either classification (will the customer convert) or regression (how much will a customer spend). I will focus on the classification problem.

Bias & Variance

Bias and variance are two sources of error in parameter estimation. There is typically a trade-off. As error from bias decreases, error from variance increases. Usually there is a happy medium where total overall error is minimized.

Bias creates error when the model is too simple to capture the true nature of the process being modeled. When bias error is high, predictions are systematically off no matter how much data is used. Variance error arises when the model is very sensitive to the source data. When variance error is high, small changes in the observations used to train the model will drastically impact model parameter estimates.
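
For squared-error loss, this trade-off can be written out explicitly:

expected test error = bias^2 + variance + irreducible error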

Evaluating Model Performance

Model performance is frequently evaluated by cross-validation. Some portion of the data is set aside during training. This data, which the model has not seen before, is used to evaluate and choose the best model. There are other ways of evaluating models known as information criteria, but I will focus on cross-validation. So in this post, I’ll be splitting the data into a ‘train’ and ‘test’ set to evaluate model performance.

Statistical Measures of Performance

In lead scoring, a large majority of customers will not convert. Thus, simple metrics like accuracy may not be sufficient to measure performance. In a typical scenario, maybe 5% of customers convert, so a simple classifier that always predicts “no conversion” achieves 95% accuracy. This may sound impressive until you realize you will not find any leads with this classifier!

Precision is the fraction of instances labeled positive by the classifier that are truly positive. This may be very important when the consequence of a false positive is high, as with a lie detector test. You can think of it as the ‘positive accuracy’ of the classifier.

Recall is the fraction of all positive cases that are correctly labeled by the classifier. In a scenario like disease testing, this may be very important, since the consequences of a missed positive are severe.

The f1-score is the harmonic mean between precision and recall. A perfect f1-score requires both perfect precision and recall.
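
Written as a formula:

f1 = 2 * (precision * recall) / (precision + recall)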

Choice of Method

It can be hard to choose between different methods of classification for lead scoring tasks. In general, it is best to train many different types of models and compare. If you have a lot of data, using more flexible methods such as QDA or logistic regression with polynomial terms might be preferable, since the impact of variance error will be lower. With less data, you may consider simpler models or regularization, which is a method for penalizing model complexity. Overall model performance can be assessed using cross-validation and test set performance statistics.

Data Source

The data source I will use is the Caravan data set from the ‘Introduction to Statistical Learning’ textbook. It is a data set with 85 predictors and one outcome variable, ‘Purchase’, i.e. whether the customer converted. There are 5,822 customers and 348 (6%) converted. The data are described here: http://liacs.leidenuniv.nl/~puttenpwhvander/library/cc2000/data.html.

Test and Train

First, let’s split our data into test and train:

(The R code for this post is available here: https://github.com/theis188/lead-scoring/blob/master/Lead%20Scoring.r)

 

library(ISLR)
splits <- split_data(Caravan,0.2)
test_splits(splits)

We first load the ISLR package and then split the data. I define a number of helper functions which you can inspect in the source code for this post. The function ‘split_data’ splits the data “Caravan” into test and train (20%) and normalizes it. Normalization is a good practice when we are going to use regularization, which is a method of controlling the bias-variance tradeoff.  We then test the splits for size and normalization using ‘test_splits’.

Logistic Regression

Logistic regression uses a form similar to linear regression. It models probability using a transformation function so that predicted values always lie between 0 and 1. Because of the flexibility and popularity of this method, and the number of implementations available, I will spend most time on it.
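
Concretely, for predictors x1 … xn the model takes the form:

p(conversion) = 1 / ( 1 + exp( -(b0 + b1*x1 + … + bn*xn) ) )

so the predicted probability always lies between 0 and 1.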

There are many implementations of logistic regression in R; I will focus on the glmnet package. By default, glmnet fits 100 different models, each with a different level of regularization. More regularization (L1 by default in this package) will set more coefficients to 0 and reduce variance error.

In my implementation, we fit the models as follows:
fit <- glmnet( matrix_from_df(splits$train_x), splits$train_y, family='binomial')

(Again, helper functions defined in source code for this post).

Logistic regression works in a way similar to accuracy maximization. In other words, it treats positive and negative examples as equally important! With a rare positive class, the classifier is very reluctant to mark observations as potential conversions, since ‘no conversion’ is almost always the safer guess.

We can test which of the observations had a greater than 50% chance of converting:
 fit <- glmnet( matrix_from_df(splits$train_x), splits$train_y, family='binomial')
 test_predict <- predict(fit,newx=matrix_from_df(splits$test_x) )

In my test/train split, only 3 instances out of 1,110 were marked positive. At least it got all 3 of those correct!
get_confusion_matrix(test_predict[,100]>0,splits$test_y)

     pred
test  FALSE TRUE
  No   1033    0
  Yes    74    3

So is there anything we can do? Fortunately there is. By default, the logistic regression uses a cutoff of 50% probability. If the classifier sees a greater than 50% chance of conversion, it marks it positive. We can choose our own cutoff, or decision boundary. Since we know positive conversions are much more important than negative, we could choose, say 10%. If there is even a 10% chance of conversion, we want to mark it as positive.
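
One practical note: predict() on a glmnet binomial fit returns the linear predictor (the log-odds) by default, so a probability cutoff translates to a threshold on that scale:

p > 0.10  is equivalent to  log-odds > log(0.10 / 0.90) ≈ -2.20

The 50% cutoff corresponds to log-odds > 0, which is why the confusion matrix above thresholded test_predict[,100] > 0.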

Let’s use the 10% cutoff and see what levels of regularization are best:

[Figure: test-set F1 score as a function of lambda]

Lambda, here, is a measure of regularization, and it’s usually a ‘not too much, not too little’ situation. We see that the F1 score on the test set is highest for lambda between 0.005 and 0.010 and is lower outside that range.

Feature Selection

There are a number of methods for feature selection. A simple one is L1 regularization, which penalizes the model based on the absolute value of the coefficients. As a result, the model sets many of the coefficients to 0. This is the default behaviour of glmnet. We can then see which variables are most important and how they affect the prediction. In this case, I selected a high level of regularization and output the nonzero coefficients.
 (Intercept)     MOPLLAAG      MINKGEM     MKOOPKLA     PPERSAUT       PBRAND     APLEZIER
 -2.85973655  -0.02207064   0.01358764   0.04784260   0.35163469   0.03212729   0.08691317

You can then use inference to guide your customer acquisition behavior. Here, PPERSAUT is highly positive, so it is highly correlated with conversion. I redid the model several times with different train/test splits and PPERSAUT appears to be significant most of the time.

In the data source, the response variable is whether the customer purchased mobile home insurance. PPERSAUT indicates whether the customer has car insurance. Thus, it seems someone with one type of insurance is more likely to buy other kinds.

There are other methods for feature selection such as forward selection and backward selection, but I won’t discuss those here.

Moving From Statistical Metrics to Financial Metrics

In a business context, statistical metrics like f1-score are important, but less important than financial metrics like expected revenue and expected profit. Let’s assume that every customer costs $1 to market to and if they convert, it generates a profit of $10.
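
Under those assumptions, for a given model and cutoff:

expected profit = $10 * (true positives) - $1 * (customers marketed to)

where "customers marketed to" means everyone the model labels positive.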

Here, I choose a low, medium and high lambda and vary the cutoff probability. We can now calculate an expected profit for every version of the model and pick the best:

[Figure: expected profit vs. cutoff probability for low, medium and high lambda]

It looks like about 8% gives the highest profit for each of the values of lambda. The high lambda model can generate the highest profit in this case, but the result seems unstable. It would be a good idea to perform k-fold cross validation on this particular model and test it on other random subsets of the data to ensure good performance. In general, the results for low and medium lambda models look relatively stable and repeatable.

Other Methods

There are a handful of other methods that are good for lead scoring classification.

Linear & quadratic discriminant analysis (LDA & QDA) approaches fit multivariate Gaussian distributions to each class in the response variable. LDA assumes the same covariance in each class and as a result has only linear decision boundaries. QDA fits a covariance within each class and thus allows for more complex decision boundaries. QDA may be better if you have more data or fewer predictors, while LDA may be better with less data or more predictors.

K-nearest neighbors (KNN) is a non-parametric approach, meaning it makes no assumptions about the form of the relationship between predictors and response. For any given instance, KNN looks at the nearest K labeled instances and predicts the majority class. The only hyperparameter to fit is K: lower K is very flexible, higher K is less flexible.

Support vector machines (SVM) can be very good if the relationship between predictors and response is non-linear. SVM is a linear method at heart, but it uses a function called a kernel and the so-called ‘kernel trick’ to measure similarity between instances in a higher-dimensional space. SVM is worth a try since it can find important non-linear relationships. However, as with many sophisticated modeling strategies, model intelligibility, or understanding of the significance of different factors, is decreased.

More Sophisticated Models

The difficulty of implementing more sophisticated models like neural networks and gradient boosting has decreased. For some classes of data, such as images, audio and text, neural networks offer much superior performance.

In business-oriented contexts, simple models can be better. You will need to explain the model and get buy-in. Simple models tend to be easier to implement and have simpler failure modes. They can also be important for inference, which means reasoning about the relationship between inputs and outputs. For example, if you notice that a certain zip code has customers that are highly likely to convert, perhaps you decide to invest in more customer acquisition in that area.

An additional consideration is that the distributions of customers and customer preferences will change over time. The distribution of voice and speech data stays nearly fixed, allowing exquisitely fit neural network models to remain useful. On the other hand, simple, robust, low-variance models may provide better performance in a business context where customers tend to evolve over time.

Being Aware of Data Sourcing

There are many biases and sources of error to avoid. One key source of bias I will call the ‘gathered data bias’. This means the source of your training data is not the same as the source of data the model will be applied to. For example, in marketing, you may have good conversion data on a set of customers that the marketing team selected. However, the marketing team may have selected based on who they thought would convert, rather than being a random sample.

If you apply the trained model to the population in general, you may not get the results you expect. For instance, say marketing selects only customers in one income band. It then becomes impossible to estimate the impact of customer income, so the true significance of this variable will not be discovered, and the model may not perform well on customers in other income bands. This highlights the importance of investing in data. To train a universal model, you will have to get universal data, which means marketing to a totally random sample of the population.

References & Mathematical Detail

For mathematical details, there are excellent treatments available free.

For an introduction, Andrew Ng’s Coursera course ‘Machine Learning’ is a good resource:

https://www.coursera.org/learn/machine-learning/

The PDF version of ‘Introduction to Statistical Learning’ is available free from the authors and is highly recommended. Much of this post is based around this book.

http://www-bcf.usc.edu/~gareth/ISL/

‘Elements of Statistical Learning’ is a much more rigorous treatment.

https://web.stanford.edu/~hastie/ElemStatLearn/

Data Science From Scratch

I visit Quora regularly and am always astonished by the number of people asking how to become a data scientist. It’s a fascinating field, and one I was able to (mostly) “bootstrap” into, out of a quantitative PhD (bioengineering). There are three main components to making it in data science.

Understand Programming

For true beginners, tutorials like W3Schools will get you off the ground.

For practical skills, Cracking the Coding Interview is a highly regarded guide. I’d also recommend coding challenges like leetcode and codewars.

It’s useful to know about data structures and algorithms. I learned this topic mostly ad hoc. CLRS has been recommended to me, though I haven’t read it myself.

For background on the history and culture of programming, Paul Ford’s book-length essay What is Code? is truly indispensable. It’s the best essay I have ever read on programming and maybe the best ever written.

Python and SQL are the most important languages for data science.

Understand Machine Learning

The absolute best place to begin is Andrew Ng’s Machine Learning course. It covers regression, classification, gradient descent, supervised and unsupervised learning, and even a brief piece on neural networks. Essential knowledge, good presentation, good depth, good assignments.

“Introduction to Statistical Learning” and “Elements of Statistical Learning” are great resources, expanding on Ng in some ways. ESL goes into considerable depth, and the PDFs are available online from the authors.

Machine learning has moved on and it’s good to be familiar with the new techniques, namely XGBoost and deep learning.

For XGBoost, reading through the documentation is a good place to start. There’s also a scholarly paper about it.

For neural networks, I recommend Stanford CS231n (Winter 2016). I think there is a newer version up, but I have not studied it yet.

Understand Statistics

In some ways, this may be the least important piece of it, at least for starting out. It’s rarer to get an interviewer who really understands statistics. Still, I very much recommend understanding statistics, and I value the statistics I have learned. In the long run, it will help you justify and be confident in your results.

The point of entry here is Brian Caffo’s Biostatistics Boot Camp. It’s a little bit dry, but I truly valued the precision and rigor in this course. There is also a part 2.

I read the textbook “Statistical Rethinking” on Bayesian modeling and thought it was excellent. There’s also a free PDF online from the author of “Doing Bayesian Data Analysis”, which has been recommended to me.

“Mathematical Statistics and Data Analysis” is an excellent, comprehensive reference, but maybe not worth reading all the way through for most.

Other areas of statistics I want to learn more about: information theory, latent variable modeling, factor analysis, linear modeling.

Keeping Up

That’s it. If you understand programming, machine learning and statistics, you are well on your way to landing a data science job.

It’s important to stay on top of the game. Podcasts, meetups and forums are good for this. For podcasts, the Google Cloud podcast is good, and Data Skeptic is also not bad. The best forum I know of is probably Hacker News.

It’s also important to explore your own interests. I’ve found myself gravitating more towards the ETL/data engineering/infrastructure end of things rather than pure analysis.

Deep Neural Networks: CS231n & Transfer Learning

Deep learning (also known as neural networks) has become a very powerful technique for dealing with high-dimensional data such as images, audio, and video. As one example, automated image classification has become highly effective. This task consists of assigning an image to one of a set number of classes.

Look at the results of the ImageNet Large Scale Visual Recognition Challenge, an annual challenge for image classification, recognition and detection algorithms over images from 1,000 categories. The top-5 classification error rate has decreased from 28% in 2010, to 26% in 2011, to 15% in 2012, then 11%, 7%, 3.5%, and about 3% this year. Other tasks continue to improve with the use of neural networks, especially convolutional and recurrent neural networks.

The number of resources for learning about neural networks has also multiplied dramatically. One resource I can recommend firsthand is the CS231n Stanford course on image recognition. The course syllabus, including slides, lecture notes and Jupyter notebook assignments, is available online. Additionally, the lectures are available on YouTube. I’m up to lecture 5 and can highly recommend both the slides and the YouTube lectures. There is really excellent theoretical background on topics like the history of image processing, loss functions and backpropagation, as well as practical advice on weight initialization, transfer learning, learning rate, regularization and more.

Another resource that has been linked recently on Hacker News is the Yerevann guide to deep learning, which seems to be a good, deep source of information.

 

The folks at TensorFlow just keep improving their offerings. I recently implemented their transfer learning with Inception v3, using 8 categories I chose from ImageNet. It was surprisingly easy to train my own classifier, and it was quite effective (top-1 error rate <10%)! The only real issue I had was some bazel errors, which were resolved by upgrading my version of bazel. Training on roughly 8,000 images took about 12 hours on a MacBook Pro using CPU only. To be more specific, the bottleneck phase took about 12 hours, while the actual training took about 20 minutes. Using ~1,000 images per category is probably not necessary for an effective classifier, so you can likely cut down on this time dramatically.

 

Collecting Neighborhood Data

Since deploying my latest web application, a Los Angeles Neighborhood Ranker, I’ve wanted to explain the process of gathering the data. The first step was to decide which neighborhoods to use, what they’re called, and how they’re defined, geographically.

The first stop I made was the LA Times Mapping LA project. It has a mapping feature for many Los Angeles neighborhoods. Each neighborhood has an individual page where the neighborhood is plotted on a map. By inspecting the page’s source code, the coordinates of the neighborhood’s boundary can be discovered:

(From http://maps.latimes.com/neighborhoods/neighborhood/santa-monica/)

"geometry": { "type": "MultiPolygon", "coordinates": 
[ [ [ [ -118.483981, 34.041635 ], [ -118.483766, 34.041430 ]...

This information can be collected and stored for later use via web scraping, using a Python library like urllib2 to open the page and a regular expression to find the coordinates. The following shows the essential steps:

import urllib2
import re

### Code here to define fullurl (the neighborhood page URL)
webpage = urllib2.urlopen(fullurl)
for line in webpage:
  # Capture the bracketed MultiPolygon coordinate string on the "geometry" line.
  d = re.search(r"geometry.+(\[ \[ \[ \[.+\] \] \] \])", line)
  if d:
    ### store d.group(1)

Storing these coordinates to text allows them to be used in mapping and data gathering. Mapping is accomplished with leaflet.js and Mapbox, which I can describe later. For now I will talk about how the neighborhood coordinates help in the data collection process.

For the neighborhood ranker, I needed neighborhood-level data in categories like house price and others. Unfortunately, the existing real estate APIs (Zillow, e.g.) don’t support neighborhood-level data, nor would they be likely to use exactly the same neighborhood definitions.

The Zillow API, which I used for real estate data, only supports single house queries based on street address. So, how to go from single Zillow queries to average house price in a neighborhood? A first natural instinct may be to collect information on every house in a neighborhood and compute the average.

However, perhaps it isn’t necessary to get every value. Consider surveys like the census and employment situation surveys. These gather important information about populations without getting information from every member of the population. Instead they use knowledge of statistics and error to estimate the values and quantify uncertainty.

Thus, getting every house in the neighborhood is unnecessary. But how many are necessary? We estimate the standard error of the mean using sample information:

SE_mean = s / sqrt(n)

Where s is the sample standard deviation and n is the number of samples.

Thus, if the distribution of prices for each neighborhood is relatively tight, you wouldn’t need many samples at all to get a good estimate of population mean.

In general what I found is that only about 5 samples were needed per neighborhood to get a good estimate, with estimated error about 10-20% of sample average. This is sufficient to rank the neighborhoods as they vary in average price from $300,000 to $5,000,000 or about 1500%.
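
As a quick illustration of the sample-size logic (with made-up prices, not actual Zillow data):

import math

# Hypothetical sample of 5 home prices from one neighborhood.
prices = [650000.0, 720000.0, 810000.0, 590000.0, 700000.0]
n = len(prices)

mean = sum(prices) / n
s = math.sqrt(sum((p - mean) ** 2 for p in prices) / (n - 1))  # sample standard deviation
se = s / math.sqrt(n)                                          # standard error of the mean

print('mean = %.0f, SE = %.0f (%.0f%% of the mean)' % (mean, se, 100 * se / mean))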

But how do you generate random samples within neighborhoods? Zillow doesn’t support it. One thing that may help is to first generate a random coordinate-pair within the neighborhood. I use the following:

import random
from shapely.geometry import Point

def get_random_point_in_polygon(poly):
  (minx, miny, maxx, maxy) = poly.bounds
  while True:
    # Draw a uniform point from the bounding box and keep it only if it
    # actually falls inside the polygon (rejection sampling).
    p = Point(random.uniform(minx, maxx), random.uniform(miny, maxy))
    if poly.contains(p):
      return p

The poly argument is a Polygon object from the Shapely library, which defines the neighborhood boundary. The function returns a randomly generated Point object which lies inside the Polygon. We can actually plot the neighborhood and look at the coverage we get with, let’s say, 50 points.

import matplotlib.pyplot as plt
from shapely.geometry import Polygon

xy = coords['santa-monica']   # x and y coordinate lists scraped earlier

xylist = zip(*xy)             # pair the coordinates up as (x, y) tuples
p = Polygon(xylist)
points = [get_random_point_in_polygon(p) for i in range(50)]

ax = plt.subplot()
ax.plot(*xy)                  # draw the neighborhood boundary
for point in points: ax.plot(point.x, point.y, marker='o')

plt.show()

This is the result:

[Figure: 50 random points in the Santa Monica neighborhood.]

There is one more step. Zillow accepts API queries using addresses, not coordinates. You can get an address from coordinates by using a ‘geocoding’ API; I used Google’s, but there are others available. Using the address information, the Zillow API may be called and the price information extracted. A similar method can be used with the Yelp API for restaurant density.

On a technical note, this method may not give a completely random sample from a neighborhood. A random point will be assigned to the nearest address during the geocoding step. Thus, addresses in high density areas are less likely to be chosen. If price correlates with density within a neighborhood, this could yield biased samples.

Ranking Neighborhoods in Los Angeles

My latest web application is an interactive, personalized map-based ranking of neighborhoods in Los Angeles. I’ve spent the last few weeks gathering data points for each of 155 neighborhoods in the Los Angeles area. That on its own could (and will soon) be the topic of its own post.

For now, I wanted to explain the ranking algorithm I use to come up with the rank score.

First, the features I have gathered on each of the neighborhoods:

  • Property Crime
  • Violent Crime
  • Air Quality
  • School Quality
  • Population Density
  • Restaurant Density
  • Price to Buy
  • Price to Rent

How to rank neighborhoods based on these characteristics? How to do so in a way that allows users to choose which features they care about and their relative importance?

First, assign each neighborhood a score within each category. For purposes of simplicity, I made each score from 0 to 100. The best neighborhood in each category would rank 100, the worst 0. But how exactly to determine this? Some features are clearly better when low (price, crime) and some are better when higher (walkability, school quality). Furthermore, there is the choice of whether to apply any other transformations to the data such as log-scaling etc.

So I defined functions that transform the data in desired ways.

import math
def standard(n):
	return n
def neglog(n):
	return -math.log(n)
def neg(n):
	return -n

Now I can define a dictionary pairing each dataset with the desired transformation function.

 fundic = {
     'aqf':neg,      #air quality
     'hpf':neglog,   #house price
     'rpf':neg,      #rent price
     'pcf':neg,      #property crime
     'pdf':standard, #population density
     'rdf':standard, #restaurant density
     'sqf':standard, #school quality
     'vcf':neg,      #violent crime
     'wsf':standard, #walkscore
 }

Python treats functions as first-class citizens, so we can pass a function from the dictionary around like any other variable. Within a loop that iterates over each data set, we select the function corresponding to the data set stem of choice.

 for stem in list_of_stems:
      fun = fundic[stem]

Now, while loading the data set, we apply the function

 featdic = {}
 for line in datafile:
     feats = line.split(',')
     featdic[ feats[0] ] = fun( float( feats[1] ) )

Here we load the raw value into a dictionary keyed by the neighborhood name. These appear together in the data storage file separated by a comma. Now find the max and min.

 vals = featdic.values()
 minval = min(vals)
 maxval = max(vals)

Now to put the data in order and normalize it.

 orderfile = open('name_order.txt','r')
 for line in orderfile:
     name = line.strip()  # strip the newline so the name matches the dictionary keys
     rankscore = ( featdic[name] - minval)/( maxval - minval ) * 100.0

So this gives us a rank score now normalized to 0 to 100 and transformed by negative, log or any other transformation we’d like.

Now to address another question: given the user’s rank order of features, how do we generate a composite ranking? There is no perfect answer to this question. Of course, you could give the user control over the exact weighting, but this seems like unnecessary feature bloat. Instead I decided to weight harmonically. The harmonic series goes like: 1, 1/2, 1/3, 1/4 … This weighting is arbitrary but seems quite natural: first place is twice as important as second, three times as important as third, and so on.

On the server side, I have Flask waiting for the names of the columns to include in the ranking. First I have a copy of the order of the names in the database. (In this case it’s just a list in a .py file, but it could be in a SQL database or other configuration.)

names = ["prop-crime", 
"viol-crime",
"air-qual",
"school-qual",
"pop-dens-h",
"pop-dens-l",
"rest-dens",
"buy-price",
"rent-price",
"walk-score"]

Now I can define a function to get the composite ranking. First get the number of columns.

def getrank(cols):
    n = len(cols)

Load in the names and the list of scores.

    scoremat = scores()   # the 0-100 score matrix, one row per neighborhood
    nameslist = names     # the column-name list defined above

Now calculate the overall weight from n columns (i.e. the sum from 1 + 1/2 + … 1/n)

     norm = sum( [ 1/float(i+1) for i in range(n)] )

And now get the rank and index of each feature in the list.

    rankinf = [(i+1, nameslist.index(j) ) for i,j in enumerate(cols)]

And finally calculate the composite rank for each neighborhood and return it as a list of strings.

    return [ '{0:.1f}'.format( sum( [ 1/float(i)*sl[j] for i,j in rankinf ] ) / norm )
             for sl in scoremat ]
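
As a usage sketch (hypothetical input; the column names come from the list above, in the user's order of importance):

ranked = getrank(["school-qual", "buy-price", "prop-crime"])
# 'ranked' is a list of strings like '73.2': one harmonically weighted,
# 0-100 composite score per neighborhood, in the same order as scoremat.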

SQL: Combining tables without JOIN

SQL is a versatile language–there are many ways to get the job done.

One way to compare information from multiple tables is the JOIN command, but there are sometimes alternatives.

Let’s say I’ve got two tables, office and person:

[Tables: office and person]

Let’s say I need to know which people have printers. I could do a JOIN ON office_room = office_num. On the other hand, if you just select FROM multiple tables at the same time, you can filter with WHERE.

SELECT name, has_printer FROM office, person WHERE office_room=office_num;

[Result: name and has_printer for each person]

It turns out that SELECT FROM tbl1, tbl2 is actually the same as a CROSS JOIN, also known as the Cartesian product of the two tables: it pairs every row of the first table with every row of the second. Use WHERE to keep only the rows you want.

Now let’s say I want to know the number of chairs left in each office. One way would be to JOIN, something like:

SELECT office_num,
num_chairs - count(name)
AS chairs_left
FROM person
JOIN office
ON office_room = office_num
GROUP BY office_num, num_chairs;

But another option is a correlated subquery, where the subquery refers to a column in the outer query, and doesn’t use JOIN at all:

SELECT office_num,
num_chairs -
(SELECT count(name)
FROM person
WHERE office_room = office.office_num)
AS chairs_left
FROM office;

[Result: office_num and chairs_left for each office]

Correlated subqueries do carry a performance hit, since the subquery is evaluated once for each row of the outer query, but they can be quite handy.

Note: These aren’t best practices, just handy syntax to be aware of that is sometimes convenient.

Information Criteria & Cross Validation

A central problem of predictive models is overfitting. This happens when the model is too responsive and picks up trends that are due to quirks of a specific sample rather than general, underlying trends in the data-generating process. The result is a model that doesn’t predict very well. A model can be made less responsive by regularization–i.e. penalizing the model for complexity–and/or by reducing the number of parameters.

Of course, a completely unresponsive model is not correct either, so a balance must be struck. But how much regularization is the right amount? How many parameters is the correct amount? There are (at least) two main ways to test: cross validation & information criteria.

Cross validation

Cross validation is the practice of leaving out some of the data from the training set (20-40% of the data) and using it to select between different models. The notion here is that the leave-out data is ‘fresh’ to the model, and is thus representative of new data that the model would face in production. The various models being selected between can all be tested against the leave-out data, and the one that scores the best is selected.
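
A minimal sketch of the mechanics in Python (synthetic data, no particular modeling library assumed):

import numpy as np

rng = np.random.RandomState(0)
X = rng.normal(size=(100, 3))                # stand-in predictors
y = (X[:, 0] + rng.normal(size=100)) > 0     # stand-in binary response

idx = rng.permutation(len(X))                # shuffle the row indices
n_holdout = int(0.3 * len(X))                # leave out 30% of the data
holdout, train = idx[:n_holdout], idx[n_holdout:]

X_train, y_train = X[train], y[train]
X_holdout, y_holdout = X[holdout], y[holdout]
# Fit each candidate model on (X_train, y_train), score each one on
# (X_holdout, y_holdout), and keep whichever scores best on the holdout set.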

Information criteria

Information criteria are a slightly different approach. In general, they rely on calculating the joint likelihood of observing the data under the model; minus two times the log of that likelihood is a quantity known as the ‘deviance’. An information criterion is some function of the deviance. The Akaike Information Criterion (AIC), for example, is the deviance plus two times the number of parameters. Information criteria, just like cross validation, can be used for model selection.
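
Written out for the AIC, with ln(L) the maximized log-likelihood of the model and k the number of parameters:

AIC = -2 * ln(L) + 2 * k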

Calculating deviance requires a model that includes error, not just a point estimate (otherwise the likelihood of any observed data point is simply zero). In some methods of model generation (e.g. the normal equation for linear regression), the error isn’t explicitly calculated. Thus, information criteria can’t be used for these ‘off-the-shelf’ types of models.

When they are used

Information criteria are often used in the context of Bayesian modeling, when the model explicitly includes error, and determines uncertainty in all parameters. The information criteria are somewhat abstract but seem more soundly based, theoretically speaking.

In contrast, cross validation can be used even when error and uncertainty are not modeled. Additionally, cross validation is highly applied, and the principle makes sense and appeals even to machine learning novices.

Overall, both methods are highly useful and informative. Information criteria may be more sound, theoretically speaking, and may appeal to academic types for this reason. In general, however, more people are likely to be familiar with cross-validation, and it’s probably easier to explain and sell this technique to a non-technical audience.

A brief aside: Video Productions

On a slightly different note, I’d like to showcase my award-winning video production here.

These two won a total of $2500 in a competition sponsored by UCLA Department of Chemical and Biomolecular Engineering:

These were produced for UCLA California NanoSystems Institute.

These two bike safety videos each won $200 in a contest sponsored by UCLA Bicycle Coalition.