Month: March 2018

Lead Scoring with Customer Data Using glmnet in R

Lead Scoring

Lead scoring is an important task for business. Lead scoring is identifying which individuals in a population may convert (purchase) if marketed to, or assigning them a probability of converting, or determining how much value that individual may have as a customer. Properly using data to support this task can greatly benefit your business, and you personally if it is a skill you can bring to bear on customer data sets.

Lead scoring not a strictly defined concept, but can refer to many processes for identifying valuable customers. Depending on context, you may use customer purchase data, demographic data or web activity to inform your model. However, whether web data or demographic data is used as inputs for the model, the mathematical concepts and models discussed here will remain apply equally.

The R code for this post is available here: https://github.com/theis188/lead-scoring/blob/master/Lead%20Scoring.r

Classification Vs. Regression

Machine learning tasks mostly break down into either classification or regression. Regression tasks involve predicting a number (price of a home, for example), classification tasks involve labeling (will a customer convert or not). Lead scoring may be either classification (customer converts or not), or regression (how much will a custmer spend). I will focus on the classification problem. 

Bias & Variance

Bias and variance are two sources of error in parameter estimation. There is typically a trade-off. As error from bias decreases, error from variance increases. Usually there is a happy medium where total overall error is minimized.

Bias creates error when the model is too simple to capture the true nature of the process being modeled. When bias error is high, small changes. Variance error arises when the model is very sensitive to source data. When variance error is high, small changes in the observations used to train the model will drastically impact model parameter estimates.

Evaluating Model Performance

Model performance is frequently evaluated by cross-validation. Some portion of the data is set aside during training. This data, which the model has not seen before, is used to evaluate and choose the best model. There are other ways of evaluating models known as information criteria, but I will focus on cross-validation. So in this post, I’ll be splitting the data into a ‘train’ and ‘test’ set to evaluate model performance.

Statistical Measures of Performance

In lead scoring, a large majority of customers will not convert. Thus, simple metrics like accuracy may not be sufficient to measure performance. In a typical scenario, maybe 5% of customers convert, so a simple classifier that always predicts “no conversion” achieves 95% accuracy. This may sound impressive until you realize you will not find any leads with this classifier!

Precision is the fraction of instances labeled positive by the classifier that are truly positive. This may be very important when the consequences of false positive is high, perhaps a lie detector test. You can think of it is the ‘positive accuracy’ of the classifier.

Recall is the fraction of all positive cases that are correctly labeled by the classifier. In a scenario like disease testing, this may be very important, since the consequences of a missed positive are severe.

The f1-score is the harmonic mean between precision and recall. A perfect f1-score requires both perfect precision and recall.

Choice of Method

It can be hard to choose between different methods of classification for lead scoring tasks. In general, it is best to train many different types of models and compare. If you have a lot of data, using more flexible methods such as QDA or logistic regression with polynomial terms might be preferable, since the impact of variance error will be lower. With less data, you may consider simpler models or regularization, which is a method for penalizing model complexity. Overall model performance can be assessed using cross-validation and test set performance statistics.

Data Source

The data source I will use is the Caravan data set from the ‘Introduction to Statistical Learning’ textbook. It is a data set with 85 predictors and one outcome variable, ‘Purchase’, i.e. whether the customer converted. There are 5,822 customers and 348 (6%) converted. The data are described here: http://liacs.leidenuniv.nl/~puttenpwhvander/library/cc2000/data.html.

Test and Train

First, let’s split our data into test and train:

(The R code for this post is available here: https://github.com/theis188/lead-scoring/blob/master/Lead%20Scoring.r)

 

library(ISLR)
splits <- split_data(Caravan,0.2)
test_splits(splits)

We first load the ISLR package and then split the data. I define a number of helper functions which you can inspect in the source code for this post. The function ‘split_data’ splits the data “Caravan” into test and train (20%) and normalizes it. Normalization is a good practice when we are going to use regularization, which is a method of controlling the bias-variance tradeoff.  We then test the splits for size and normalization using ‘test_splits’.

Logistic Regression

Logistic regression uses a form similar to linear regression. It models probability using a transformation function so that predicted values always lie between 0 and 1. Because of the flexibility and popularity of this method, and the number of implementations available, I will spend most time on it.

There are many implementations of logistic regression in R, I will focus on the glmnet package. By default, glmnet fits 100 different models, each with a different level of regularization. More regularization (L1 by default in this package) will set more coefficients to 0, and reduce variance error.

In my implementation. We fit the models as such:
fit <- glmnet( matrix_from_df(splits$train_x), splits$train_y, family='binomial')

(Again, helper functions defined in source code for this post).

Logistic regression regression works in a method similar to accuracy maximization. In other words, it treats positive and negative examples as equally important! In these cases, classifiers are very reluctant to mark observations as potential conversions, since it’s most likely wrong.

We can test which of the observations had a greater than 50% chance of converting:
 fit <- glmnet( matrix_from_df(splits$train_x), splits$train_y, family='binomial')
 test_predict <- predict(fit,newx=matrix_from_df(splits$test_x) )

In my test/train split, only 3 instances, out of 1110 were marked positive. At least it got all 3 of those correct!
get_confusion_matrix(test_predict[,100]>0,splits$test_y)

     pred
test  FALSE TRUE
 No   1033 0  Yes    74 3

So is there anything we can do? Fortunately there is. By default, the logistic regression uses a cutoff of 50% probability. If the classifier sees a greater than 50% chance of conversion, it marks it positive. We can choose our own cutoff, or decision boundary. Since we know positive conversions are much more important than negative, we could choose, say 10%. If there is even a 10% chance of conversion, we want to mark it as positive.

Let’s use the 10% cutoff and see what levels of regularization are best:

Test_F1

Lambda, here, is a measure of regularization, and it’s usually a ‘not too much, not too little’ situation. We see that the F1 score on the test set is highest for lambda between 0.005-0.010 and is lower outside that range.

Feature Selection

There are a number of methods for feature selection. A simple method is known as L1 regularization. This simply penalizes the model based on the absolute value of all the coefficients. As a result the model sets many of the coefficients to 0. This is the default behaviour of glmnet. We can determine what are the most important variables and how they effect the prediction. In this case, I selected a high level of regularization and output the nonzero coefficients.
(Intercept)    MOPLLAAG MINKGEM   MKOOPKLA PPERSAUT   PBRAND APLEZIER
-2.85973655 -0.02207064  0.01358764 0.04784260 0.35163469  0.03212729 0.08691317

You can then use inference to guide your customer acquisition behavior. Here, PPERSAUT is highly positive, so it is highly correlated with conversion. I redid the model several times with different train/test splits and PPERSAUT appears to be significant most of the time.

In the data source, the response variable is if the customer purchased mobile home insurance or not. PPERSAUT is whether or not the customer has car insurance. Thus, it seems someone with one type of insurance is more likely to buy other kinds.

There are other methods for feature selection such as forward selection and backward selection, but I won’t discuss those here.

Moving From Statistical Metrics to Financial Metrics

In a business context, statistical metrics like f1-score are important, but less important than financial metrics like expected revenue and expected profit. Let’s assume that every customer costs $1 to market to and if they convert, it generates a profit of $10.

Here, I choose a low, medium and high lambda and vary the cutoff probability. We can now calculate an expected profit for every version of the model and pick the best:

Profit

It looks like about 8% gives the highest profit for each of the values of lambda. The high lambda model can generate the highest profit in this case, but the result seems unstable. It would be a good idea to perform k-fold cross validation on this particular model and test it on other random subsets of the data to ensure good performance. In general, the results for low and medium lambda models look relatively stable and repeatable.

Other Methods

There are a handful of other methods that are good for lead scoring classification.

Linear & quadratic discriminant analysis (LDA & QDA) approaches fit multivariate gaussian distributions to each class in the response variable. LDA assumes the same covariance in each class, and as a result has only linear decision boundaries. QDA fits covariance within each class and thus allows for more complex decision boundaries. QDA may be better if you have more data, or fewer predictors while LDA may be better with less data or more predictors.

K-nearest neighbors (KNN) is a non-parametric approach, meaning it makes no assumptions about the form of the relationship between variables and response. For any given instance, the KNN looks at the nearest K labeled instances and predicts the majority class. The only hyerparameter to fit is K, lower K is very flexible, higher K is less flexible.

Support vector machines (SVM) can be very good if the relationship between predictors and response is non-linear. SVM is actually similar to linear regression, but uses a function called a kernel and the so-called ‘kernel trick’ to find similarity between instances in a higher dimensional space. SVM is worth a try since it can find important non-linear relationships. However, as with many sophisticated modeling strategies, model intelligibility, or understanding of the significance of different factors, is decreased.

More Sophisticated Models

The difficulty of implementing more sophisticated models like neural networks and gradient boosting has decreased. For some classes of data, such as images, audio and text, neural networks offer much superior performance.

In business-oriented contexts, simple models can be better. You will need to explain the model and get buy-in. Simple models tend to be easier to implement and have simpler failure modes. They can also be important for inference, which means reasoning about the relationship between inputs and outputs. For example, if you notice that a certain zip code has customers that are highly likely to convert, perhaps you decide to invest in more customer acquisition in that area.

An additional consideration is that distributions of customers and customer preference will change over time. Voice and speech classification models will remain nearly fixed over time, allowing for the use of exquisitely fit neural network models. On the other hand, simple, robust, low variance, models may provide better performance in a business context where customers tend to evolve over time.

Being Aware of Data Sourcing

There are many biases and sources of error to avoid. One key source of bias I will call the ‘gathered data bias’. This means the source of your training data is not the same as the source of data the model will be applied to. For example, in marketing, you may have good conversion data on a set of customers that the marketing team selected. However, the marketing team may have selected based on who they thought would convert, rather than being a random sample.

If you apply the trained model to the population in general, you may not get the results you expect. For instance, let’s say marketing selects only customers in one income band. It is impossible to calculate an impact from customer income, so the true significance of this variable will not be discovered, and the model may not perform on customers in other income bands. This highlights the importance of investing in data. To train a universal model, you will have to get universal data, which means marketing to a totally random sample of the population.

References & Mathematical Detail

For mathematical details, there are excellent treatments available free.

For an introduction, Andrew Ng’s Coursera course ‘Machine Learning’ is good resource:

https://www.coursera.org/learn/machine-learning/

The PDF version of ‘Introduction to Statistical Learning’ is available free from the authors and is highly recommended. Much of this post is based around this book.

http://www-bcf.usc.edu/~gareth/ISL/

‘Elements of Statistical Learning’ is a much more rigorous treatment.

https://web.stanford.edu/~hastie/ElemStatLearn/

Advertisements