How I got into data science

In the third year of my PhD, I had a decision to make.

Research, especially the academic kind, just wasn’t working for me. I was getting results and publications, but I knew it could never be a lifelong career for me. Things moved slowly. I disliked the territorialism and the feeling of being a master of abstractions, outside the productive mainstream of the economy. Then there were the practical considerations: many years of post-doctoral drudgery stood between me and any long-term academic job, there was no guarantee of ever landing one, and in the meantime the salary was a pittance. If not academia, what would my career be?

Management consulting

I had heard about management consulting and decided to look into it. I liked the idea of helping businesses with difficult challenges. At around this time, companies like McKinsey and Boston Consulting Group were ramping up their recruitment of PhD graduates. I went to several info sessions and continued on the path of management consulting.

‘Case interviews’ are the main interview format at management consulting companies: you are asked to analyze a business problem and make a recommendation. They require an understanding of different kinds of businesses, mental math, and communication and presentation skills. I practiced case interviews for several weeks and began to improve.

Deciding on data science

One day in the Fall of 2015 while riding my bike to UCLA, I changed my mind. To this day, I believe I had the ability to go into management consulting, but not the drive to be excellent. While I found the case interviews interesting on a surface level, I couldn’t help wondering if there were something else that better fit my interests. While leaving the door to management consulting open, I decided to spend my time developing my data science skills instead.

My research in computational biology was very mathematical. I spent most of my days tinkering with MATLAB code that represented systems of biochemical reactions. The crux of the research was deeply tied to something called stability theory. Dynamical systems, whether they represent biochemistry or aircraft, can have an essentially stable or unstable nature. Our research was predicated on the idea that stability was important for biological systems. While all of this is very theoretical, mathematical and interesting in its own ways, it was a long way from data science.

Still, there was enough overlap that a move to data science made sense. Two former members of my lab had made exactly that move and found jobs at prominent Bay Area companies. I also thought the work of data science was probably a better fit for my personality.

Starting the journey

I went to UCLA events for PhDs interested in non-academic careers and attended a panel discussion on data science. I began networking with everyone I could. I asked sympathetic professors whether any former students had become data scientists, and found several meaningful connections.

When I told my wife I wanted to be a data scientist, she bought me a book called Data Science for Business, which I still think is probably the best overall entry point for people interested in data science.

From these early days, I spent nearly all my free time and a good portion of my “work” time on learning and practicing data science. I had a lot of energy and commitment to my goal, but I had no idea how much work was in front of me.

My first side project

My initial forays were more about web development than data science. I instinctively knew that for my side projects to impress potential employers, they had to look good and be on the web, a click away in any browser window.

A family member had some credits for Microsoft Azure, so I had access to a Linux VM for development and experimentation (I later found out you can get essentially the same thing through the Amazon Web Services free tier). I started the long process of piecing together how the internet works, what a web server is, what a web application is, and how to deploy one.

My first side project (since decommissioned) was a solar panel calculator. Every hour, it downloaded data from the National Weather Service for the projected cloud cover all over the US. Processing this data turned out to be a huge challenge and almost derailed the entire project. The data is transmitted in a format called GRIB which is specified by the World Meteorological Organization.

I remember downloading a hex editor, looking at the GRIB data byte by byte, and matching it up to the specification. Eventually I discovered a C library with Python bindings that could parse the GRIB data for me. After several days of attempts, I managed to get it installed on my server. It provided a relatively convenient, if slow, way to query the cloud forecast data. This first success, however insignificant, felt huge to me at the time, and fueled my confidence to continue.
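The post doesn’t name the library, but pygrib, a Python wrapper around a C GRIB decoder, is one way to read these files; a minimal sketch, with a hypothetical file name:

import pygrib

# 'cloud_forecast.grib2' is a hypothetical name for an NWS sky-cover GRIB file
grbs = pygrib.open('cloud_forecast.grib2')
for grb in grbs:
    print(grb)                      # one line per GRIB message: parameter, level, forecast time
grbs.rewind()

grb = grbs.message(1)               # pull the first message
values, lats, lons = grb.data()     # decoded grid values with their coordinates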

After solving my GRIB challenge, I found another calculator, implemented in Javascript, that calculated the sun’s elevation at any time at any place on earth, specified by latitude and longitude. I also tracked down a database of zip codes and latitude/longitude coordinates. The final application would calculate solar intensity at any zip code in the US every hour for the next week based on the predicted cloudiness and calculated solar elevation. It would then plot this using D3.js.
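The JavaScript calculator itself isn’t reproduced here, but the geometry behind it is standard. A rough Python sketch of the solar-elevation piece (a simplified approximation that ignores the equation of time and time-zone corrections, so treat the numbers as illustrative):

import math

def solar_elevation(lat_deg, day_of_year, solar_hour):
    # Approximate solar declination for the given day of year
    decl_deg = -23.44 * math.cos(math.radians(360.0 / 365.0 * (day_of_year + 10)))
    hour_angle_deg = 15.0 * (solar_hour - 12.0)   # 15 degrees per hour from solar noon
    lat, decl, ha = map(math.radians, (lat_deg, decl_deg, hour_angle_deg))
    elevation = math.asin(math.sin(lat) * math.sin(decl) +
                          math.cos(lat) * math.cos(decl) * math.cos(ha))
    return math.degrees(elevation)

# e.g. Los Angeles (latitude ~34.05) at solar noon on June 21 (day ~172): about 79 degrees
print(solar_elevation(34.05, 172, 12))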

All told, it took several months of painstaking trial and error to finally get it launched in early 2016. It was a powerful moment of self-actualization. It made me realize I could actually do it: I could actually become a data scientist. I had slogged through interminable weeks of syntax errors and clueless flailing on the Linux command line, and worked through self-doubt to build a working math-based web application.

Leveling up

While my first successful side project increased my confidence, building a solid theoretical background was also essential. I took online courses, read textbooks and practiced my Python skills. I started with Coursera’s Machine Learning course. It’s a technically rigorous course and, I believe, has shaped the outlook and vocabulary of the entire field. Next I took the Biostatistics courses, which are also excellent and rigorous. For all the materials I’ve covered, check this post.

Deep Dive

In Spring 2016, a program called Deep Dive Data Science was advertised on the UCLA campus. A former UCLA PhD student was looking for students to mentor in data science. There was a series of lectures on algorithms and machine learning, which were perfect for me. The algorithms portion turned out to be an important part of interviewing for tech jobs, including those in data science.

There were also a number of challenging homework assignments, which I kept in a git repository. As summer approached, I started working on another side project, which is still active. It required more practical data science skills like data scraping, score functions and data visualization. After deploying a nice, visually appealing and accessible side project, and with my technical foundations in place, I finally felt ready for the job market.

Applying for jobs

The application process was painful, as most people have experienced. Despite the hype about plentiful data science jobs, it was a tough road with a lot of rejection. I started applying in June 2016, and carefully tracked all applications and any other interactions with employers. I suspected my Bioengineering degree was a disadvantage relative to other fields like Statistics or Computer Science.

The months dragged on, and I set my thesis defense date in November. I had gotten a few phone interviews, but they weren’t leading to anything yet. If I graduated without an offer, my back would really be against the wall.

Finally in October, I got my break: a data science internship at a company that at the time was called Demand Media. I had applied through the normal channels and completed a coding challenge. After an onsite interview that went well, I got an offer. I could have felt let down that all I could manage to find after completing a PhD was an internship, but I was ecstatic. It was the ultimate validation of my efforts. After more than a year of intense preparation and study, I had my data science job. I knew that the first job was the hardest, many people had told me so. And here I was getting over that monumental hurdle.

A year and a half after starting the internship, it could not have worked out better. After a few months I was hired full time. I enjoy the work and I have no plans to change jobs for the time being. My work is valued and the company culture and prospects are good. Still, I am frequently contacted by recruiters and am confident I could get another job if I wanted to.

Still learning

I continued to build out my technical skills with additional courses and textbooks. At work, I also learned a lot about how a tech stack works and how to use Python for data analysis and transformation.

I’m now broadening into other fields including information security, business, management, finance, public speaking and negotiation. It’s hard to say exactly what these tools will do for me, but they have great general utility and I suspect they will be useful in some way or other in the future. I learned to trust my instincts in my data science hunt and I will continue to do so.

Lessons learned

The number one lesson I took away from the experience is to start small and build your confidence. We all know confidence is important. But how do you build it? It’s not something that can be manufactured out of nothing. In my experience, it has to be based on real accomplishments, especially the ones you set for yourself. It doesn’t matter how big the accomplishments look to other people. Even small accomplishments boost your confidence. Set a direction and accomplish something. If the bar is low enough that you can actually clear it, you’ll have something to celebrate and the energy to keep going.

Another lesson is to give your dreams time to grow, and to pursue them intensely. In my case it took over a year to finally land a data science job, and even then it was only an internship at first. All along I was pushing myself and learning. This lesson has a caveat: not all dreams are possible. My dream of data science could have ended in failure. The calculation of whether or when to change paths is difficult, and we all must make it based on our own individual situations.

Data science is not for everyone and that’s OK

This is a tricky path to navigate for many people who become interested in data science. I was gripped by a manic drive that got me to the finish line. Others who were at one time interested in data science didn’t have the same level of intensity and have moved on to other interesting and productive careers. I probably won’t be in data science for my whole career either. The first thing I want to tell people when they ask about careers in data science is to explore other options too. Data science had its moment and people got interested because of the hype, but maybe there’s something else they could find that would suit them much better.


Analyzing resumes with data science

Ever wonder if there is a “secret code” for resumes, some key words that will actually make you stand out? It turns out there are indeed some very characteristic differences between experienced and novice resumes.

Gathering Resume Data

Resumes from an indeed.com resume search were used to analyze the differences between experienced and inexperienced resumes. For maximum contrast, “Less than 1 year” and “More than 10 years” of experience resumes were included. To cover the full range of job titles, I searched for every job title included in the Bureau of Labor Statistics Occupational Outlook Handbook. I downloaded the text of 50 resumes of each job title and each experience level.

For a copy of the data used in this analysis, contact me.

Cleaning Resume Text

The full code I used for this analysis can be found in my notebook on GitHub.

The first step is some basic text extraction and cleaning. I saved my data in a JSON file, so that has to be parsed first. All punctuation was removed. For some kinds of textual analysis we want to preserve sentence structure; this is the case for techniques that use word order, like RNNs and skipgram/word2vec/doc2vec. In this case, however, we’re only looking at word and document frequency, so ordering and sentence demarcation don’t matter.

There was a string that was inserted by Indeed into a lot of resumes:

see more occupations related to this (activity|skill|task)

I also removed this string. Additionally, if a resume was an exact duplicate of another after processing, that probably indicates a duplicate posting, so duplicates were counted only once.
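A minimal sketch of this cleaning step, assuming the scraped text was stored as a list of records in a JSON file (the file name and record structure here are assumptions, not the notebook’s exact code):

import json
import re
import string

with open('resumes.json') as f:
    raw = json.load(f)            # e.g. a list of {'text': ..., 'experience': ...} records

boilerplate = re.compile(r'see more occupations related to this (activity|skill|task)')
punct_table = str.maketrans('', '', string.punctuation)

seen = set()                      # a set drops exact duplicates automatically
docs = []
for rec in raw:
    text = rec['text'].lower()
    text = boilerplate.sub(' ', text)          # strip the Indeed boilerplate string
    text = text.translate(punct_table)         # remove punctuation
    text = ' '.join(text.split())              # collapse whitespace
    if text not in seen:
        seen.add(text)
        docs.append(text)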

Analyzing Resume Text

After processing and cleaning the resume text, we are left with a corpus of documents–each document a string of lower case words. We can now move to analyzing these resumes.

There are some fairly simple things we could do to analyze word differences between different types of resumes. For example, we could count all the words in the corpus and look at how many times each word appears, then break the counts down by sub-corpus (high and low experience) and look for the largest differences. This might yield some interesting results. But common words like ‘the’ or ‘is’ will show some difference between the groups, whether it reflects a real difference in the populations or just sampling noise, and those differences will appear largest when we look only at raw term frequency.

We are probably more interested in the differences among words that are distinctive. For this we can use TF-IDF to vectorize our documents, so that rare words are weighted more highly. It’s worth pointing out that you can decide how many words to include in the document vectors. Too many, and calculations slow down and you will be swamped with possibly spurious results; too few, and you will not see any interesting results. I decided to limit it to 8,000 words on a corpus of about 2,000 documents.

After vectorizing the documents, we can average the tfidf score for each word across each sub-corpus and take the difference in average score (e.g. experienced minus inexperienced). If we order the words in order of score difference (ascending), we will get the words in order of distinctively inexperienced to distinctively experienced.
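A minimal sketch of this step with scikit-learn, assuming docs and labels hold the cleaned resume strings and their experience group (the names are assumptions, not the notebook’s exact code):

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(max_features=8000)
X = vectorizer.fit_transform(docs)                      # documents x 8,000 vocabulary terms
terms = np.array(vectorizer.get_feature_names_out())
labels = np.array(labels)

exp_idx = np.where(labels == 'experienced')[0]
inexp_idx = np.where(labels == 'inexperienced')[0]
mean_exp = np.asarray(X[exp_idx].mean(axis=0)).ravel()
mean_inexp = np.asarray(X[inexp_idx].mean(axis=0)).ravel()

diff = mean_exp - mean_inexp                            # ascending: inexperienced -> experienced
order = np.argsort(diff)
print(terms[order[:5]])     # most distinctively inexperienced words
print(terms[order[-5:]])    # most distinctively experienced words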

For example, these are the top five most experienced and inexperienced words:

[Figure: the five most distinctively experienced and the five most distinctively inexperienced words]

Getting Automatic Word Recommendations Using Word Vectors

Once we have the distinctively inexperienced words, we would like to know what experienced words to use instead. We could inspect each inexperienced word and choose corresponding experienced words manually. We can also automate this process with word vectors.

We previously used TF-IDF scores to generate document vectors, i.e. a numerical representation of a document. Word vectors are a numerical representation of an individual word. Using word vectors, we can automatically identify words with similar meanings. Pre-trained word vectors, trained on various corpora such as Twitter or Wikipedia, are available to download from Stanford’s website.

After loading these word vectors, we can map from each inexperienced word to the most similar experienced words. Here is a sampling of the results, which you can see in full in the notebook. Below, the words to the left of the arrow are inexperienced, and those to the right are experienced.

[Figure: sample mappings from inexperienced words to their most similar experienced words]

I really like “excellent” -> “quality”
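For the curious, a minimal sketch of how this mapping could be automated with gensim and pre-trained GloVe vectors, assuming inexperienced_words and experienced_words are the ranked word lists from the TF-IDF step (the names and the particular vector set are assumptions):

import gensim.downloader as api

glove = api.load('glove-wiki-gigaword-100')   # one of several pre-trained GloVe sets

for w in inexperienced_words:
    if w not in glove:
        continue
    in_vocab = [e for e in experienced_words if e in glove]
    if not in_vocab:
        continue
    # pick the experienced word whose vector is closest to this inexperienced word
    best = max(in_vocab, key=lambda e: glove.similarity(w, e))
    print(w, '->', best)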

Curated Resume Recommendations

Not all of the recommendations make total sense. This is a typical result for machine-aided natural language processing (NLP) tasks. The machines get us close, but often we need a little bit of manual curation to get the best recommendations. I manually selected what I thought were some of the best recommendations from the list.

  • “assisted with” -> “provided support for”
  • “helping” or “assisting” -> “supporting”
  • “problem” -> “need”
  • “good” or “great” -> “quality”
  • “organize” -> “coordinate”
  • “customer” -> “client”
  • “task” -> “projects”
  • “academic” -> “technical”
  • “communication” -> “management”
  • “cleaning” -> “maintenance”

Conclusions

So what are the conclusions that we can draw from this? Are there any overall trends with these words?

It seems that experienced words are more general in meaning. Academic becomes technical, assisted becomes supported, excellent becomes quality. There is a lack of specificity in these terms. As you gain experience in your career and move up, you have to be able to talk about your work in more general terms. Engineers talk to other engineers mostly, but engineering managers must talk to accountants, account managers, executives and so forth. The trend of more general resume words mirrors this career arc.

As well as being more general in meaning, the experienced words tend to carry fewer connotations. ‘Cleaning’ is not inherently any less important than ‘maintenance’, but it is associated with lower-status jobs, while ‘maintenance’ is neutral in its inflection.

Finally, some of it could just be jargon. There are trends and fashion in business language. Being able to use the right jargon is a way of showing you pay attention and can keep up with the latest trends. In fact, if more inexperienced people start to move in and co-opt the experienced words, the jargon will definitely shift.

Overfitting and Supervised Learning

I want to make one final note about the possibility of “overfitting” in this analysis. I did not test the generalizability of these trends by splitting into train and test data and building an actual model such as logistic regression, so the findings are not as definitive as they could be. Unfortunately, the amount of training data available is somewhat limited and the dimensionality of the document vectors is high. Some potential approaches here could include clustering or dimensionality reduction. I may revisit this data set in the future.

What I’m reading/watching/taking

Books, videos and courses I’ve done, am doing or want to do, for the curious and for my own reference.

Read/watched/took:

Reading/watching/taking:

Want to read/watch/take:

What should I add?

Linear Regression vs. Decision Trees: Handling Outliers

In regression tasks, it’s often assumed that decision trees are more robust to outliers than linear regression. See this Quora question for a typical example. I believe this is also mentioned in the book “Introduction to Statistical Learning”, which may be the source of the notion. Predictions from a decision tree are based on the average of the instances at the leaf. This averaging effect should reduce the effect of outliers, or so the argument goes.

However, in a notebook, I demonstrated that decision trees can sometimes react more poorly to outliers. Specifically, an outlier increased the sum of squared errors more in a decision tree than in a linear regression. Depending on the specifics, a tree can put an outlier on its own leaf, which can lead to some spectacular failures in prediction.
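A small synthetic sketch of this kind of comparison (not the original notebook’s code or data): fit each model with and without a single injected outlier and compare how much the test-set squared error grows.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)

# Simple linear ground truth with noise; a dense noise-free grid for evaluation
X_train = np.sort(rng.uniform(0, 10, 200)).reshape(-1, 1)
y_train = 2 * X_train.ravel() + rng.normal(scale=1.0, size=200)
X_test = np.linspace(0, 10, 1000).reshape(-1, 1)
y_test = 2 * X_test.ravel()

y_outlier = y_train.copy()
y_outlier[100] += 100            # inject a single large outlier

for name, model in [('linear', LinearRegression()),
                    ('tree', DecisionTreeRegressor(random_state=0))]:
    sse_clean = ((model.fit(X_train, y_train).predict(X_test) - y_test) ** 2).sum()
    sse_dirty = ((model.fit(X_train, y_outlier).predict(X_test) - y_test) ** 2).sum()
    print(name, 'SSE increase due to outlier:', round(sse_dirty - sse_clean, 1))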

This is not to say that linear regression is always, or even generally better at handling outliers. Rather, it is best not to assume one or the other technique will outperform in all cases. As usual, the specifics of a given application should dictate the techniques used.

Data vignette: men are worse drivers

This post is based on a notebook I wrote a couple years ago. I’d like to revisit and expand on it, as well as correct some errors. The original notebook is here.

In this post, I analyze traffic collision data from Los Angeles County in January 2012. The analysis is sound, but the conclusion in the title is meant to be taken with a grain of salt, due to the limitations of the data source, including the time span and specific location.

The source of the data is a California state traffic data system called SWITRS (Statewide Integrated Traffic Records System). The system allowed for the download of about 5,000 incidents at a time. The data are submitted by local law enforcement agencies based on traffic incidents they respond to. The system allows downloads in different categories: victims, collisions and parties. The parties data includes all parties in a collision, including parties at fault and not at fault. This becomes important in the contextualization of the data.

Male-female, Fault-no fault

The data contains many dimensions that allow for much more in-depth analysis. However, for our purposes, the analysis is quite simple.

The key to this analysis is to bundle up the data into a two-by-two contingency table and interpret it. Using pandas, the following result can be obtained:

              male   female
at fault      2231     1369
not at fault  2577     1938
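A table like this can be produced with pandas, for example (the file and column names here are hypothetical, not the actual SWITRS field names):

import pandas as pd

parties = pd.read_csv('switrs_parties.csv')
table = pd.crosstab(parties['at_fault'], parties['sex'])   # 2x2 fault-by-sex counts
print(table)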

We can clearly see that for drivers at fault, males outnumber females 2231 to 1369, immediately suggesting that males are far worse drivers when it comes to causing collisions.

On the assumption that males and females make up roughly equal portions of the population, we would expect that if they drive with equal safety, they should be equally represented among drivers at fault for collisions.

If you’re paying attention, you probably noticed a problem with this logic. While the population split is roughly equal, men may be over-represented as drivers, since they are more likely to be in the driving and delivery professions. In fact, with this data we can make some estimates about that.

Correcting errors in reasoning

In my notebook, I made an error in summing the at-fault and not-at-fault parties as a representation of the overall driving population. This would have been a reasonable move if we had counted all drivers who were not at fault, not just those who were in a collision. However, that is not feasible.

Instead, we take only the drivers in a collision who were not at fault as a representation of the drivers on the road at any given time. As far as we know, there is nothing special about these drivers; they just happened to be in the wrong place at the wrong time. The total number of drivers in LA County (nearly all of whom are not at fault) is much higher, roughly 5 million [1,2]. Under our assumptions, the ratio of male to female drivers is 2577:1938, or 1.33:1. Scaled out, that is about 2.85 million male to 2.15 million female drivers.

In comparison, the number of drivers at fault is quite small and does not meaningfully change the overall number of drivers.

Now calculate the overall difference

The rate of male drivers causing an accident is thus 2231/2.85/100 = 7.83 drivers at fault per 10,000 male drivers.

The rate of female drivers causing an accident is 1369/2.15/100 = 6.37 drivers at fault per 10,000 female drivers.

The ratio of risk is then 7.83/6.37 = 1.23, or about a 23% greater risk for male vs. female drivers.
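As a quick sanity check, the same arithmetic in Python, using the counts and population estimates above:

male_at_fault, female_at_fault = 2231, 1369        # at-fault party counts from the table
male_drivers, female_drivers = 2.85e6, 2.15e6      # estimated driving population split

male_rate = male_at_fault / male_drivers * 10000       # at-fault drivers per 10,000 male drivers
female_rate = female_at_fault / female_drivers * 10000
print(male_rate, female_rate, male_rate / female_rate)  # roughly 7.8, 6.4, 1.23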

Conclusion: men are, on average, worse (more dangerous) drivers

 

Notes

[1] The actual number used here is an estimate. Intermediate values are therefore accurate only to the accuracy of the estimate. The number is important to show scale relative to the numbers of drivers at fault. However, the actual number would cancel out in the calculation of ratio between male and female risk.

[2] Using only the number of drivers here is a bit misleading, since drivers are not on the road for equal amounts of time. It is reasonable to expect that the number of drivers is roughly equal between men and women, but that men spend more hours driving, as they are more likely to be in the driving and delivery professions. Instead, we could estimate 5 million drivers * 30 hours / month = 150 million driver-hours / month. Scaled out, that is about 85.6 million male to 64.4 million female driver-hours. The accident rates would then be 2231/85.6/100 = 0.261 drivers at fault per 10,000 male driver-hours and 1369/64.4/100 = 0.213 drivers at fault per 10,000 female driver-hours.

For presentations, focus on narrative

You’ve collected the data, you’ve run the analysis, now you have to decide how to present. You’ve considered it from every angle, and you’re preparing a slide deck to match–detailed, lengthy and technical. Is this the right approach? Probably not.

Rule of thumb: include no more than one figure per topic

When you’re the technical expert, people aren’t usually trying to prove you wrong. They expect you to have done the work correctly, until you give them reason to think otherwise. You don’t have to take them through the entire process of analyzing the data. One figure is enough to make the point and move on; more than that becomes boring and forgettable.

This was an adjustment for me. Academic seminar talks are all technical, and audience questions take the presenter through the weeds.

There are exceptions to this rule, but for most audiences, one figure is the limit of attention. Choosing one figure to make the point will help to clarify your message. That said, be ready to answer questions from many angles.

If they don’t want data, what do they want? Narrative

Instead of spending most of your time thinking about and preparing figures, spend it thinking about narrative and storytelling. What are the keys?

A crucial element of narrative is an upward or progressive arc. We like companies that grow, societies that evolve, people that improve themselves and narratives that move from bad to better:

“We had some trouble early in Q2, but learned our lesson and fixed the problem.”

“We sustained initial losses but our heavy investments paid off in the long run.”

Another part of memorable narratives is the Goldilocks principle of not too much, not too little:

“We were over-invested in customer service. We were able to reduce costs without affecting our feedback scores.”

“This model achieves the right balance of simplicity and complexity and will maximize profit”

Another type of narrative that appeals to me is the contrarian narrative: the experts say X, but here’s why they’re wrong. This one’s more of a personal taste and should be used carefully.

Narratives operate at different levels and in different contexts

Business executives craft a narrative about their business that must satisfy investors, motivate employees, generate positive PR, and be stable over time.

In a similar way, data scientists must craft narratives about data that are consistent with the facts, satisfying to leadership, palatable to stakeholders and memorable.

So next time you’re presenting your data, think about narrative, and see how much impact you can have.

Lead Scoring with Customer Data Using glmnet in R

Lead Scoring

Lead scoring is an important task for business. It means identifying which individuals in a population may convert (purchase) if marketed to, assigning them a probability of converting, or estimating how much value an individual may have as a customer. Properly using data to support this task can greatly benefit your business, and benefit you personally if it is a skill you can bring to bear on customer data sets.

Lead scoring is not a strictly defined concept; it can refer to many processes for identifying valuable customers. Depending on context, you may use customer purchase data, demographic data or web activity to inform your model. Whether web data or demographic data is used as input, the mathematical concepts and models discussed here apply equally.

The R code for this post is available here: https://github.com/theis188/lead-scoring/blob/master/Lead%20Scoring.r

Classification Vs. Regression

Machine learning tasks mostly break down into either classification or regression. Regression tasks involve predicting a number (the price of a home, for example), while classification tasks involve assigning a label (will a customer convert or not). Lead scoring may be framed as either classification (does the customer convert?) or regression (how much will a customer spend?). I will focus on the classification problem.

Bias & Variance

Bias and variance are two sources of error in parameter estimation. There is typically a trade-off. As error from bias decreases, error from variance increases. Usually there is a happy medium where total overall error is minimized.

Bias creates error when the model is too simple to capture the true nature of the process being modeled. Variance error arises when the model is very sensitive to the training data: when variance error is high, small changes in the observations used to train the model will drastically change the parameter estimates.

Evaluating Model Performance

Model performance is frequently evaluated by cross-validation. Some portion of the data is set aside during training. This data, which the model has not seen before, is used to evaluate and choose the best model. There are other ways of evaluating models known as information criteria, but I will focus on cross-validation. So in this post, I’ll be splitting the data into a ‘train’ and ‘test’ set to evaluate model performance.

Statistical Measures of Performance

In lead scoring, a large majority of customers will not convert. Thus, simple metrics like accuracy may not be sufficient to measure performance. In a typical scenario, maybe 5% of customers convert, so a simple classifier that always predicts “no conversion” achieves 95% accuracy. This may sound impressive until you realize you will not find any leads with this classifier!

Precision is the fraction of instances labeled positive by the classifier that are truly positive. This may be very important when the cost of a false positive is high, as with a lie detector test. You can think of it as the ‘positive accuracy’ of the classifier.

Recall is the fraction of all positive cases that are correctly labeled by the classifier. In a scenario like disease testing, this may be very important, since the consequences of a missed positive are severe.

The F1 score is the harmonic mean of precision and recall. A perfect F1 score requires both perfect precision and perfect recall.
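For reference, the standard definition is:

F1 = 2 × (precision × recall) / (precision + recall)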

Choice of Method

It can be hard to choose between different methods of classification for lead scoring tasks. In general, it is best to train many different types of models and compare. If you have a lot of data, using more flexible methods such as QDA or logistic regression with polynomial terms might be preferable, since the impact of variance error will be lower. With less data, you may consider simpler models or regularization, which is a method for penalizing model complexity. Overall model performance can be assessed using cross-validation and test set performance statistics.

Data Source

The data source I will use is the Caravan data set from the ‘Introduction to Statistical Learning’ textbook. It is a data set with 85 predictors and one outcome variable, ‘Purchase’, i.e. whether the customer converted. There are 5,822 customers and 348 (6%) converted. The data are described here: http://liacs.leidenuniv.nl/~puttenpwhvander/library/cc2000/data.html.

Test and Train

First, let’s split our data into test and train:

(The R code for this post is available here: https://github.com/theis188/lead-scoring/blob/master/Lead%20Scoring.r)

 

library(ISLR)
splits <- split_data(Caravan,0.2)
test_splits(splits)

We first load the ISLR package and then split the data. I define a number of helper functions which you can inspect in the source code for this post. The function ‘split_data’ splits the data “Caravan” into test and train (20%) and normalizes it. Normalization is a good practice when we are going to use regularization, which is a method of controlling the bias-variance tradeoff.  We then test the splits for size and normalization using ‘test_splits’.

Logistic Regression

Logistic regression uses a form similar to linear regression. It models probability using a transformation function so that predicted values always lie between 0 and 1. Because of the flexibility and popularity of this method, and the number of implementations available, I will spend most time on it.

There are many implementations of logistic regression in R; I will focus on the glmnet package. By default, glmnet fits 100 different models, each with a different level of regularization. More regularization (L1 by default in this package) sets more coefficients to 0 and reduces variance error.

In my implementation, we fit the models as follows:
fit <- glmnet( matrix_from_df(splits$train_x), splits$train_y, family='binomial')

(Again, helper functions defined in source code for this post).

Logistic regression works in a way similar to accuracy maximization; in other words, it treats positive and negative examples as equally important. With a class as imbalanced as ours, the classifier is very reluctant to mark observations as potential conversions, since any individual positive prediction is most likely wrong.

We can test which of the observations had a greater than 50% chance of converting:
 fit <- glmnet( matrix_from_df(splits$train_x), splits$train_y, family='binomial')
 test_predict <- predict(fit,newx=matrix_from_df(splits$test_x) )

In my test/train split, only 3 instances out of 1,110 were marked positive. At least it got all 3 of those correct!
get_confusion_matrix(test_predict[,100]>0,splits$test_y)

      pred
test   FALSE  TRUE
  No    1033     0
  Yes     74     3

So is there anything we can do? Fortunately there is. By default, the logistic regression uses a cutoff of 50% probability. If the classifier sees a greater than 50% chance of conversion, it marks it positive. We can choose our own cutoff, or decision boundary. Since we know positive conversions are much more important than negative, we could choose, say 10%. If there is even a 10% chance of conversion, we want to mark it as positive.

Let’s use the 10% cutoff and see what levels of regularization are best:

[Figure: test-set F1 score as a function of lambda, using the 10% probability cutoff]

Lambda here is a measure of regularization, and it’s usually a ‘not too much, not too little’ situation. We see that the F1 score on the test set is highest for lambda between about 0.005 and 0.010 and lower outside that range.

Feature Selection

There are a number of methods for feature selection. A simple one is L1 regularization, which penalizes the model based on the absolute values of the coefficients. As a result, the model sets many of the coefficients to 0. This is the default behavior of glmnet. We can then determine which variables are most important and how they affect the prediction. In this case, I selected a high level of regularization and output the nonzero coefficients:
(Intercept)  -2.85973655
MOPLLAAG     -0.02207064
MINKGEM       0.01358764
MKOOPKLA      0.04784260
PPERSAUT      0.35163469
PBRAND        0.03212729
APLEZIER      0.08691317

You can then use inference to guide your customer acquisition strategy. Here, PPERSAUT has a large positive coefficient, so it is strongly associated with conversion. I refit the model several times with different train/test splits, and PPERSAUT appeared to be significant most of the time.

In the data source, the response variable is whether the customer purchased mobile home insurance. PPERSAUT is whether or not the customer has car insurance. Thus, it seems someone with one type of insurance is more likely to buy other kinds.

There are other methods for feature selection such as forward selection and backward selection, but I won’t discuss those here.

Moving From Statistical Metrics to Financial Metrics

In a business context, statistical metrics like f1-score are important, but less important than financial metrics like expected revenue and expected profit. Let’s assume that every customer costs $1 to market to and if they convert, it generates a profit of $10.
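Under these assumptions, the expected profit for a given model and cutoff comes down to a simple expression:

profit = $10 × (true positives) − $1 × (true positives + false positives)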

Here, I choose a low, medium and high lambda and vary the cutoff probability. We can now calculate an expected profit for every version of the model and pick the best:

[Figure: expected profit as a function of cutoff probability for low, medium and high lambda]

It looks like a cutoff of about 8% gives the highest profit for each value of lambda. The high-lambda model can generate the highest profit in this case, but the result seems unstable. It would be a good idea to perform k-fold cross-validation on this particular model and test it on other random subsets of the data to ensure good performance. In general, the results for the low- and medium-lambda models look relatively stable and repeatable.

Other Methods

There are a handful of other methods that are good for lead scoring classification.

Linear and quadratic discriminant analysis (LDA and QDA) fit multivariate Gaussian distributions to each class of the response variable. LDA assumes the same covariance in each class and as a result has only linear decision boundaries. QDA fits a separate covariance within each class and thus allows for more complex decision boundaries. QDA may be better if you have more data or fewer predictors, while LDA may be better with less data or more predictors.

K-nearest neighbors (KNN) is a non-parametric approach, meaning it makes no assumptions about the form of the relationship between predictors and response. For any given instance, KNN looks at the nearest K labeled instances and predicts the majority class. The only hyperparameter to fit is K: lower K is more flexible, higher K is less flexible.

Support vector machines (SVM) can be very good if the relationship between predictors and response is non-linear. The SVM is similar to a linear classifier, but uses a function called a kernel and the so-called ‘kernel trick’ to measure similarity between instances in a higher-dimensional space. SVM is worth a try since it can find important non-linear relationships. However, as with many sophisticated modeling strategies, model intelligibility, meaning the ability to understand the significance of different factors, is reduced.

More Sophisticated Models

The difficulty of implementing more sophisticated models like neural networks and gradient boosting has decreased. For some classes of data, such as images, audio and text, neural networks offer much superior performance.

In business-oriented contexts, simple models can be better. You will need to explain the model and get buy-in. Simple models tend to be easier to implement and have simpler failure modes. They can also be important for inference, which means reasoning about the relationship between inputs and outputs. For example, if you notice that a certain zip code has customers that are highly likely to convert, perhaps you decide to invest in more customer acquisition in that area.

An additional consideration is that the distributions of customers and customer preferences change over time. The distribution of voice and speech data remains nearly fixed, allowing for the use of exquisitely fit neural network models. In a business context where customers evolve over time, simple, robust, low-variance models may provide better performance.

Being Aware of Data Sourcing

There are many biases and sources of error to avoid. One key source of bias I will call the ‘gathered data bias’. This means the source of your training data is not the same as the source of data the model will be applied to. For example, in marketing, you may have good conversion data on a set of customers that the marketing team selected. However, the marketing team may have selected based on who they thought would convert, rather than being a random sample.

If you apply the trained model to the general population, you may not get the results you expect. For instance, say marketing selected only customers in one income band. It is then impossible to estimate the impact of customer income, so the true significance of this variable will not be discovered, and the model may not perform well on customers in other income bands. This highlights the importance of investing in data: to train a universal model, you will have to gather universal data, which means marketing to a truly random sample of the population.

References & Mathematical Detail

For mathematical details, there are excellent treatments available free.

For an introduction, Andrew Ng’s Coursera course ‘Machine Learning’ is a good resource:

https://www.coursera.org/learn/machine-learning/

The PDF version of ‘Introduction to Statistical Learning’ is available free from the authors and is highly recommended. Much of this post is based around this book.

http://www-bcf.usc.edu/~gareth/ISL/

‘Elements of Statistical Learning’ is a much more rigorous treatment.

https://web.stanford.edu/~hastie/ElemStatLearn/

Data Science From Scratch

I visit Quora regularly and am always surprised by the number of people asking how to become a data scientist. It’s a fascinating field, and one I was able to (mostly) “bootstrap” into, out of a quantitative PhD (bioengineering). This is my simple guide on how to become a data scientist.

Get in touch if you have comments about this guide, or if you have questions about data science or careers.

Understand Programming

For true beginners, tutorials like W3Schools will get you off the ground.

For practical skills, Cracking the Coding Interview is a highly regarded guide. I’d also recommend coding challenges like leetcode and codewars.

It’s useful to know about data structures and algorithms. I learned this topic mostly ad hoc. CLRS has been recommended to me, though I haven’t read it myself.

For background on the history and culture of programming, Paul Ford’s book-length essay What is Code? is truly indispensable. It’s the best essay I have ever read on programming and maybe the best ever written.

Python and SQL are the most important languages for data science.

Understand Machine Learning

The absolute best place to begin is Andrew Ng’s Machine Learning Course. Regression, classification, gradient descent, supervised, unsupervised. Even a brief piece on neural networks. Essential knowledge, good presentation, good depth, good assignments.

“Introduction to Statistical Learning” and “Elements of Statistical Learning” are great resources, expanding on Ng in some ways. ESL goes into considerable depth, and the PDFs are available online from the authors.

Machine learning has moved on and it’s good to be familiar with the new techniques, namely XGBoost and deep learning.

For XGBoost, reading through the documentation is a good place to start. There’s also a scholarly paper about it.

For neural networks, I recommend Stanford 231n Winter 2016. I think there is a newer version up, but I have not studied it yet.

Understand Statistics

In some ways, this may be the least important piece of it, at least for starting out. It’s rarer to get an interviewer who really understands statistics. Still, I very much recommend understanding statistics, and I value the statistics I have learned. In the long run, it will help you justify and be confident in your results.

The point of entry here is Brian Caffo’s Biostatistics Boot Camp. It’s a little bit dry, but I truly valued the precision and rigor in this course. There is also a part 2.

I read the textbook “Statistical Rethinking” on Bayesian modeling and thought it was excellent. There’s also a free PDF online from the author of “Doing Bayesian Data Analysis”, which has been recommended to me.

“Mathematical Statistics and Data Analysis” is an excellent, comprehensive reference, but maybe not worth reading all the way through for most.

Other areas of statistics I want to learn more about: information theory, latent variable modeling, factor analysis, linear modeling.

Keeping Up

That’s it. If you understand programming, machine learning and statistics, you are well on your way to landing a data science job.

It’s important to stay on top of the game. Podcasts, meetups and forums are good for this. For podcasts, the Google Cloud podcast is good, and Data Skeptic is also not bad. The best forum I know of is probably Hacker News.

It’s also important to explore your own interests. I’ve been finding myself drawn more towards the ETL/data engineering/infrastructure end of things rather than pure analysis.

Deep Neural Networks: CS231n & Transfer Learning

Deep learning (deep neural networks) has become a very powerful technique for dealing with very high-dimensional data such as images, audio, and video. As one example, automated image classification, the task of assigning an image to one of a fixed number of classes, has become highly effective.

Look at the results of the ImageNet Large Scale Visual Recognition Challenge, an annual challenge for image classification, recognition and detection algorithms over images from 1,000 categories. The top-5 classification error rate has decreased from 28% in 2010 to 26% in 2011, 15% in 2012, then 11%, 7%, 3.5%, and about 3% this year. Other tasks continue to improve with the use of neural networks, especially convolutional and recurrent neural networks.

The number of resources for learning about neural networks has also multiplied dramatically. One resource I can recommend firsthand is CS231n, the Stanford course on image recognition. The course syllabus, including slides, lecture notes and Jupyter notebook assignments, is available online, and the lectures are available on YouTube. I’m up to lecture 5 and I can highly recommend both the slides and the YouTube lectures. There is really excellent theoretical background on topics like the history of image processing, loss functions and backpropagation, but also practical advice on weight initialization, transfer learning, learning rate, regularization and more.

Another resource that has been linked recently on Hacker News is the Yerevann guide to deep learning, which seems to be a good, deep source of information.

 

The folks at TensorFlow just keep improving their offerings. I recently implemented their transfer learning using Inception v3 with 8 categories I chose from ImageNet. It was surprisingly easy to train my own classifier, and it was quite effective (top-1 error rate <10%)! The only real issue I had was some bazel errors, which were resolved by upgrading my version of bazel. Training on roughly 8,000 images took about 12 hours on a MacBook Pro using CPU only; to be more specific, the bottleneck phase took about 12 hours, while the actual training took about 20 minutes. Using ~1,000 images per category is probably not necessary for an effective classifier, so you could likely cut down on this time dramatically.

 

Collecting Neighborhood Data

Since deploying my latest web application, a Los Angeles Neighborhood Ranker, I’ve wanted to explain the process of gathering the data. The first step was to decide which neighborhoods to use, what they’re called, and how they’re defined, geographically.

The first stop I made was the LA Times Mapping LA project, which has maps for many Los Angeles neighborhoods. Each neighborhood has an individual page where it is plotted on a map. By inspecting the page’s source code, you can find the coordinates of the neighborhood’s boundary:

(From http://maps.latimes.com/neighborhoods/neighborhood/santa-monica/)

"geometry": { "type": "MultiPolygon", "coordinates": 
[ [ [ [ -118.483981, 34.041635 ], [ -118.483766, 34.041430 ]...

This information can be collected and stored for further use via web scraping, using a Python library like urllib2 to open the page and a regular expression to find the coordinates. The following shows the essential steps:

import urllib2
import re

### Code here to define fullurl
webpage = urllib2.urlopen(fullurl)
for line in webpage:
  d = re.search(r"geometry.+(\[ \[ \[ \[.+\] \] \] \])", line)
  if d:
    boundary = d.group(1)  ### store d.group(1), e.g. write it to a text file

Storing these coordinates as text allows them to be used in mapping and data gathering. Mapping is accomplished with leaflet.js and Mapbox, which I can describe later. For now I will talk about how the neighborhood coordinates help in the data collection process.

For the neighborhood ranker, I needed neighborhood-level data in categories like house price. Unfortunately, the existing real estate APIs (Zillow, e.g.) don’t support neighborhood-level data, nor would they be likely to use exactly the same neighborhood definitions.

The Zillow API, which I used for real estate data, only supports single house queries based on street address. So, how to go from single Zillow queries to average house price in a neighborhood? A first natural instinct may be to collect information on every house in a neighborhood and compute the average.

However, perhaps it isn’t necessary to get every value. Consider surveys like the census and employment situation surveys. These gather important information about populations without getting information from every member of the population. Instead they use knowledge of statistics and error to estimate the values and quantify uncertainty.

Thus, getting every house in the neighborhood is unnecessary. But how many are necessary? We can estimate the standard error of the mean using sample information:

SE_mean = s / sqrt(n)

Where s is the sample standard deviation and n is the number of samples.

Thus, if the distribution of prices for each neighborhood is relatively tight, you wouldn’t need many samples at all to get a good estimate of population mean.

In general, what I found is that only about 5 samples were needed per neighborhood to get a good estimate, with an estimated error of about 10-20% of the sample average. This is sufficient to rank the neighborhoods, since average prices vary from about $300,000 to $5,000,000 across neighborhoods, a range of roughly 1,500%.
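A minimal sketch of the estimate for a single neighborhood, with hypothetical prices:

import numpy as np

# e.g. five Zillow price estimates sampled from one neighborhood (values hypothetical)
prices = np.array([850000, 920000, 780000, 1010000, 880000], dtype=float)

mean = prices.mean()
se = prices.std(ddof=1) / np.sqrt(len(prices))   # SE_mean = s / sqrt(n)
print(mean, se, se / mean)                        # standard error relative to the mean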

But how do you generate random samples within neighborhoods? Zillow doesn’t support it. One thing that may help is to first generate a random coordinate-pair within the neighborhood. I use the following:

import random
from shapely.geometry import Point

def get_random_point_in_polygon(poly):
  # Rejection sampling: draw from the bounding box until a point falls inside the polygon
  (minx, miny, maxx, maxy) = poly.bounds
  while True:
    p = Point(random.uniform(minx, maxx), random.uniform(miny, maxy))
    if poly.contains(p):
      return p

The poly argument is a Polygon object from the Shapely library, which defines the neighborhood boundary. The function returns a randomly generated Point object inside the Polygon. We can plot the neighborhood and look at the coverage we get with, say, 50 points.

from shapely.geometry import Polygon
import matplotlib.pyplot as plt

xy = coords['santa-monica']        # the coordinate lists scraped earlier

xylist = zip(*xy)                  # pair the coordinates into (x, y) tuples
p = Polygon(xylist)
points = [get_random_point_in_polygon(p) for i in range(50)]

ax = plt.subplot()
ax.plot(*xy)                       # neighborhood boundary
for point in points:
  ax.plot(point.x, point.y, marker='o')

plt.show()

This is the result:


50 random points in the Santa Monica neighborhood.

There is one more step: Zillow accepts API queries using addresses, not coordinates. You can get an address from coordinates by using a ‘geocoding’ API; I used Google’s, but others are available. Using the address, the Zillow API can be called and the price information extracted. A similar method can be used with the Yelp API for restaurant density.
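A minimal sketch of the reverse-geocoding step with the Google Geocoding API, in the same urllib2 style as above (GOOGLE_API_KEY is assumed to be defined elsewhere):

import json
import urllib2

def reverse_geocode(lat, lng):
  # Ask the Google Geocoding API for the street address nearest to (lat, lng)
  url = ('https://maps.googleapis.com/maps/api/geocode/json'
         '?latlng=%f,%f&key=%s' % (lat, lng, GOOGLE_API_KEY))
  response = json.load(urllib2.urlopen(url))
  results = response.get('results', [])
  return results[0]['formatted_address'] if results else None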

On a technical note, this method may not give a completely random sample of houses in a neighborhood. A random point will be assigned to the nearest address during the geocoding step, so addresses in high-density areas are less likely to be chosen. If price correlates with density within a neighborhood, this could yield biased samples.