Analyzing resumes with data science

Ever wonder if there is a “secret code” for resumes, some key words that will actually make you stand out? It turns out there are indeed some very characteristic differences between experienced and novice resumes.

Gathering Resume Data

Resumes gathered from an Indeed resume search were used to analyze the differences between experienced and inexperienced resumes. For maximum contrast, only resumes with “Less than 1 year” or “More than 10 years” of experience were included. To cover the full range of job titles, I searched for every job title included in the Bureau of Labor Statistics Occupational Outlook Handbook and downloaded the text of 50 resumes for each job title at each experience level.

For a copy of the data used in this analysis, contact me.

Cleaning Resume Text

The full code I used for this analysis can be found in my notebook on GitHub.

The first step is basic text extraction and cleaning. I saved my data in a JSON file, so we must parse that first. All punctuation was removed. For some textual analyses we want to preserve sentence structure; this is the case for techniques that use word order, such as RNNs and skip-gram/word2vec/doc2vec. In this case, however, we are only looking at word and document frequency, so word order and sentence demarcation don’t matter.
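As a minimal sketch of this cleaning step (assuming the scraped resumes were saved as a JSON list of records with a `text` field; the exact field names are hypothetical), it might look like:

```python
import re

def clean_resume(text):
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()

# Assuming the scraped data is a JSON list of records like
# {"title": ..., "experience": ..., "text": ...}:
#
#   import json
#   with open("resumes.json") as f:
#       records = json.load(f)
#   corpus = [clean_resume(r["text"]) for r in records]

print(clean_resume("Managed  a team; shipped code!"))  # "managed a team shipped code"
```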

There was a string that was inserted by Indeed into a lot of resumes:

see more occupations related to this (activity|skill|task)

I removed this string as well. Additionally, if a resume was an exact duplicate of another after processing, that probably indicates a duplicate posting, so duplicates were counted only once.
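Both steps, removing the injected string and dropping exact duplicates, can be sketched together:

```python
import re

BOILERPLATE = re.compile(
    r"see more occupations related to this (activity|skill|task)"
)

def strip_boilerplate(docs):
    """Remove the injected Indeed string, then drop exact duplicates,
    keeping only the first occurrence of each document."""
    seen = set()
    cleaned = []
    for doc in docs:
        doc = BOILERPLATE.sub(" ", doc).strip()
        if doc not in seen:
            seen.add(doc)
            cleaned.append(doc)
    return cleaned

docs = [
    "managed projects see more occupations related to this skill",
    "managed projects see more occupations related to this task",
    "filed reports",
]
print(strip_boilerplate(docs))  # ['managed projects', 'filed reports']
```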

Analyzing Resume Text

After processing and cleaning the resume text, we are left with a corpus of documents, each a string of lowercase words. We can now move on to analyzing these resumes.

There are some fairly simple things we could do to analyze word differences between different types of resumes. For example, we could count all the words in the corpus and look at how many times each word appears. We could then break it into each sub-corpus (high and low experience) and look for the largest differences. This might yield some interesting results. It may be the case that there is some difference, whether real or coincidental (i.e. a genuine population difference or a sampling artifact), in the occurrence of common words like ‘the’ or ‘is’. These differences will appear largest when we look only at raw term frequency.
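A raw term-frequency comparison along these lines might look like the following (the documents here are toy examples, not real resume text):

```python
from collections import Counter

def count_words(docs):
    """Raw term frequency over whitespace-tokenized documents."""
    counts = Counter()
    for doc in docs:
        counts.update(doc.split())
    return counts

experienced = ["managed the team", "directed the project"]
inexperienced = ["assisted the team", "helped with the project"]

exp_counts = count_words(experienced)
inexp_counts = count_words(inexperienced)

# Raw-count gaps are easily dominated by common words like 'the'.
diffs = {w: exp_counts[w] - inexp_counts[w]
         for w in set(exp_counts) | set(inexp_counts)}
print(sorted(diffs, key=diffs.get, reverse=True)[:3])
```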

We are probably more interested in the differences when it comes to words that are distinctive. For this we can use tfidf to vectorize our documents, where rare words are weighted more highly. It’s worth pointing out that you can decide how many words to include in the document vectors. Too many and calculations slow down and you will be swamped with possibly spurious results. Too few and you will not see any interesting results. I decided to limit it to 8,000 words on a corpus of about 2,000 total documents.
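With scikit-learn, capping the vocabulary is one keyword argument; this is a sketch (the notebook may differ in its exact settings), again with toy documents:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "assisted with cleaning and organizing tasks",
    "coordinated maintenance and managed client projects",
]

# The analysis capped the vocabulary at 8,000 words on ~2,000 documents.
vectorizer = TfidfVectorizer(max_features=8000)
tfidf = vectorizer.fit_transform(docs)  # sparse (n_docs, n_terms) matrix
print(tfidf.shape)
```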

After vectorizing the documents, we can average the tfidf score for each word across each sub-corpus and take the difference in average score (e.g. experienced minus inexperienced). If we order the words in order of score difference (ascending), we will get the words in order of distinctively inexperienced to distinctively experienced.

For example, these are the top five most experienced and inexperienced words:


Getting Automatic Word Recommendations Using Word Vectors

Once we have the distinctively inexperienced words, we would like to know what experienced words to use instead. We could inspect each inexperienced word and choose corresponding experienced words manually. We can also automate this process with word vectors.

We previously used tfidf scores to generate document vectors, i.e. a numerical representation of a document. Word vectors are a numerical representation of an individual word. Using word vectors, we can automatically identify words with similar meanings. Pre-trained word vectors (GloVe), trained on corpora such as Twitter or Wikipedia, are available to download from Stanford’s website.
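As a toy illustration of the idea, here is a cosine-similarity lookup over hand-made 3-d vectors; in the real analysis these would be the pre-trained GloVe embeddings loaded from disk, and the numbers below are invented purely for the example:

```python
import math

# Hypothetical 3-d vectors standing in for pre-trained GloVe embeddings.
vectors = {
    "excellent":   [0.9, 0.1, 0.0],
    "quality":     [0.8, 0.2, 0.1],
    "cleaning":    [0.1, 0.9, 0.2],
    "maintenance": [0.2, 0.8, 0.3],
    "the":         [0.0, 0.1, 0.9],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def nearest_experienced(word, experienced_words):
    """Map an inexperienced word to its most similar experienced word."""
    return max(experienced_words, key=lambda w: cosine(vectors[word], vectors[w]))

print(nearest_experienced("excellent", ["quality", "maintenance", "the"]))  # quality
print(nearest_experienced("cleaning", ["quality", "maintenance", "the"]))   # maintenance
```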

After loading these word vectors, we can map each inexperienced word to the most similar experienced words. Here is a sampling of the results, which you can see in full in the notebook. Below, the words to the left of the arrow are inexperienced, and those to the right are experienced.


I really like “excellent” -> “quality”

Curated Resume Recommendations

Not all of the recommendations make total sense. This is a typical result for machine-aided natural language processing (NLP) tasks. The machines get us close, but often we need a little bit of manual curation to get the best recommendations. I manually selected what I thought were some of the best recommendations from the list.

  • “assisted with” -> “provided support for”
  • “helping” or “assisting” -> “supporting”
  • “problem” -> “need”
  • “good” or “great” -> “quality”
  • “organize” -> “coordinate”
  • “customer” -> “client”
  • “task” -> “projects”
  • “academic” -> “technical”
  • “communication” -> “management”
  • “cleaning” -> “maintenance”


So what are the conclusions that we can draw from this? Are there any overall trends with these words?

It seems that experienced words are more general in meaning. Academic becomes technical, assisted becomes supported, excellent becomes quality. There is a lack of specificity in these terms. As you gain experience in your career and move up, you have to be able to talk about your work in more general terms. Engineers talk to other engineers mostly, but engineering managers must talk to accountants, account managers, executives and so forth. The trend of more general resume words mirrors this career arc.

Beyond being more general in meaning, the experienced words also tend to carry fewer connotations. While ‘cleaning’ is not inherently any less important than ‘maintenance’, it is associated with lower-status jobs, while ‘maintenance’ is neutral in its connotations.

Finally, some of it could just be jargon. There are trends and fashion in business language. Being able to use the right jargon is a way of showing you pay attention and can keep up with the latest trends. In fact, if more inexperienced people start to move in and co-opt the experienced words, the jargon will definitely shift.

Overfitting and Supervised Learning

I want to make one final note about the possibility of “overfitting” in this analysis. I did not test the generalizability of these trends by splitting the data into train and test sets and building an actual model such as logistic regression, so this analysis is not as definitive as it could be. Unfortunately, I am somewhat limited in the amount of training data available, and the dimensionality of the document vectors is high. Potential approaches here include clustering or dimensionality reduction. I may revisit this data set in the future.

