What algorithms should I use to perform job classification based on resume data?



Note that I am doing everything in R.

The problem is as follows:

Basically, I have a list of resumes (CVs). Some candidates have prior work experience and some don't. The goal is to classify candidates into job sectors based on the text of their CVs. I am particularly interested in candidates who have no experience or are still students: for these, I want to predict which job sector they will most likely belong to after graduation.

Question 1: I know machine learning algorithms. However, I have never done NLP before. I came across Latent Dirichlet allocation on the internet. However, I am not sure if this is the best approach to tackle my problem.

My original idea: make this a supervised learning problem. Suppose we already have a large amount of labelled data, meaning that we have correctly labelled the job sectors for a list of candidates. We train a model using ML algorithms (e.g. nearest neighbor...) and then feed in the unlabelled data, i.e. candidates who have no work experience or are students, and try to predict which job sector they will belong to.

Update Question 2: Would it be a good idea to extract everything in a resume into a text file, so that each resume is associated with one text file containing unstructured strings, and then apply text mining techniques to those files to structure the data, or even to create a frequency matrix of the terms used? For example, the text file may look something like this:

I deployed ML algorithm in this project and... Skills: Java, Python, c++ ...

This is what I meant by 'unstructured', i.e. collapsing everything into a single-line string.

Is this approach wrong? Please correct me if you think it is.

Question 3: The tricky part is how to identify and extract the keywords. Should I use the tm package in R? What algorithm is the tm package based on? Should I use NLP algorithms? If yes, which algorithms should I look at? Please point me to some good resources as well.

Any ideas would be great.


Posted 2014-07-03T16:11:22.637

Reputation: 411



Check out this link.

Here, they will take you through loading unstructured text to creating a wordcloud. You can adapt this strategy and instead of creating a wordcloud, you can create a frequency matrix of terms used. The idea is to take the unstructured text and structure it somehow. You change everything to lowercase (or uppercase), remove stop words, and find frequent terms for each job function, via Document Term Matrices. You also have the option of stemming the words. If you stem words you will be able to detect different forms of words as the same word. For example, 'programmed' and 'programming' could be stemmed to 'program'. You can possibly add the occurrence of these frequent terms as a weighted feature in your ML model training.

You can also adapt this to frequent phrases, finding common groups of 2-3 words for each job function.
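The frequent-phrase idea can be sketched in base R without any extra packages. This is only an illustration under assumptions: the texts and labels below are hypothetical and are assumed to be already lower-cased and cleaned, and `ngrams` and `bigramsByJob` are names made up for this example.

```r
# Hypothetical cleaned texts and their job labels
texts <- c("java programming bug tracking systems",
           "test procedures relational databases",
           "web applications database design java programming")
jobs <- c("Software Engineer", "Quality Assurance", "Software Engineer")

# Return all runs of n consecutive words in one string
ngrams <- function(text, n = 2) {
  words <- strsplit(text, " +")[[1]]
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(starts, function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Tabulate 2-word phrase counts separately for each job label
bigramsByJob <- lapply(split(texts, jobs), function(docs) {
  sort(table(unlist(lapply(docs, ngrams, n = 2))), decreasing = TRUE)
})

print(bigramsByJob[["Software Engineer"]])
```

Setting `n = 3` gives 3-word phrases. With real data you would probably want to drop phrases that occur only once before using them as features.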


1) Load libraries and build the example data


library(tm)         # Corpus, VectorSource, DocumentTermMatrix, stopwords
library(SnowballC)  # wordStem

doc1 = "I am highly skilled in Java Programming.  I have spent 5 years developing bug-tracking systems and creating data managing system applications in C."
job1 = "Software Engineer"
doc2 = "Tested new software releases for major program enhancements.  Designed and executed test procedures and worked with relational databases.  I helped organize and lead meetings and work independently and in a group setting."
job2 = "Quality Assurance"
doc3 = "Developed large and complex web applications for client service center. Lead projects for upcoming releases and interact with consumers.  Perform database design and debugging of current releases."
job3 = "Software Engineer"
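# stringsAsFactors = FALSE keeps the text as character (matters in R < 4.0)
jobInfo = data.frame("text" = c(doc1,doc2,doc3),
                     "job" = c(job1,job2,job3),
                     stringsAsFactors = FALSE)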

2) Now we do some text structuring. I am positive there are quicker/shorter ways to do the following.

# Convert to lowercase
jobInfo$text = sapply(jobInfo$text,tolower)

# Remove Punctuation
jobInfo$text = sapply(jobInfo$text,function(x) gsub("[[:punct:]]"," ",x))

# Remove extra white space
jobInfo$text = sapply(jobInfo$text,function(x) gsub("[ ]+"," ",x))

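# Remove stop words
# (logical indexing instead of setdiff(), which would also drop repeated
#  words and so flatten the term counts)
jobInfo$text = sapply(jobInfo$text, function(x){
  words = strsplit(x," ")[[1]]
  paste(words[!words %in% stopwords()],collapse=" ")
})

# Stem words (Also try without stemming?)
jobInfo$text = sapply(jobInfo$text, function(x)  {
  stems = wordStem(strsplit(x," ")[[1]])
  paste(stems[stems != ""],collapse=" ")
})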

3) Make a corpus source and document term matrix.

# Create Corpus Source
jobCorpus = Corpus(VectorSource(jobInfo$text))

# Create Document Term Matrix
jobDTM = DocumentTermMatrix(jobCorpus)

# Create Term Frequency Matrix
jobFreq = as.matrix(jobDTM)

Now we have the frequency matrix, jobFreq, which is a 3 by X matrix: 3 rows (one per document) and X columns (one per distinct term).

Where you go from here is up to you. You can keep only specific (more common) words and use them as features in your model. Another way is to keep it simple and have a percentage of words used in each job description, say "java" would have 80% occurrence in 'software engineer' and only 50% occurrence in 'quality assurance'.
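As a rough sketch of the "percentage of occurrence per job" idea, the following base R snippet works on a small hand-made matrix. The matrix `freq`, the labels in `job` and the 0.5 cut-off are assumptions for illustration; in practice `freq` would be something like jobFreq from step 3.

```r
# Hypothetical stand-in for jobFreq: rows are documents, columns are terms
freq <- matrix(c(2, 0, 1,
                 0, 3, 0,
                 1, 1, 0,
                 1, 0, 2),
               nrow = 4, byrow = TRUE,
               dimnames = list(NULL, c("java", "test", "databas")))
job <- c("Software Engineer", "Quality Assurance",
         "Quality Assurance", "Software Engineer")

# A term "occurs" in a document if its count is at least 1
present <- freq > 0

# Share of documents in each job that contain each term (jobs x terms)
occurrence <- apply(present, 2, function(col) tapply(col, job, mean))

# Keep only terms that occur in at least half of the documents of some job
commonTerms <- colnames(occurrence)[apply(occurrence, 2, max) >= 0.5]

print(round(occurrence, 2))
```

Each row of `occurrence` is then a per-job profile, and `commonTerms` is one possible way to pick the columns to keep as model features.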

Now it's time to go look up why 'assurance' has 1 'r' and 'occurrence' has 2 'r's.



I would love to see your example. – user1769197 – 2014-07-03T22:03:13.677

Updated with quick example. – nfmcclure – 2014-07-03T22:52:43.353


Just extract keywords and train a classifier on them. That's all, really.

Most of the text in CVs is not actually related to skills. E.g. consider the sentence "I'm experienced and highly efficient in Java". Here only 1 out of 7 words is a skill name; the rest is just noise that is going to pull your classification accuracy down.

Most CVs are not really structured. Or they are structured too freely. Or they use unusual names for sections. Or they come in file formats that don't preserve structure when converted to text. I have experience extracting dates, times, names, addresses and even people's intents from unstructured text, but not a skill (or university or anything) list, not even close.

So just tokenize (and possibly stem) your CVs, select only words from a predefined list (you can use LinkedIn or something similar to grab this list), create a feature vector and try out a couple of classifiers (say, SVM and Naive Bayes).
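A minimal base R sketch of that pipeline, under stated assumptions: the skill list, the training texts and the labels are all hypothetical, and the classifier is a small hand-written Bernoulli Naive Bayes with add-one smoothing, not the implementation used in this answer. For real work, a package such as e1071 (which provides `naiveBayes` and `svm`) is probably the more sensible choice.

```r
# Hypothetical skill list and labelled training texts (lower-cased)
skills <- c("java", "python", "sql", "selenium", "testing")
trainText <- c("java and sql developer", "python java backend",
               "manual testing with selenium", "testing sql reports selenium")
trainJob <- c("dev", "dev", "qa", "qa")

# Binary feature matrix: does each skill keyword appear as a token?
toFeatures <- function(texts, vocab) {
  t(vapply(texts, function(x) {
    tokens <- strsplit(x, "[^a-z0-9+#]+")[[1]]
    as.numeric(vocab %in% tokens)
  }, numeric(length(vocab)), USE.NAMES = FALSE))
}

# Estimate class priors and P(keyword present | class)
trainNB <- function(X, y) {
  classes <- sort(unique(y))
  prior <- vapply(classes, function(k) mean(y == k), numeric(1))
  # Add-one (Laplace) smoothing keeps probabilities away from 0 and 1
  cond <- t(vapply(classes, function(k) {
    (colSums(X[y == k, , drop = FALSE]) + 1) / (sum(y == k) + 2)
  }, numeric(ncol(X))))
  list(classes = classes, prior = prior, cond = cond)
}

# Pick the class with the highest log-posterior for each row of X
predictNB <- function(model, X) {
  scores <- X %*% t(log(model$cond)) + (1 - X) %*% t(log(1 - model$cond))
  scores <- sweep(scores, 2, log(model$prior), "+")
  model$classes[max.col(scores, ties.method = "first")]
}

model <- trainNB(toFeatures(trainText, skills), trainJob)
newCV <- "recent graduate, coursework in java and python"
print(predictNB(model, toFeatures(newCV, skills)))
```

Words outside the skill list are simply ignored, which is the point of the approach: the noise never reaches the classifier.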

(Note: I used a similar approach to classify LinkedIn profiles into more than 50 classes with accuracy > 90%, so I'm pretty sure even a naive implementation will work well.)



@ffriend What is the best way to extract "experience" = '5 years' , "Language" = 'C' from the following sentence. "I have spent 5 years developing bug-tracking systems and creating data managing system applications in C". I used Rake with NLTK and it just removed the stopword + punctuations, but from the above sentence i don't need words like developing, bug-tracking, systems, creating, data etc. Thanks – Khalid Usman – 2016-10-04T13:27:42.153


@KhalidUsman: since you already work with NLTK, take a look at named entity recognition tools, especially the "Chunking with Regular Expressions" section. In general, you would want to use a dictionary of keywords (e.g. "years", "C", etc.) and a simple set of rules (like "contains 'C'" or "<number> years") to extract named entities out of free-form text. – ffriend – 2016-10-04T18:41:32.290
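Since the question is about R, here is a hedged base R sketch of the dictionary-plus-rules idea from the comment above, applied to the example sentence. The `languages` dictionary is a hypothetical, hand-made list, and the two rules are only the ones named in the comment.

```r
sentence <- "I have spent 5 years developing bug-tracking systems and creating data managing system applications in C"

# Rule "<number> years": capture the number that directly precedes "year(s)"
yearsMatch <- regmatches(
  sentence,
  regexpr("[0-9]+(?=\\s+years?\\b)", sentence, perl = TRUE)
)
experienceYears <- as.integer(yearsMatch)

# Dictionary rule: keep tokens that appear in a hypothetical language list.
# Matching is case-sensitive so that the stop word "in" or "r" is not picked up.
languages <- c("C", "C++", "Java", "Python", "R")
tokens <- strsplit(sentence, "[^A-Za-z0-9+#]+")[[1]]
foundLanguages <- intersect(languages, tokens)

print(experienceYears)
print(foundLanguages)
```

If the sentence contains no "<number> years" pattern, `experienceYears` comes back as an empty integer vector, so callers should check its length.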

Say I am analyzing linkedin data, do you think it would be a good idea for me to merge the previous work experience, educations recommendations and skills of one profile into one text file and extract keywords from it ? – user1769197 – 2014-07-05T14:46:08.110

LinkedIn now has skill tags that people assign themselves and other users can endorse, so basically there's no need to extract keywords manually. But in case of less structured data - yes, it may be helpful to merge everything and then retrieve keywords. However, remember main rule: try it out. Theory is good, but only practical experiments with different approaches will reveal best one. – ffriend – 2014-07-05T20:33:05.633

@ffriend thank you for your post. I am doing a thesis about resume recognition, but only in the IT industry. I have around 50 CVs right now and some scripts to tokenize them, lemmatize, and remove stop-words. So I end up with an array of around 50 to 100 strings from each resume. I would like to classify them by position, like frontend, java, data scientist, and maybe junior/mid/senior. How can I achieve this? I am very new to Python and looking for around 2-3 algorithms for this, but I want to use real ones, not just write a function to search for "java" inside an array... – jake-ferguson – 2019-11-01T19:34:00.417

@ffriend, How do we get that keyword list ? – NG_21 – 2016-01-28T09:53:32.480

@NG_21 the easiest way is just to collect them manually. If you have too many jobs to handle yourself, you can use a crowdsourcing platform like Amazon MTurk to do it quickly and accurately. This is not a very technical solution, but it will cost much less than implementing an algorithm for keyword extraction. – ffriend – 2016-01-28T10:24:39.217


This is a tricky problem. There are many ways to handle it. I guess, resumes can be treated as semi-structured documents. Sometimes, it's beneficial to have some minimal structure in the documents. I believe, in resumes you would see some tabular data. You might want to treat these as attribute value pairs. For example, you would get a list of terms for the attribute "Skill set".

The key idea is to manually configure a list of key phrases such as "skill", "education", "publication" etc. The next step is to extract terms which pertain to these key phrases either by exploiting the structure in some way (such as tables) or by utilizing the proximity of terms around these key phrases, e.g. the fact that the word "Java" is in close proximity to the term "skill" might indicate that the person is skilled in Java.

After you extract this information, the next step could be to build up a feature vector for each of these key phrases. You can then represent a document as a vector with different fields (one for each key phrase). For example, consider the following two resumes represented with two fields, namely project and education.

Doc1: {project: (java, 3) (c, 4)}, {education: (computer, 2), (physics, 1)}

Doc2: {project: (java, 3) (python, 2)}, {education: (maths, 3), (computer, 2)}

In the above example, I show each term with its frequency. Of course, while extracting the terms you need to stem and remove stop-words. It is clear from the examples that the person whose resume is Doc1 is more skilled in C than the person behind Doc2. Implementation-wise, it's very easy to represent documents as field vectors in Lucene.

Now, the next step is to retrieve a ranked list of resumes given a job specification. In fact, that's fairly straightforward if you represent queries (job specs) as field vectors as well. You just need to retrieve a ranked list of candidates (resumes) using Lucene from a collection of indexed resumes.
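For readers working in R rather than Lucene, the field-vector representation and the ranking step can be sketched in base R. This is not Lucene's scoring: it uses plain cosine similarity per field and averages the fields with equal weight, both of which are assumptions made for illustration. The job specification `jobSpec` is hypothetical; the two resumes are Doc1 and Doc2 from above.

```r
# Field vectors for the two example resumes, stored as named count vectors
resumes <- list(
  Doc1 = list(project = c(java = 3, c = 4),
              education = c(computer = 2, physics = 1)),
  Doc2 = list(project = c(java = 3, python = 2),
              education = c(maths = 3, computer = 2))
)

# Hypothetical job specification expressed with the same fields
jobSpec <- list(project = c(c = 1, java = 1), education = c(computer = 1))

# Expand a sparse named vector onto a shared list of terms
align <- function(v, terms) {
  out <- setNames(numeric(length(terms)), terms)
  if (length(v) > 0) out[names(v)] <- v
  out
}

# Cosine similarity between two sparse named vectors
cosine <- function(a, b) {
  terms <- union(names(a), names(b))
  va <- align(a, terms)
  vb <- align(b, terms)
  denom <- sqrt(sum(va^2)) * sqrt(sum(vb^2))
  if (denom == 0) 0 else sum(va * vb) / denom
}

# Resume score: mean of the per-field similarities (equal field weights)
scoreResume <- function(resume, spec) {
  mean(vapply(names(spec),
              function(f) cosine(resume[[f]], spec[[f]]),
              numeric(1)))
}

scores <- vapply(resumes, scoreResume, numeric(1), spec = jobSpec)
ranking <- names(sort(scores, decreasing = TRUE))
print(ranking)
```

Weighting the fields differently (for example, project above education) only requires replacing `mean` with a weighted mean.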



@Debasis What is the best way to extract "experience" = '5 years' , "Language" = 'C' from the following sentence. "I have spent 5 years developing bug-tracking systems and creating data managing system applications in C". I used Rake with NLTK and it just removed the stopword + punctuations, but from the above sentence i don't need words like developing, bug-tracking, systems, creating, data etc. Thanks – Khalid Usman – 2016-10-04T13:26:58.707

Algorithm-wise: what would you recommend ? – user1769197 – 2014-07-03T21:59:30.367

you mean algorithm for computing the most similar resume vectors given a query job vector? you can use any standard algorithm such as BM25 or Language Model... – Debasis – 2014-07-04T11:35:38.297

I have never heard of these algorithms at all. Are these NLP algorithms or ML algo ? – user1769197 – 2014-07-04T13:32:43.553

these are standard retrieval models... a retrieval model defines how to compute the similarity between a document (resume in your case) and a query (job in your case). – Debasis – 2014-07-04T15:23:49.700
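To make "retrieval model" concrete, here is a small base R sketch of BM25 scoring. The tokenised resumes, the query, and the parameter values `k1 = 1.2` and `b = 0.75` are assumptions (common defaults, not values from this thread), and the IDF uses the non-negative `log(1 + ...)` variant. Search engines such as Lucene implement this far more efficiently.

```r
# Hypothetical tokenised resumes and a job query
docs <- list(
  r1 = c("java", "java", "sql", "web"),
  r2 = c("testing", "selenium", "sql"),
  r3 = c("python", "statistics", "sql", "java")
)
query <- c("java", "web")

bm25 <- function(docs, query, k1 = 1.2, b = 0.75) {
  N <- length(docs)
  avgLen <- mean(lengths(docs))
  vapply(docs, function(d) {
    len <- length(d)
    sum(vapply(unique(query), function(q) {
      # Number of documents that contain the query term
      n <- sum(vapply(docs, function(x) q %in% x, logical(1)))
      idf <- log((N - n + 0.5) / (n + 0.5) + 1)
      # Term frequency of q in this document
      f <- sum(d == q)
      # Saturating tf component with document-length normalisation
      idf * f * (k1 + 1) / (f + k1 * (1 - b + b * len / avgLen))
    }, numeric(1)))
  }, numeric(1))
}

scores <- bm25(docs, query)
print(sort(scores, decreasing = TRUE))
```

A resume that shares no term with the query scores exactly 0, and repeated terms add less and less to the score as their count grows.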

I have no knowledge about information retrieval, do you think machine learning algorithms like clustering / nearest neighbour will also work in my case ? – user1769197 – 2014-07-04T16:37:18.267

yes, nearest neighbour is in fact very similar in objective to that of information retrieval... that is that of finding the most similar k points given a query point... the only problem is that when N (the number of data points) is too large, IR is more efficient – Debasis – 2014-07-05T12:00:14.867


I work for an online jobs site and we build solutions to recommend jobs based on resumes. Our approach takes a person's job title (or desired job title, if they are a student and it is known), along with skills we extract from their resume and their location (which is very important to most people), and finds matching jobs based on that.

In terms of document classification, I would take a similar approach. I would recommend computing a tf-idf matrix over the resumes as a standard bag-of-words model, extracting just the person's job title and skills (for which you will need to define a list of skills to look for), and feeding that into an ML algorithm. I would recommend trying k-NN and an SVM; the latter works very well with high-dimensional text data. Linear SVMs tend to do better than non-linear ones (e.g. using RBF kernels). Once you have that outputting reasonable results, I would then play with extracting features using a natural language parser / chunker, and also some custom-built phrases matched by regexes.
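A base R sketch of the tf-idf plus k-NN part, under assumptions: the token lists and labels are hypothetical, idf is the plain `log(N / df)` form, and neighbours are found by cosine similarity with a simple majority vote. An SVM is not shown because base R has none; the e1071 package's `svm` with `kernel = "linear"` would be one way to try it.

```r
# Hypothetical labelled resumes, reduced to job title and skill tokens
docs <- list(c("developer", "java", "sql"),
             c("developer", "python", "java"),
             c("tester", "selenium", "sql"),
             c("tester", "selenium", "testing"))
labels <- c("dev", "dev", "qa", "qa")

vocab <- sort(unique(unlist(docs)))

# Term-frequency matrix: one row per document, one column per vocab term
# (tokens outside the vocabulary are silently dropped)
tfMatrix <- function(docs, vocab) {
  t(vapply(docs,
           function(d) as.numeric(table(factor(d, levels = vocab))),
           numeric(length(vocab))))
}

tf <- tfMatrix(docs, vocab)
# Every vocab term occurs in at least one training document, so df > 0
idf <- log(nrow(tf) / colSums(tf > 0))
tfidf <- sweep(tf, 2, idf, "*")

# Majority vote among the k most cosine-similar training documents.
# Assumes newDoc shares at least one term with the vocabulary.
knnCosine <- function(newDoc, k = 3) {
  q <- as.numeric(tfMatrix(list(newDoc), vocab)) * idf
  sims <- as.numeric(tfidf %*% q) /
    (sqrt(rowSums(tfidf^2)) * sqrt(sum(q^2)))
  nearest <- order(sims, decreasing = TRUE)[seq_len(k)]
  names(which.max(table(labels[nearest])))
}

print(knnCosine(c("java", "python", "student")))
```

With only four training documents the vote is fragile; the sketch is meant to show the data flow, not to suggest a sensible value of `k`.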



@Simon Can you write the complete steps for this recommendation system? I am having little experience (implement MS thesis) in ML, but totally new in IR field. Now I am working on this system and I wrote the following steps. 1. Use NLTK to extract keywords, 2. Calculate score for keywords and phrases, 3. Stemmer , 4. Categorization (the most challenging task) and 5. Frequency matrix , tf-idf or BM25 algo. Am i on the right way of implementation? Thanks – Khalid Usman – 2016-10-04T13:17:21.723

@KhalidUsman I can't tell you exactly how it works, that may get me in trouble. The easiest solution would be to put the data into Solr or Elastic Search and use their MLT recommender implementations. A more sophisticated approach is to extract key words and phrases, push the docs through LSA, and do k-nn on the resulting vectors. Then you may wish to use other signals such as Collaborative Filtering, and overall popularity. – Simon – 2016-10-16T19:27:38.220

@Simon, thanks for your guidance. I am applying the 2nd way, I have extracted keywords/keyphrases using RAKE+NLTK and after that i was planning to apply tf-idf or BM25. Am i right? Can you please elaborate the KNN way a little bit, i mean how to apply knn on keywords, should i make keywords as a features? Thanks – Khalid Usman – 2016-10-17T11:03:46.097

@KhalidUsman there is no right or wrong way. It's all a matter of having an objective measure of performance, trying different techniques and determining what works and what doesn't, being careful to use a separate test set to evaluate final performance. This repo I created may be of some help: https://github.com/DiceTechJobs/ConceptualSearch it does collocation (phrase) extraction, and learns vectors over the phrases. – Simon – 2016-11-02T14:33:52.363

@KhalidUsman using Rake (if it gives good results) and then BM25 would likely work well (but make sure you can measure performance - e.g. can you use it to predict what jobs a user may apply for, or click on, based on historical data?). Re: k-NN on keywords - produce a vector of tf-idf weighted keywords / phrases for each document and compute cosine similarity. Then find the most similar documents. If you have some sort of labels, you can do some metric learning - max margin nearest neighbor, for instance. There is a Python package for metric learning in the scikit family IIRC – Simon – 2016-11-02T14:37:12.487

Do you still use SVM when you have 3 or more classes ? And what features do you want to extract using a natural language parser? For what purpose ? – user1769197 – 2014-07-08T15:40:48.113

You can train n SVMs for n classes using a one-vs-the-rest strategy. scikit-learn has code to do that automatically. Technically you need n-1 classifiers, but I've found having n works better. – Simon – 2015-05-18T19:54:36.913