Note that I am doing everything in R.
The problem goes as follow:
Basically, I have a list of resumes (CVs). Some candidates will have work experience before and some don't. The goal here is to: based on the text on their CVs, I want to classify them into different job sectors. I am particular in those cases, in which the candidates do not have any experience / is a student, and I want to make a prediction to classify which job sectors this candidate will most likely belongs to after graduation .
Question 1: I know machine learning algorithms. However, I have never done NLP before. I came across Latent Dirichlet allocation on the internet. However, I am not sure if this is the best approach to tackle my problem.
My original idea: make this a supervised learning problem. Suppose we already have large amount of labelled data, meaning that we have correctly labelled the job sectors for a list of candidates. We train the model up using ML algorithms (i.e. nearest neighbor... )and feed in those unlabelled data, which are candidates that have no work experience / are students, and try to predict which job sector they will belong to.
Update Question 2: Would it be a good idea to create an text file by extracting everything in a resume and print these data out in the text file, so that each resume is associated with a text file,which contains unstructured strings, and then we applied text mining techniques to the text files and make the data become structured or even to create a frequency matrix of terms used out of the text files ? For example, the text file may look something like this:
I deployed ML algorithm in this project and... Skills: Java, Python, c++ ...
This is what I meant by 'unstructured', i.e. collapsing everything into a single line string.
Is this approach wrong ? Please correct me if you think my approach is wrong.
Question 3: The tricky part is: how to identify and extract the keywords ? Using the
tm package in R ? what algorithm is the
tm package based on ? Should I use NLP algorithms ? If yes, what algorithms should I look at ? Please point me to some good resources to look at as well.
Any ideas would be great.