I have thousands of CVs/resumes, and we want to build a parser that can extract company names from them.
So far we have tried:
- Maintaining a list of words that commonly appear in company names (e.g. Org, Ltd, Limited, Technologies) and using them to identify probable companies. The list is limited, though, and many companies are missed.
- Using the HTML of the CV to give a higher score to probable companies that have certain formatting features (such as bold or italics).
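To make the two heuristics above concrete, here is a minimal sketch of how they can be combined, using only the Python standard library. The keyword set, the weights (1.0 and 0.5), and the names `CandidateScorer` and `score_candidates` are illustrative assumptions, not our actual implementation.

```python
from html.parser import HTMLParser

# Hypothetical keyword list; a real one would be much longer.
COMPANY_KEYWORDS = {"org", "ltd", "limited", "technologies", "inc", "corp"}
EMPHASIS_TAGS = {"b", "strong", "i", "em"}


class CandidateScorer(HTMLParser):
    """Collects text chunks and scores each as a probable company name."""

    def __init__(self):
        super().__init__()
        self.emphasis_depth = 0
        self.candidates = []  # list of (text, score) tuples

    def handle_starttag(self, tag, attrs):
        if tag in EMPHASIS_TAGS:
            self.emphasis_depth += 1

    def handle_endtag(self, tag):
        if tag in EMPHASIS_TAGS and self.emphasis_depth > 0:
            self.emphasis_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        words = {w.strip(".,").lower() for w in text.split()}
        score = 0.0
        if words & COMPANY_KEYWORDS:
            score += 1.0  # assumed weight for a keyword hit
        if self.emphasis_depth > 0:
            score += 0.5  # assumed weight for bold/italic text
        self.candidates.append((text, score))


def score_candidates(html):
    """Return (text, score) pairs for every text chunk in the HTML."""
    parser = CandidateScorer()
    parser.feed(html)
    parser.close()
    return parser.candidates
```

For example, `score_candidates("<p><b>Acme Technologies Ltd.</b> Software Engineer</p>")` scores the bold chunk higher than the job title that follows it.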
A CV is not only text; it always comes with some structural information, so there should be better ways to extract this information, perhaps by training a model that predicts the companies mentioned in a resume. We are open to any approaches or suggestions we could incorporate into our system to improve accuracy. Precision so far is really bad (less than 45%).
We have already segmented the work experience section of each CV, so we can extract the segment containing work experience with very high precision.
We also have a comprehensive list of companies (millions of entries), although it contains duplicates and needs a significant amount of cleaning. Still, we have a lot of data.
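As an illustration of the kind of cleaning the company list needs, the sketch below groups raw names under a normalized key so that obvious duplicates collapse together. The suffix set and the function names `normalize_company` and `dedupe_companies` are assumptions made for this example; real data would likely need more rules.

```python
import re
from collections import defaultdict

# Hypothetical legal suffixes to drop when building a matching key.
LEGAL_SUFFIXES = {"ltd", "limited", "inc", "llc", "corp", "corporation", "pvt", "co"}


def normalize_company(name):
    """Lowercase, replace punctuation with spaces, drop trailing legal suffixes."""
    tokens = re.sub(r"[^a-z0-9& ]+", " ", name.lower()).split()
    # Keep at least one token so a name like "Limited" is not erased entirely.
    while len(tokens) > 1 and tokens[-1] in LEGAL_SUFFIXES:
        tokens.pop()
    return " ".join(tokens)


def dedupe_companies(names):
    """Group raw names by normalized key; returns {key: [raw variants]}."""
    groups = defaultdict(list)
    for name in names:
        key = normalize_company(name)
        if key:
            groups[key].append(name)
    return dict(groups)
```

With this, "Acme Technologies Pvt. Ltd." and "ACME Technologies Limited" both map to the key `acme technologies`.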
Another approach we are trying: we predict the important phrases in the text using n-grams and mark them as probable companies, then search our company corpus for a match. How useful is this technique? Are there any better approaches?
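For reference, the n-gram lookup can be sketched roughly as follows. This is a simplified illustration rather than our production code: the function name `find_company_mentions`, the `max_n=4` window, and the case-insensitive exact match are all assumptions. It prefers the longest n-gram starting at each token, so "Acme Technologies" wins over "Acme".

```python
import re


def find_company_mentions(text, known_companies, max_n=4):
    """Return n-grams from text that match a known company, longest first."""
    known = {c.lower() for c in known_companies}
    tokens = re.findall(r"[A-Za-z0-9&]+", text)
    matches = []
    i = 0
    while i < len(tokens):
        # Try the longest window first, then shrink it.
        for n in range(min(max_n, len(tokens) - i), 0, -1):
            gram = " ".join(tokens[i:i + n])
            if gram.lower() in known:
                matches.append(gram)
                i += n  # skip past the matched tokens
                break
        else:
            i += 1  # no match starting at this token
    return matches
```

Because the lookup is a set membership test, it stays cheap even with millions of company names, but it only finds exact matches against the corpus.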