How to extract important phrases (which may contain company names) from resumes?

4

1

I have thousands of CVs/resumes. We want to build a parser that can extract company names from them.

So far we have tried:

  1. Maintaining a list of words common in company names (e.g. Org, Ltd, Limited, Technologies) and using them to identify probable companies. But this list is limited, and many companies don't get extracted.

  2. Using the HTML of the CV to give a higher score to probable companies that have certain formatting features (such as bold or italics).
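The two heuristics above could be combined into a single score along these lines. This is only a minimal sketch: the keyword set (including `inc` and `corp`), the weights, and the `is_bold` / `is_italic` flags are assumptions for illustration, not values from our system.

```python
import re

# Hypothetical keyword list; only Org, Ltd, Limited and Technologies
# come from the question, the rest are illustrative additions.
COMPANY_KEYWORDS = {"org", "ltd", "limited", "technologies", "inc", "corp"}


def score_candidate(text, is_bold=False, is_italic=False):
    """Return a heuristic score for `text` being a company name."""
    tokens = re.findall(r"[a-z]+", text.lower())
    score = 0.0
    # Heuristic 1: the phrase contains a word common in company names.
    if any(tok in COMPANY_KEYWORDS for tok in tokens):
        score += 1.0
    # Heuristic 2: formatting taken from the HTML (weights are arbitrary).
    if is_bold:
        score += 0.5
    if is_italic:
        score += 0.25
    return score


print(score_candidate("Acme Technologies Ltd.", is_bold=True))
print(score_candidate("Managed a team of five"))
```

A phrase would then be accepted as a probable company when its score passes some threshold, which would have to be tuned on labelled resumes.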

A CV is not only text; we always have some structural information along with it, so there should be better ways to extract information, for example training a model that predicts which phrases in the resume are companies. We are open to any better approaches or suggestions we could incorporate into our system to improve accuracy. The precision so far is really bad (less than 45%).

We have already segmented the work experience section of each CV, so we can extract the segment containing work experience with very high precision.

We also have a comprehensive list of companies (millions of entries), although it contains duplicates and needs a significant amount of cleaning. But yes, we have a lot of data.
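One possible first cleaning pass over such a list is to group names that differ only in case, punctuation, or trailing legal suffixes. The sketch below assumes a small, hypothetical suffix set and is not what we currently run.

```python
import re

# Hypothetical set of legal suffixes to strip; extend as needed.
LEGAL_SUFFIXES = {"ltd", "limited", "inc", "llc", "corp", "corporation", "pvt", "co"}


def normalize_company(name):
    """Lowercase, drop punctuation, and strip trailing legal suffixes."""
    tokens = re.findall(r"[a-z0-9]+", name.lower())
    while tokens and tokens[-1] in LEGAL_SUFFIXES:
        tokens.pop()
    return " ".join(tokens)


def dedupe(names):
    """Group raw names by their normalized form."""
    groups = {}
    for name in names:
        key = normalize_company(name)
        if key:  # skip names that consist only of suffixes
            groups.setdefault(key, []).append(name)
    return groups


raw = ["Acme Technologies Pvt. Ltd.", "ACME TECHNOLOGIES LIMITED", "Globex Corp"]
print(dedupe(raw))
```

Exact matching on the normalized key will still miss misspellings and abbreviations, so fuzzy matching would probably be needed as a second pass.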

Edit

Other approaches we are trying: we predict the important phrases in the text using n-grams and mark them as probable companies, then search our company corpus for a match. How useful is this technique? Are there any better approaches?
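A minimal sketch of the n-gram lookup, assuming the company corpus has already been reduced to a set of lowercased names (`known_companies`, a hypothetical name) and that the longest match at each position should win:

```python
import re


def find_companies(text, known_companies, max_len=4):
    """Return phrases in `text` whose lowercased n-gram is in `known_companies`."""
    tokens = re.findall(r"[A-Za-z0-9&]+", text)
    lowered = [t.lower() for t in tokens]
    matches = []
    i = 0
    while i < len(tokens):
        # Try the longest n-gram first so "acme technologies" beats "acme".
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = " ".join(lowered[i:i + n])
            if candidate in known_companies:
                matches.append(" ".join(tokens[i:i + n]))
                i += n
                break
        else:
            i += 1  # no n-gram starting here matched
    return matches


known = {"acme technologies", "globex"}
segment = "Software Engineer, Acme Technologies (2012-2014); Analyst at Globex"
print(find_companies(segment, known))
```

Because punctuation is dropped during tokenization, an n-gram can span a comma or bracket, so a real version would probably restrict candidates to a single line or formatting block.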

khirod

Posted 2015-11-26T13:07:02.697

Reputation: 41

Define accuracy? – paparazzo – 2015-11-26T14:01:16.503

Number of correctly parsed companies / Total predicted companies by the parser – khirod – 2015-11-26T14:55:47.947

That is precision. – paparazzo – 2015-11-26T15:02:01.380
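To make the distinction concrete, here is a small sketch that computes both precision and recall for one resume. The function name and the example company lists are made up for illustration.

```python
def precision_recall(predicted, actual):
    """Compute precision and recall over sets of company names."""
    predicted, actual = set(predicted), set(actual)
    true_positives = len(predicted & actual)
    # Precision: share of predicted companies that are correct.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: share of actual companies that the parser found.
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall


predicted = ["Acme", "Globex", "Team Lead", "Java"]
actual = ["Acme", "Globex", "Initech"]
print(precision_recall(predicted, actual))
```

The ratio described in the comment above is the first value; the second also counts the companies the parser failed to extract.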

Answers

3

Sounds like you want named entity recognition. There are a variety of approaches to NER, and plenty of implementations, like the Stanford NER package.

After you find named entities, determining what the named entity refers to is called concept normalization.

jamesmf

Posted 2015-11-26T13:07:02.697

Reputation: 2 927

Thanks. But I think Stanford NER uses a CRF, which is particularly useful when you want to predict named entities from the context of surrounding words. In a resume that is not the case. – khirod – 2015-11-27T07:24:52.290

In fact, how useful will NER be when the text is not completely unstructured? Resumes have some structural information associated with them. – khirod – 2015-11-27T08:27:51.820

Resumes aren't written in complete sentences (most of the time). So standard NLP methods aren't a good match. – khirod – 2015-11-27T12:42:53.193

That's an interesting point - though if you have meaningful structure, why do you need this extraction? I'll think about the no-full-sentences hurdle – jamesmf – 2015-11-27T13:27:01.477

The problem is that the structure varies a lot. In fact, I should say that resumes are poorly structured. We have tried extracting HTML using Apache Tika: bold, italics, tables, and bullets are preserved, but the visual formatting gets mangled.

By structure I mean that you can identify visually similar information in a resume, and guess that a piece of text is in fact a company based on that visually similar structure. – khirod – 2015-11-27T13:36:03.647

A visual approach using OCR is interesting. You could also probably simply hit an entity resolution API or Freebase for each capitalized word and see if any institutions/companies come back as the first result. – jamesmf – 2015-11-27T18:18:40.590
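The capitalized-word idea could be sketched roughly as follows. The `lookup` argument is a hypothetical caller-supplied function standing in for whatever entity resolution service is used; no real API is assumed here.

```python
import re

# A run of consecutive words that each start with an uppercase letter.
CAPITALIZED_RUN = re.compile(r"\b[A-Z][\w&.-]*(?:\s+[A-Z][\w&.-]*)*")


def capitalized_candidates(line):
    """Return runs of consecutive capitalized words found in `line`."""
    return [m.group(0) for m in CAPITALIZED_RUN.finditer(line)]


def resolve_companies(line, lookup):
    """Keep only the candidates that `lookup` reports as a company."""
    return [c for c in capitalized_candidates(line) if lookup(c)]


line = "worked as Senior Developer at Acme Technologies from 2012"
print(capitalized_candidates(line))
# Stand-in lookup backed by a local set instead of a remote service.
print(resolve_companies(line, lambda name: name in {"Acme Technologies"}))
```

Job titles are capitalized too, so the lookup step carries most of the burden of telling companies apart from other phrases.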

0

Have you tried the XML package in R? In a similar question on Stack Overflow, the most upvoted answer suggested using some packages for that.

Here: https://stackoverflow.com/questions/3195522/is-there-a-simple-way-in-r-to-extract-only-the-text-elements-of-an-html-page

Here you find further instructions: https://stackoverflow.com/questions/1844829/how-can-i-read-and-parse-the-contents-of-a-webpage-in-r

Gustavo Silva

Posted 2015-11-26T13:07:02.697

Reputation: 1

The HTML is obtained from Apache Tika. And it varies from resume to resume. There is no fixed structure of HTML. – khirod – 2015-11-26T14:57:12.573