Parse PDF into JSON or XML

2

I want to create a neural net that can extract specific words from a PDF document into JSON or XML. For example, let's assume I have a PDF containing information about countries, and I want to retrieve each country's name and population to obtain something like this:

<countries>
  <country>
    <name>
      France
    </name>
    <population>
      70m
    </population>
  </country>
.
.
.
</countries>

Should I build a neural net and train it myself? If so, can you recommend a good tutorial to follow? Or is there an already trained one that I can use?

H.Mateur

Posted 2018-08-18T02:01:49.167

Reputation: 21

Question was closed 2018-08-19T20:23:53.153

Answers

1

Well, unless your goal is specifically to build a neural net, this can be done in a much simpler way. For country names, for example, you can just check the text against a list of country names, and so on. At most, some NLP could give you what you want. A neural net solution might be overkill.
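The list-lookup idea can be sketched in a few lines of Python. This is only an illustration: it assumes the PDF text has already been extracted to a plain string, and `KNOWN_COUNTRIES`, `find_countries`, and the sample sentence are made up for the example.

```python
import re

# Hypothetical reference list; a real one would hold every country name.
KNOWN_COUNTRIES = {"France", "Germany", "Japan"}

def find_countries(text, known=KNOWN_COUNTRIES):
    """Return the known country names found in text, ordered by first appearance."""
    hits = []
    for name in known:
        # Word boundaries avoid matching a name inside a longer word.
        match = re.search(r"\b" + re.escape(name) + r"\b", text)
        if match:
            hits.append((match.start(), name))
    return [name for _, name in sorted(hits)]

# Made-up sample text standing in for the extracted PDF content.
sample = "France has 70m inhabitants. Japan is also mentioned here."
print(find_countries(sample))
```

A lookup like this needs no training data, which is why it is worth trying before anything heavier.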

If a neural net is compulsory, then I think you could get a better answer if you specified some details: are you looking for a fixed set of fields, what kind of text content do the PDFs contain, etc.?

Also, just in case you were thinking a neural net will give you JSON as output: that will not be the case. You would have to convert the neural net's output to JSON yourself, but that conversion is trivial, so I won't dwell on it.
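To show how small that conversion step is, here is a minimal sketch using only the Python standard library. It assumes the extraction step (whatever it is) yields one dict per country; `records`, `to_json`, and `to_xml` are hypothetical names chosen for the example.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical extractor output: one dict per country.
records = [{"name": "France", "population": "70m"}]

def to_json(records):
    """Serialize the records under a top-level "countries" key."""
    return json.dumps({"countries": records}, indent=2)

def to_xml(records):
    """Build <countries><country>...</country></countries> from the records."""
    root = ET.Element("countries")
    for record in records:
        country = ET.SubElement(root, "country")
        for field, value in record.items():
            # Each dict key becomes a child element holding the value as text.
            ET.SubElement(country, field).text = value
    return ET.tostring(root, encoding="unicode")

print(to_json(records))
print(to_xml(records))
```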

I know I have not answered your question, but I hope this gives you some direction.

Adithya Sama

Posted 2018-08-18T02:01:49.167

Reputation: 43

Thank you for your response. I picked country as an example; I am trying to get financial records from a PDF (EBIT, EBITDA, etc.). Yes, I am looking for a fixed set of fields, but the PDFs contain a lot of information that I don't need, and I know that a neural net won't give JSON as output. Is it possible to do this with NLP? Can you please give me a good tutorial for it? – H.Mateur – 2018-08-19T01:12:39.617

Since these are well-formatted documents, you don't need NLP or a NN, as you know what information is where and what it means. You can just write a parser for it if one doesn't already exist (I couldn't find any), or just use some regex to fetch the data you want. – Adithya Sama – 2018-08-20T08:56:12.607
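The regex approach from the comment above might look roughly like the sketch below. It assumes the PDF text is already available as a string and that each field appears as a label followed by a number; the `FIELDS` list, the `extract_fields` helper, and the sample figures are all invented for illustration, so the pattern would need adapting to the real layout.

```python
import re

# Hypothetical field labels; the real set depends on the documents.
FIELDS = ["EBITDA", "EBIT"]

def extract_fields(text, fields=FIELDS):
    """Return {label: value} for each label followed by a number in text."""
    results = {}
    for field in fields:
        # \b on both sides keeps "EBIT" from matching inside "EBITDA".
        pattern = r"\b" + re.escape(field) + r"\b\s*:?\s*(-?\d[\d,]*(?:\.\d+)?)"
        match = re.search(pattern, text, flags=re.IGNORECASE)
        if match:
            # Drop thousands separators before converting to a number.
            results[field] = float(match.group(1).replace(",", ""))
    return results

# Made-up sample standing in for text extracted from a financial report.
sample = "Revenue: 5,000.0\nEBITDA: 1,250.5\nEBIT 980\n"
print(extract_fields(sample))
```

Lines that carry none of the listed labels (such as the revenue line here) are simply ignored, which matches the case where the PDF holds a lot of information that is not needed.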

Yeah, I know I could do it without NLP or a NN, but they want me to do this work with AI. I'll try to use NLP and NER to extract the information. Thank you @Adhithya – H.Mateur – 2018-08-20T09:29:44.740

If you have to do it with AI, then I guess you can. But the results you get will never be better than those of a deterministic parser; think about it before you start writing code. If you are OK with the answer, can you mark it as the correct answer? – Adithya Sama – 2018-08-21T07:36:39.880

As I told you earlier, it's not me who wants to do it with AI; I was told to do it that way and I don't have a say in this matter. Anyway, thank you @Adithya Sama – H.Mateur – 2018-08-21T17:57:39.723